Survey Frames Video MLLMs as Watch, Remember, Reason
A new arXiv survey organizes human-view video understanding with multimodal large language models (MLLMs) around three core abilities: watching, remembering, and reasoning. It also covers egocentric applications and the open challenges in perception, memory, and faithful inference.