
By the end of this lesson, students should be able to:
Explain why video is a fundamentally different substrate for reasoning than static images
Describe the diffusion denoising process well enough to follow how reasoning emerges within it
Contrast the Chain-of-Frames (CoF) hypothesis with the Chain-of-Steps (CoS) mechanism and articulate the evidence for CoS
Identify the three emergent reasoning behaviors in video diffusion models (working memory, self-correction, perception before action) and draw parallels to LLM reasoning phenomena
Describe how Diffusion Transformer layers specialize into perceptual, reasoning, and consolidation roles
Explain how the training-free latent ensemble strategy serves as a proof-of-concept for exploiting these mechanisms
Recognize the limitations of current methodologies for evaluating vision-language models
Work on your final project
YouTube video for Demystifing Video Reasoning
Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang (2026) Demystifing Video Reasoning. https://arxiv.org/abs/2603.16870
Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg (2025) Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning. https://arxiv.org/abs/2506.00318
Mohammad Asadi, Jack W. O'Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley (2026) MIRAGE: The Illusion of Visual Understanding. https://arxiv.org/abs/2603.21687