Topic 15: Visual Reasoning

LLMs watching film

Learning Goals

By the end of this lesson, students should be able to:

Explain why video is a fundamentally different substrate for reasoning than static images
Describe the diffusion denoising process well enough to follow how reasoning emerges within it
Contrast the Chain-of-Frames (CoF) hypothesis with the Chain-of-Steps (CoS) mechanism and articulate the evidence for CoS
Identify the three emergent reasoning behaviors in video diffusion models (working memory, self-correction, perception before action) and draw parallels to LLM reasoning phenomena
Describe how Diffusion Transformer layers specialize into perceptual, reasoning, and consolidation roles
Explain how the training-free latent ensemble strategy serves as a proof-of-concept for exploiting these mechanisms
Identify the limitations of current methodologies for evaluating vision-language models

Tasks

Work on your final project

Lesson Plan and Slides

YouTube video for Demystifing Video Reasoning

Visual reasoning lesson plan

Visual reasoning slides

Mirage lesson plan

Mirage slides

Papers

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang (2026) Demystifing Video Reasoning. https://arxiv.org/abs/2603.16870

Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg (2025) Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning. https://arxiv.org/abs/2506.00318

Mohammad Asadi, Jack W. O'Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley (2026) MIRAGE: The Illusion of Visual Understanding. https://arxiv.org/abs/2603.21687