Lesson Plan: Visual Reasoning over Video — Chain-of-Steps in Diffusion Models

Module: Vision-Language Models
Segment: From Images to Video Reasoning
Duration: ~50 minutes
Format: Lecture with discussion
Prerequisites: CLIP, LLaVA, ViT patch embeddings, basic exposure to diffusion models
Primary Source: Wang et al., "Demystifying Video Reasoning," arXiv:2603.16870, March 2026


Learning Goals

By the end of this lesson, students should be able to:


1. Bridge: From Image Tokens to Video (5 min)

Key idea: We have built up the machinery for turning a single image into tokens and reasoning over it. Video extends this into the temporal dimension, which introduces both new structure and new challenges.

Quick Recap

What Changes with Video

Why Video for Reasoning?


2. Diffusion Review: The Denoising Process (8 min)

Key idea: Diffusion models generate data by learning to reverse a noise-addition process. Understanding this iterative denoising is essential because the paper's core finding is that reasoning happens along the denoising steps, not across frames.

The Forward Process (Adding Noise)

The Reverse Process (Denoising = Generation)

Why This Matters for Reasoning
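The forward and reverse processes can be sketched in a few lines. This is a generic DDPM-style toy (the schedule and shapes are illustrative assumptions, not the paper's model); a real sampler iterates step by step with a learned noise predictor, but the one-shot inversion below shows the core relationship between signal, noise, and the schedule:

```python
import numpy as np

# Toy DDPM-style schedule (hypothetical values, not the paper's model).
T = 50                                   # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # signal fraction remaining at step t

def add_noise(x0, t, rng):
    """Forward process, sampled in one shot:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return x_t, eps

def predict_x0(x_t, t, eps_hat):
    """Invert the forward process given a noise estimate. In real sampling
    a network predicts eps_hat and the chain walks back one step at a time;
    with a perfect estimate the inversion recovers x0 exactly."""
    return (x_t - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                     # a toy "latent video frame"
x_T, eps = add_noise(x0, T - 1, rng)     # heavily noised latent
x0_hat = predict_x0(x_T, T - 1, eps)     # perfect denoising recovers x0
```

The key point for this lecture: generation is a *trajectory* of latents across steps, and that trajectory is where the paper locates the reasoning.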


3. The Diffusion Transformer (DiT) Architecture (5 min)

Key idea: The backbone of modern video generation models is the Diffusion Transformer — a transformer that operates on latent video tokens during each denoising step.

Architecture Overview

Conditioning
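A minimal NumPy sketch of one DiT-style block, assuming the simplest possible conditioning (adding a step embedding to every token; real DiTs typically use adaLN modulation and stack many such blocks). The point is the shape of the computation: flattened spatiotemporal tokens, bidirectional attention across space *and* time, conditioned on the denoising step:

```python
import numpy as np

rng = np.random.default_rng(0)
T_frames, H, W, d = 4, 2, 3, 8           # 4 frames, 2x3 patches each, dim 8
n_tokens = T_frames * H * W              # 24 spatiotemporal tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dit_block(tokens, t_emb, Wq, Wk, Wv):
    """One attention block: every token attends to every other token
    (no causal mask), conditioned on the current denoising step."""
    x = tokens + t_emb                   # crude conditioning: add step embedding
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) # full bidirectional attention
    return tokens + attn @ v             # residual connection

tokens = rng.standard_normal((n_tokens, d))
t_emb = rng.standard_normal((1, d))      # embedding of the current step s
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = dit_block(tokens, t_emb, Wq, Wk, Wv)
```

Because attention is bidirectional over the whole clip, nothing in the architecture forces frame-by-frame computation — a fact that matters for the CoF-vs-CoS debate in the next two sections.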


4. The Prevailing Hypothesis: Chain-of-Frames (3 min)

Key idea: Before this paper, the leading explanation for video reasoning was that it works like a chain of thought — but across frames.

The CoF Hypothesis (Tong et al., NeurIPS 2025)

Why CoF Is Misleading


5. Chain-of-Steps: Reasoning Along Denoising (8 min)

Key idea: Reasoning in video diffusion models primarily emerges along the denoising steps, not across frames. The model explores multiple candidate solutions early and progressively converges to a final answer.

The Core Discovery

Two Modes of Step-wise Exploration

Multi-Path Exploration — In tasks involving navigation or discrete choices, the model explicitly generates multiple candidate solutions in early denoising steps, then progressively eliminates all but one:

This resembles Breadth-First Search or Tree-of-Thoughts — but it arises naturally from the diffusion process, without any explicit search algorithm.
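A toy analogue of this generate-then-prune dynamic (all details here are illustrative assumptions, not the paper's setup): sample several candidate paths up front, then eliminate the worst until one survives. In the diffusion model, crucially, this happens without any explicit search loop:

```python
import numpy as np

rng = np.random.default_rng(0)
goal = np.array([3, 3])

def random_path(steps=6):
    """A candidate solution: a sequence of unit moves on a grid."""
    moves = rng.integers(0, 2, size=(steps, 2))   # 0/1 steps in x and y
    return np.cumsum(moves, axis=0)

def score(path):
    return -np.linalg.norm(path[-1] - goal)       # closer endpoint = better

# "Early steps": several candidates coexist.
candidates = [random_path() for _ in range(4)]

# "Later steps": progressively eliminate the worst until one remains.
while len(candidates) > 1:
    candidates.sort(key=score)                    # worst candidate first
    candidates.pop(0)

best = candidates[0]
```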

Superposition-based Exploration — In tasks involving spatial transformation or pattern completion, the model overlays mutually exclusive hypotheses:

This mode is reminiscent of quantum superposition — multiple states coexist until the denoising process "collapses" the representation to a single outcome.
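The overlay-then-collapse dynamic can be caricatured numerically (a toy illustration, not the model's actual mechanism): start from an equal mixture of two mutually exclusive patterns, then let each "step" pull the state toward whichever hypothesis it is currently closer to:

```python
import numpy as np

A = np.array([1.0, 0.0, 1.0, 0.0])        # hypothesis 1
B = np.array([0.0, 1.0, 0.0, 1.0])        # hypothesis 2 (mutually exclusive)

x = 0.5 * A + 0.5 * B                     # early step: both patterns overlaid
x[0] += 0.01                              # a tiny asymmetry breaks the tie

for _ in range(20):                       # later steps: progressively commit
    target = A if np.dot(x, A) >= np.dot(x, B) else B
    x = 0.8 * x + 0.2 * target            # move toward the favored hypothesis
```

After the loop, `x` has "collapsed" onto a single hypothesis, mirroring how late denoising steps resolve the superposed representation into one outcome.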

Perturbation Evidence

The authors provide quantitative evidence via controlled noise injection:

This asymmetry is the smoking gun: the diffusion step dimension carries the reasoning, not the frame dimension.

Further analysis with CKA dissimilarity shows that perturbations injected at early steps propagate through the entire trajectory, while perturbations at later steps have limited impact. Sensitivity peaks around steps 20–30: by then the model has committed to a reasoning trajectory, so a disruption there can derail a nearly-finalized solution.
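Linear CKA itself is simple to compute. A minimal sketch on toy data (the activation matrices here are hypothetical stand-ins for clean vs. perturbed latents): similarity of 1 means the two representations match up to linear transformation, and lower values mean the perturbation changed the trajectory:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (samples x features):
    CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), after centering."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
clean = rng.standard_normal((64, 16))     # stand-in for an unperturbed run
rescaled = clean * 2.0                    # same representation, new scale
perturbed = clean + 3.0 * rng.standard_normal((64, 16))

cka_same = linear_cka(clean, rescaled)    # 1.0: CKA is scale-invariant
cka_pert = linear_cka(clean, perturbed)   # well below 1.0
```

CKA *dissimilarity* (1 minus this value) is what the authors track across steps to localize where perturbations matter.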


6. Emergent Reasoning Behaviors (8 min)

Key idea: Video diffusion models exhibit reasoning behaviors strikingly parallel to those discovered in LLMs — but arising from a completely different architecture and training objective.

6.1 Working Memory

Parallel to LLMs: This is analogous to how LLMs maintain context in their hidden states across token positions — but here it operates across denoising steps rather than sequence positions.

6.2 Self-Correction and Enhancement

Parallel to LLMs: This is functionally analogous to the internal backtracking and self-correction observed in long-thinking LLMs (e.g., o1-style "wait, let me reconsider" reasoning). The key difference: in video models, corrections happen globally across all frames simultaneously within a single denoising step, providing strong evidence against frame-sequential reasoning.

6.3 Perception Before Action

Parallel to LLMs: This echoes the "let me first understand the problem" preamble that reasoning LLMs often produce before solving, and the perception-action divide in embodied AI systems.

A Biological Parallel

The paper draws a striking analogy to neuroscience: when a rat plans a path to food, researchers observe multiple simulated trajectories being rolled out in the hippocampus during the planning phase before the animal moves. The diffusion model's multi-path exploration during early denoising steps may be performing an analogous form of latent simulation.


7. Layer-wise Analysis: Functional Specialization in DiTs (5 min)

Key idea: Within a single denoising step, different transformer layers serve distinct computational roles — and this specialization emerges from training, not from architectural constraints.

Token-Level Activation Analysis

Latent Swapping Experiment

The Takeaway

The DiT self-organizes into a three-stage pipeline within each denoising step:

  1. Perceive (early layers): build a scene representation

  2. Reason (middle layers): execute the logical operation

  3. Consolidate (late layers): prepare the updated latent

This is reminiscent of how different brain regions specialize for perception vs. executive function — but here it emerges purely from the training objective of video generation.
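The logic of the latent swapping probe can be sketched with a toy three-stage pipeline (every function below is a hypothetical stand-in, not the DiT's layers): capture the middle-layer state from one input, inject it into a run on a different input, and check that the output follows the swapped-in reasoning state:

```python
import numpy as np

# Toy perceive -> reason -> consolidate pipeline (illustrative maps only).
def perceive(x):     return x * 2.0      # "early layers": scene representation
def reason(h):       return h + 10.0     # "middle layers": logical operation
def consolidate(h):  return h / 2.0      # "late layers": prepare output latent

def forward(x, inject_mid=None):
    """Run the pipeline, optionally overwriting the middle-layer latent."""
    h = perceive(x)
    h = reason(h) if inject_mid is None else inject_mid
    return consolidate(h)

a, b = np.array([1.0]), np.array([5.0])
mid_b = reason(perceive(b))              # capture B's middle-layer latent
swapped = forward(a, inject_mid=mid_b)   # run on A, but with B's reasoning state
```

If the middle layers really carry the reasoning, the swapped run should produce B's answer despite receiving A's input — which is exactly the kind of causal evidence the paper's experiment looks for.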


8. Training-Free Ensemble: Exploiting CoS for Better Reasoning (5 min)

Key idea: Understanding how reasoning works enables practical improvements. The authors demonstrate a simple proof-of-concept that improves reasoning performance without any additional training.

The Intuition

The Method

  1. Run three independent forward passes of the same model with different initial noise seeds.

  2. At the first denoising step (s=0), extract hidden representations from transformer layers 20–29 (the reasoning-active middle layers identified in Section 7).

  3. Average the latent representations across the three runs within this layer range.

  4. Continue denoising normally from the averaged latent.

This is inspired by Model Soup (Wortsman et al., 2022), which showed that averaging model weights within the same optimization basin produces better generalization. Here, the idea is applied to latent trajectories rather than model parameters.
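The four steps above reduce to a few lines of orchestration. The sketch below uses a toy stand-in for the DiT's reasoning layers (`toy_reasoning_layers` and the shapes are assumptions for illustration; the real method extracts hidden states from layers 20–29 of a large video DiT at step s=0):

```python
import numpy as np

make_rng = np.random.default_rng

def toy_reasoning_layers(latent):
    """Hypothetical stand-in for the reasoning-active middle layers."""
    return np.tanh(latent)               # any deterministic map works here

def ensemble_first_step(n_seeds=3, shape=(8, 8)):
    # 1. Independent runs of the same model with different noise seeds.
    latents = [make_rng(seed).standard_normal(shape) for seed in range(n_seeds)]
    # 2. Extract mid-layer hidden states at the first denoising step (s=0).
    hidden = [toy_reasoning_layers(z) for z in latents]
    # 3. Average the latent representations across the runs.
    fused = np.mean(hidden, axis=0)
    # 4. A real pipeline would now continue denoising from `fused`.
    return fused

fused = ensemble_first_step()
```

The averaging happens in latent space rather than weight space, which is the twist on Model Soup: the three runs share one set of parameters but explore different trajectories, and fusing them early keeps the shared, consistent parts of the candidate solutions.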

Results

Why This Matters


9. The Bigger Picture and Open Questions (3 min)

Video as a Reasoning Substrate

Connections to the Course

Open Questions for Discussion


Slide Deck Outline (for reference)

  1. Title slide — Visual Reasoning over Video: Chain-of-Steps in Diffusion Models

  2. Bridge — From image tokens (CLIP/LLaVA) to video: what changes

  3. Diffusion review — Forward/reverse process, the denoising trajectory

  4. DiT architecture — Spatiotemporal patches, bidirectional attention, 14B parameters

  5. The CoF hypothesis — Prior assumption: reasoning unfolds frame-by-frame

  6. CoS discovery — Reasoning unfolds along denoising steps; decoded intermediate latents

  7. Multi-path exploration — Maze solving, tic-tac-toe, robot navigation examples

  8. Superposition exploration — Pattern completion, rotation examples

  9. Perturbation evidence — Noise-at-step vs. noise-at-frame; CKA dissimilarity

  10. Emergent behavior 1: Working memory — Object permanence across denoising steps

  11. Emergent behavior 2: Self-correction — "Aha moments," ball bounce, 3D rotation

  12. Emergent behavior 3: Perception before action — Two-phase protocol, neuroscience parallel

  13. Layer specialization — Perceive → Reason → Consolidate pipeline within each step

  14. Latent swapping experiment — Causal evidence for middle-layer reasoning

  15. Training-free ensemble — Multi-seed latent averaging in reasoning layers

  16. Results — VBVR-Bench scores, comparison with proprietary models

  17. The bigger picture — Video as reasoning substrate, connections to CoT and world models

  18. Discussion questions — Open problems and future directions