Module: Vision-Language Models
Segment: From Images to Video Reasoning
Duration: ~50 minutes
Format: Lecture with discussion
Prerequisites: CLIP, LLaVA, ViT patch embeddings, basic exposure to diffusion models
Primary Source: Wang et al., "Demystifying Video Reasoning," arXiv:2603.16870, March 2026
By the end of this lesson, students should be able to:
Explain why video is a fundamentally different substrate for reasoning than static images
Describe the diffusion denoising process well enough to follow how reasoning emerges within it
Contrast the Chain-of-Frames (CoF) hypothesis with the Chain-of-Steps (CoS) mechanism and articulate the evidence for CoS
Identify the three emergent reasoning behaviors in video diffusion models (working memory, self-correction, perception before action) and draw parallels to LLM reasoning phenomena
Describe how Diffusion Transformer layers specialize into perceptual, reasoning, and consolidation roles
Understand the training-free latent ensemble strategy as a proof-of-concept for exploiting these mechanisms
Key idea: We have built up the machinery for turning a single image into tokens and reasoning over it. Video extends this into the temporal dimension, which introduces both new structure and new challenges.
In CLIP, a Vision Transformer (ViT) splits an image into patches, projects each patch into an embedding, and learns a joint image-text representation space via contrastive learning.
In LLaVA, image patch embeddings are projected into the language model's token space and interleaved with text tokens, enabling the LM to "see" and reason over visual content.
These approaches treat each image as a bag of spatial tokens. The model sees one frozen moment.
A video is a sequence of frames, each of which can be tokenized the same way. But the temporal dimension introduces continuity, causality, and dynamics: objects move, occlude each other, interact, and transform over time.
Naive approach: tokenize each frame independently and concatenate. This produces an enormous token sequence (e.g., 24 frames × 256 patches = 6,144 tokens for a one-second clip) and gives the model no structural bias toward temporal coherence.
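The token-count arithmetic above is worth making concrete. A back-of-envelope sketch, assuming a 24 fps clip and a ViT-style 16×16 patch grid (256 patches per frame):

```python
# Back-of-envelope token count for naive per-frame tokenization.
# Assumed values match the example above: 24 frames for one second
# of video, and 16x16 = 256 spatial patches per frame.
frames_per_second = 24
patches_per_frame = 16 * 16  # 256 spatial patches

tokens_per_second = frames_per_second * patches_per_frame
print(tokens_per_second)  # 6144 tokens for a one-second clip
```

A few seconds of video already pushes past typical LM context budgets, which is one practical reason the field moved to compressed latent representations.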
Modern video models don't work this way. Instead, the dominant paradigm is diffusion-based video generation, where the model learns to produce entire videos by iteratively denoising a latent representation of the full clip. This is where today's lesson begins.
Video provides a spatiotemporally consistent environment: objects persist, physics roughly applies, cause precedes effect. These are exactly the priors that reasoning requires.
Recent work has discovered that diffusion-based video models — trained only to generate video, with no explicit reasoning objective — exhibit non-trivial reasoning capabilities: solving mazes, playing tic-tac-toe, predicting physical trajectories, and completing visual patterns.
The central question: how does reasoning emerge in these models? Today's paper provides the first systematic answer.
Key idea: Diffusion models generate data by learning to reverse a noise-addition process. Understanding this iterative denoising is essential because the paper's core finding is that reasoning happens along the denoising steps, not across frames.
Start with a clean data sample x₀ (in our case, the latent representation of a video).
Gradually add Gaussian noise over a schedule of steps, producing progressively noisier versions x₁, x₂, ... until the signal is completely destroyed and the result is pure noise.
Mathematically, with flow matching (the formulation used by the models in this paper): x_s = (1 − s)x₀ + s·x₁ where x₁ ~ N(0, I) is noise and s goes from 0 (clean) to 1 (pure noise).
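The forward (noising) interpolation above is a one-liner in code. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, s):
    """Flow-matching forward process: linear interpolation between
    the clean latent x0 (s = 0) and pure Gaussian noise x1 (s = 1)."""
    x1 = rng.standard_normal(x0.shape)
    return (1 - s) * x0 + s * x1

x0 = np.ones((4, 4))
print(np.allclose(add_noise(x0, 0.0), x0))  # True: s = 0 leaves x0 untouched
```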
A neural network learns to predict, at each step, how to remove the noise: given a noisy x_s, estimate the velocity field v_θ(x_s, s, c) conditioned on a prompt c.
Generation works by starting from pure noise and iteratively applying the learned denoiser: each step moves the latent a little closer to a clean, coherent video.
The estimated clean state at any step can be computed as x̂₀ = x_s − s · v_θ(x_s, s, c). This is the model's current "best guess" at what the final video looks like. Crucially, this estimate can be decoded and visualized at every step, giving us a window into the model's intermediate reasoning.
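The full reverse loop, with the per-step x̂₀ readout, can be sketched end to end. Here `v_theta` is a hypothetical stand-in for the learned velocity network (the real one is a 14B-parameter DiT); it is given access to a known clean target so the loop actually runs:

```python
import numpy as np

rng = np.random.default_rng(0)

def v_theta(x_s, s, c=None):
    # Hypothetical stand-in for the learned velocity network.
    # A real DiT predicts v ~ x1 - x0; here we cheat with a known
    # clean target so the sketch is runnable end to end.
    x0_true = np.ones_like(x_s)
    return (x_s - x0_true) / max(s, 1e-8)

def sample(num_steps=10, shape=(4, 4), c=None):
    x = rng.standard_normal(shape)       # start from pure noise (s = 1)
    for i in range(num_steps):
        s = 1.0 - i / num_steps          # current noise level
        v = v_theta(x, s, c)
        x0_hat = x - s * v               # per-step "best guess" at x0;
        #                                  decoding this is the window
        #                                  into intermediate reasoning
        s_next = 1.0 - (i + 1) / num_steps
        x = x + (s_next - s) * v         # Euler step toward s_next
    return x
```

Because the toy velocity is exact, `sample()` lands precisely on the clean target; with a learned network, each `x0_hat` is an evolving estimate that the later sections will treat as the model's visible "thoughts."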
The denoising process is not just cleaning up noise. Each step is a computational pass through a large neural network (in this case, a Diffusion Transformer with ~14 billion parameters). The model has the opportunity to revise its decisions at every step.
Think of it this way: an autoregressive language model gets one forward pass per output token, but a diffusion model gets T denoising steps to produce an entire video, and at each step it can see and revise the full output.
This creates a unique regime for emergent computation that has no direct analogue in autoregressive text generation.
Key idea: The backbone of modern video generation models is the Diffusion Transformer — a transformer that operates on latent video tokens during each denoising step.
The video is first encoded into a compressed latent space by a Variational Autoencoder (VAE), reducing spatial and temporal resolution.
The latent is then patchified: divided into spatiotemporal patches, each projected into a token embedding. This is directly analogous to the ViT patch embedding from our CLIP/LLaVA lessons, but extended to 3D (space + time).
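The 3D patchify step is a pure reshape. A sketch with hypothetical toy sizes (T=8 latent frames, 32×32 spatial, 16 channels, patch size 2×4×4):

```python
import numpy as np

# Hypothetical toy sizes, not the real model's: latent video of
# T=8 frames, H=W=32, C=16 channels, spatiotemporal patch (2, 4, 4).
T, H, W, C = 8, 32, 32, 16
pt, ph, pw = 2, 4, 4

latent = np.zeros((T, H, W, C))

# Split into non-overlapping 3D patches, then flatten each into a token.
tokens = (latent
          .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
          .transpose(0, 2, 4, 1, 3, 5, 6)   # gather the patch dims together
          .reshape(-1, pt * ph * pw * C))   # (num_tokens, patch_dim)

print(tokens.shape)  # (256, 512): 4*8*8 tokens, each 2*4*4*16 dims
```

The only difference from 2D ViT patchification is the extra temporal factor `pt`; in practice a learned linear projection then maps each flattened patch to the embedding dimension.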
A transformer processes these tokens with full bidirectional attention — every token can attend to every other token across all frames and all spatial locations. This is a critical difference from autoregressive LLMs, which can only attend to preceding tokens.
The model in the paper is VBVR-Wan2.2, fine-tuned from Wan2.2-I2V-A14B: a 14-billion-parameter image-to-video DiT with 40 transformer layers and embedding dimension 5,120.
Text prompts are encoded and injected into the transformer via cross-attention, guiding the denoising process. An input image conditions the first frame.
Classifier-free guidance (CFG) is used at inference: the model runs two forward passes per step (one conditioned, one unconditional) and interpolates, strengthening adherence to the prompt.
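The CFG combination rule is a simple extrapolation. A minimal sketch, assuming (as is standard) that the guidance is applied to the two velocity predictions:

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. guidance_scale > 1
    strengthens adherence to the prompt; 1.0 recovers v_cond."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)

v_c = np.array([1.0, 2.0])  # toy conditional prediction
v_u = np.array([0.0, 0.0])  # toy unconditional prediction
print(cfg_velocity(v_c, v_u, 1.0))  # [1. 2.]
print(cfg_velocity(v_c, v_u, 5.0))  # [ 5. 10.]
```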
Key idea: Before this paper, the leading explanation for video reasoning was that it works like a chain of thought — but across frames.
Prior work observed that video models solving reasoning tasks (e.g., maze navigation) appear to show progressive solution development across the video's temporal axis: early frames show the starting state, middle frames show intermediate progress, and the final frame shows the answer.
This led to the Chain-of-Frames (CoF) analogy: just as an LLM reasons step-by-step through tokens, a video model reasons step-by-step through frames.
The analogy is appealing because it maps directly onto Chain-of-Thought from the LLM literature. But is it correct?
The DiT uses bidirectional attention over all frames simultaneously. Unlike an autoregressive model, there is no causal constraint forcing frame N+1 to depend only on frames ≤ N.
What looks like temporal progression in the final video may actually be a post-hoc rendering of a solution that was worked out during the denoising process, not across frames.
The paper provides direct evidence against CoF through both qualitative observation and controlled perturbation experiments.
Key idea: Reasoning in video diffusion models primarily emerges along the denoising steps, not across frames. The model explores multiple candidate solutions early and progressively converges to a final answer.
By decoding the estimated clean latent x̂₀ at each denoising step (not each frame, but each iteration of the denoiser), the authors can watch the model "think."
What they observe is striking: in early denoising steps, the model entertains multiple possibilities simultaneously. As denoising progresses, it prunes alternatives and converges to a single solution.
They term this Chain-of-Steps (CoS): reasoning unfolds along the diffusion trajectory, not the temporal axis of the video.
Multi-Path Exploration — In tasks involving navigation or discrete choices, the model explicitly generates multiple candidate solutions in early steps, then prunes the incorrect ones:
Maze solving: early steps show a "probabilistic cloud" of multiple plausible paths; later steps suppress incorrect routes, converging on the correct one.
Tic-tac-toe: early steps simultaneously highlight multiple candidate cells for a winning move before committing.
Robot navigation: both upper and lower routes through a maze are visible in early steps; one gradually becomes dominant.
Object movement: four candidate trajectories for placing a plant on a shelf collapse to one correct position.
This resembles Breadth-First Search or Tree-of-Thoughts — but it arises naturally from the diffusion process, without any explicit search algorithm.
Superposition-based Exploration — In tasks involving spatial transformation or pattern completion, the model overlays mutually exclusive hypotheses:
Size pattern completion (large-medium-small cycle): early steps show overlapping circles of different sizes, representing competing hypotheses about the correct continuation.
Object rotation: rather than committing to one angle, early steps show a blurred superposition of several candidate orientations.
This mode is reminiscent of quantum superposition — multiple states coexist until the denoising process "collapses" the representation to a single outcome.
The authors provide quantitative evidence via controlled noise injection:
Noise at Step: inject Gaussian noise into all frames at a single denoising step. Result: performance collapses from 0.685 to below 0.3. Reasoning is destroyed.
Noise at Frame: inject Gaussian noise into a single frame across all denoising steps. Result: much smaller performance drop. The model recovers the corrupted frame via bidirectional attention from neighboring frames.
This asymmetry is the smoking gun: the diffusion step dimension carries the reasoning, not the frame dimension.
Further analysis with CKA dissimilarity shows that perturbations in early steps propagate through the entire trajectory, while perturbations in later steps have limited impact. Sensitivity peaks around steps 20–30, when the model has committed to a reasoning trajectory and disruptions can derail a nearly-finalized solution.
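The two perturbation protocols differ only in which axis the noise is injected along. A schematic sketch (the `denoise_step` placeholder is hypothetical toy dynamics standing in for one pass of the real DiT):

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_with_perturbation(latent, num_steps, mode, target, sigma=1.0):
    # Schematic of the paper's two perturbation protocols.
    def denoise_step(x):
        return 0.9 * x  # hypothetical toy dynamics, not the actual model

    for step in range(num_steps):
        if mode == "noise_at_step" and step == target:
            # corrupt ALL frames at one single denoising step
            latent = latent + sigma * rng.standard_normal(latent.shape)
        elif mode == "noise_at_frame":
            # corrupt ONE frame at EVERY denoising step
            latent[target] = latent[target] + \
                sigma * rng.standard_normal(latent[target].shape)
        latent = denoise_step(latent)
    return latent

# Toy latent: 8 frames x 4 latent dims, zeros so corruption is visible
out_step = denoise_with_perturbation(np.zeros((8, 4)), 20, "noise_at_step", target=10)
out_frame = denoise_with_perturbation(np.zeros((8, 4)), 20, "noise_at_frame", target=3)
```

In the real experiment, the finding is that the step-axis corruption destroys reasoning while the frame-axis corruption is largely repaired by bidirectional attention.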
Key idea: Video diffusion models exhibit reasoning behaviors strikingly parallel to those discovered in LLMs — but arising from a completely different architecture and training objective.
Reasoning requires maintaining a persistent state across computation steps. The diffusion process naturally establishes "anchors" that preserve critical information.
Example — object reappearance: when asked to move an object out of frame and back, the model preserves the object's original position throughout the denoising process, enabling consistent return.
Example — occlusion handling: when a large teddy bear is moved across a smaller one, early denoising steps retain the state of the occluded bear, ensuring it reappears correctly. This demonstrates a form of object permanence.
Parallel to LLMs: This is analogous to how LLMs maintain context in their hidden states across token positions — but here it operates across denoising steps rather than sequence positions.
The model can recover from incorrect intermediate solutions. The authors observe "aha moments" where the model initially selects a wrong option but revises its reasoning after a few more denoising steps.
Example — ball bounce prediction: early steps produce an incomplete, ambiguous trajectory; later steps gradually complete and correct it, converging from four candidate landing points to one.
Example — 3D shape rotation: initial steps generate cubes with incorrect quantities and arrangements; subsequent steps correct both count and spatial configuration.
Parallel to LLMs: This is functionally analogous to the internal backtracking and self-correction observed in long-thinking LLMs (e.g., o1-style "wait, let me reconsider" reasoning). The key difference: in video models, corrections happen globally across all frames simultaneously within a single denoising step, providing strong evidence against frame-sequential reasoning.
The model follows a consistent two-phase protocol: early denoising steps establish what and where (identify the relevant objects), while later steps determine how and why (execute reasoning and manipulation).
Example — "get the car running": early steps identify and localize the car; later steps introduce motion and simulate physical interaction.
Example — "correct the incorrect parts of the house": early steps identify the door as the target object; later steps manipulate it.
Parallel to LLMs: This echoes the "let me first understand the problem" preamble that reasoning LLMs often produce before solving, and the perception-action divide in embodied AI systems.
The paper draws a striking analogy to neuroscience: when a rat plans a path to food, researchers observe multiple simulated trajectories being rolled out in the hippocampus during the planning phase before the animal moves. The diffusion model's multi-path exploration during early denoising steps may be performing an analogous form of latent simulation.
Key idea: Within a single denoising step, different transformer layers serve distinct computational roles — and this specialization emerges from training, not from architectural constraints.
The authors visualize the L2 norm of hidden states (a proxy for "activation energy") at each layer, for each spatiotemporal token.
Early layers (0–9): attend primarily to global structures and background context.
Middle layers (~10–29): attention shifts to foreground entities specified in the prompt. Reasoning-related features emerge — activations correlate with object motion and interactions.
Late layers (30–39): consolidate the latent representation, preparing it for the next denoising step.
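The activation-energy probe itself is a one-line reduction over collected hidden states. A sketch, assuming the states have been gathered (e.g. via forward hooks on the DiT blocks, a hypothetical setup):

```python
import numpy as np

def layer_activation_energy(hidden_states):
    """hidden_states: (num_layers, num_tokens, dim), e.g. collected with
    forward hooks on the DiT blocks (hypothetical collection setup).
    Returns the per-layer, per-token L2 norm -- the 'activation energy'
    proxy used to see which tokens each layer is working on."""
    return np.linalg.norm(hidden_states, axis=-1)  # (num_layers, num_tokens)

h = np.ones((40, 16, 8))          # toy: 40 layers, 16 tokens, dim 8
energy = layer_activation_energy(h)
print(energy.shape)               # (40, 16)
```

Heat-mapping `energy` over the spatial token grid, layer by layer, is what reveals the background → foreground → consolidation shift described above.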
To prove this isn't just correlational, the authors run a causal experiment: during a controlled object recognition task, they swap the latent representation at a single layer between two different inputs (one with cats, one with bicycles).
Swapping at layer 20–21 causes the model's output to completely flip — e.g., it circles cats instead of bicycles.
This confirms that middle layers encode the semantically decisive information that governs the reasoning outcome.
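The swap intervention is easy to express abstractly. A minimal sketch with a hypothetical interface (toy "layers" replace real DiT blocks so the trace is easy to follow):

```python
def forward_with_swap(model_layers, x_a, x_b, swap_layer):
    # Sketch of the latent-swapping probe (hypothetical interface):
    # run two inputs in parallel and exchange their hidden states
    # after one chosen layer, then let both runs continue.
    h_a, h_b = x_a, x_b
    for i, layer in enumerate(model_layers):
        h_a, h_b = layer(h_a), layer(h_b)
        if i == swap_layer:
            h_a, h_b = h_b, h_a  # the decisive swap
    return h_a, h_b

# Toy "layers" (each adds 1) make the effect of the swap visible:
layers = [lambda x: x + 1 for _ in range(4)]
print(forward_with_swap(layers, 0, 100, swap_layer=1))  # (104, 4)
```

If the swapped layer carries the semantically decisive information, run A's output now reflects run B's input, which is exactly the flip the authors observe at layers 20–21.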
The DiT self-organizes into a three-stage pipeline within each denoising step:
Perceive (early layers): build a scene representation
Reason (middle layers): execute the logical operation
Consolidate (late layers): prepare the updated latent
This is reminiscent of how different brain regions specialize for perception vs. executive function — but here it emerges purely from the training objective of video generation.
Key idea: Understanding how reasoning works enables practical improvements. The authors demonstrate a simple proof-of-concept that improves reasoning performance without any additional training.
Since the model naturally explores multiple reasoning paths in early denoising steps, we can encourage richer exploration by running multiple copies of the model with different random seeds and combining their intermediate representations.
Different seeds will explore different candidate solutions. Averaging their latent representations in the reasoning-active layers effectively implements a "vote" across diverse reasoning trajectories.
Run three independent forward passes of the same model with different initial noise seeds.
At the first denoising step (starting from pure noise), extract hidden representations from transformer layers 20–29 (the reasoning-active middle layers identified in the layer-specialization analysis).
Average the latent representations across the three runs within this layer range.
Continue denoising normally from the averaged latent.
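The steps above can be sketched as follows. `model_step` is a hypothetical callable returning per-layer hidden states for one denoising step; a real implementation would hook into the DiT:

```python
import numpy as np

def ensemble_denoise(model_step, x_inits, blend_layers=range(20, 30)):
    """Training-free latent ensemble, sketched under assumptions:
    `model_step` is a hypothetical callable returning a dict of
    per-layer hidden states for one denoising step."""
    # 1) One forward pass per random seed at the first denoising step
    all_hidden = [model_step(x) for x in x_inits]

    # 2) Average hidden states across seeds in the reasoning-active layers
    merged = dict(all_hidden[0])
    for layer in blend_layers:
        merged[layer] = np.mean([h[layer] for h in all_hidden], axis=0)
    return merged  # 3) continue denoising normally from the merged state

# Toy stand-in: "hidden state" at every layer is just the input latent
toy_step = lambda x: {layer: x for layer in range(40)}
seeds = [np.full((2, 2), v) for v in (0.0, 1.0, 2.0)]
out = ensemble_denoise(toy_step, seeds)
print(out[25])  # inside the blend range: averaged across seeds (all 1.0)
print(out[5])   # outside the blend range: first seed's state (all 0.0)
```

Averaging only the middle layers is the design choice that matters: it mixes reasoning trajectories while leaving perceptual and consolidation features from a single run intact.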
This is inspired by Model Soup (Wortsman et al., 2022), which showed that averaging model weights within the same optimization basin produces better generalization. Here, the idea is applied to latent trajectories rather than model parameters.
Evaluated on VBVR-Bench, a comprehensive video reasoning benchmark covering abstract reasoning, knowledge, perception, spatial reasoning, and transformation tasks.
The ensemble improves overall in-domain score from 0.685 → 0.716 and out-of-domain from 0.610 → 0.650 over the strong VBVR-Wan2.2 baseline.
Notably, the largest gains appear on out-of-domain tasks, suggesting the ensemble helps the model generalize its reasoning rather than just memorizing solutions.
For context, the best proprietary model (Sora 2) scores 0.546 in-domain; the fine-tuned VBVR-Wan2.2 already far exceeds all proprietary models, and the ensemble pushes it further.
This is a training-free improvement — no new data, no fine-tuning, no architecture changes. It works purely by exploiting the CoS mechanism.
It validates the paper's theoretical contributions: understanding the mechanism enabled a practical intervention.
It suggests a broader principle: inference-time strategies that work with the model's natural reasoning dynamics (rather than against them) can yield gains.
The paper positions video generation as a "next-generation substrate for machine intelligence" — not just a content creation tool, but a medium in which models can reason about physical and logical problems.
Unlike text-based reasoning, video reasoning operates in a continuous, spatiotemporally grounded space. This may enable forms of spatial and physical reasoning that are difficult to express in language.
Chain-of-Steps vs. Chain-of-Thought: CoS in video diffusion is a visual analogue to CoT in LLMs. Both involve iterative refinement toward a solution. But CoS is implicit (emerges from denoising dynamics) while CoT is explicit (produced as text tokens). This connects back to our chain-of-thought reasoning module and the idea of "latent reasoning" (Coconut, reasoning in hidden states).
Emergent capabilities: Just as LLMs exhibit emergent abilities at scale (in-context learning, CoT, self-correction), video models exhibit emergent reasoning without being explicitly trained for it. The parallel suggests these may be general properties of large models trained on structured data.
From perception to reasoning: Our progression from CLIP (alignment) → LLaVA (visual QA) → video reasoning traces an arc from static recognition to dynamic understanding to active problem-solving.
If reasoning primarily happens along denoising steps, could we improve it by simply using more steps? Is there a "test-time compute" analogue for diffusion models?
The multi-path exploration in early denoising steps is reminiscent of beam search in text generation. Could we implement explicit search strategies over the latent space?
What are the limits of this kind of reasoning? The benchmark tasks (mazes, tic-tac-toe, pattern completion) are still relatively simple. Can video models tackle problems that require longer chains of logical inference?
How does this relate to world models in reinforcement learning? If a video model can "simulate" multiple futures and select the best one, is it functioning as a world model?
Title slide — Visual Reasoning over Video: Chain-of-Steps in Diffusion Models
Bridge — From image tokens (CLIP/LLaVA) to video: what changes
Diffusion review — Forward/reverse process, the denoising trajectory
DiT architecture — Spatiotemporal patches, bidirectional attention, 14B parameters
The CoF hypothesis — Prior assumption: reasoning unfolds frame-by-frame
CoS discovery — Reasoning unfolds along denoising steps; decoded intermediate latents
Multi-path exploration — Maze solving, tic-tac-toe, robot navigation examples
Superposition exploration — Pattern completion, rotation examples
Perturbation evidence — Noise-at-step vs. noise-at-frame; CKA dissimilarity
Emergent behavior 1: Working memory — Object permanence across denoising steps
Emergent behavior 2: Self-correction — "Aha moments," ball bounce, 3D rotation
Emergent behavior 3: Perception before action — Two-phase protocol, neuroscience parallel
Layer specialization — Perceive → Reason → Consolidate pipeline within each step
Latent swapping experiment — Causal evidence for middle-layer reasoning
Training-free ensemble — Multi-seed latent averaging in reasoning layers
Results — VBVR-Bench scores, comparison with proprietary models
The bigger picture — Video as reasoning substrate, connections to CoT and world models
Discussion questions — Open problems and future directions