Video generation models have demonstrated emergent reasoning capabilities like solving mazes, puzzles, and physical reasoning tasks without task-specific training.
Yet despite growing interest, we lack a basic understanding of how such reasoning emerges during generation and how reliably we can elicit it.
We study the internal planning dynamics of video models using maze-solving as a controlled testbed. We discover
early plan commitment: video diffusion models
commit to a high-level motion plan within the first few denoising steps,
after which further denoising alters visual details but not the underlying
trajectory.
We exploit this to build ChEaP (Chaining with Early Planning), improving accuracy from
7% to 67% on long mazes and by 2.5× overall on hard tasks.
Video Models Can Reason
Given just an image of a maze and a text prompt, off-the-shelf video diffusion models can successfully generate solutions to complex mazes across a range of diverse visual settings [1].
Frozen Lake Prompt
Animate the elf moving step by step toward the gift while carefully avoiding the icy frozen lake. Highlight the successful path and end with the elf touching the gift. There are no changes to the layout of the maze. No new lakes or characters appear. Static camera. No zoom. No pan. No glitches, noise, or artifacts.
VR-Bench Maze Prompt (Skin 1)
Create a 2D animation based on the provided image of a maze. The red circle slides smoothly along the white square path, stopping perfectly on the green square. The red circle never slides or crosses into the light blue square areas of the maze. The camera is a static, top-down view showing the entire maze.
Maze: The maze paths are white square, the walls are light blue square. The red circle moves to the goal position, represented by green square. The red circle slides smoothly along the white square path. The red circle never slides or crosses into the light blue square areas of the maze. The red circle stops perfectly on the green square.
Scene: No change in scene composition. No change in the layout of the maze. The red circle travels along the white square path without speeding up or slowing down.
Camera: Static camera. No zoom. No pan. No glitches, noise, or artifacts.
VR-Bench Maze Prompt (Skin 2)
Create a 2D animation based on the provided image of a maze. The white rabbit slides smoothly along the green grass tiles path, stopping perfectly on the orange carrots. The white rabbit never slides or crosses into the gray rock areas of the maze. The camera is a static, top-down view showing the entire maze.
Maze: The maze paths are green grass tiles, the walls are gray rock. The white rabbit moves to the goal position, represented by orange carrots. The white rabbit slides smoothly along the green grass tiles path. The white rabbit never slides or crosses into the gray rock areas of the maze. The white rabbit stops perfectly on the orange carrots.
Scene: No change in scene composition. No change in the layout of the maze. The white rabbit travels along the green grass tiles path without speeding up or slowing down.
Camera: Static camera. No zoom. No pan. No glitches, noise, or artifacts.
VR-Bench Maze Prompt (Skin 4)
Create a 2D animation based on the provided image of a maze. The anime schoolgirl slides smoothly along the wooden floor tiles path, stopping perfectly on the green square. The anime schoolgirl never slides or crosses into the gray stone wall areas of the maze. The camera is a static, top-down view showing the entire maze.
Maze: The maze paths are wooden floor tiles, the walls are gray stone wall. The anime schoolgirl moves to the goal position, represented by green square. The anime schoolgirl slides smoothly along the wooden floor tiles path. The anime schoolgirl never slides or crosses into the gray stone wall areas of the maze. The anime schoolgirl stops perfectly on the green square.
Scene: No change in scene composition. No change in the layout of the maze. The anime schoolgirl travels along the wooden floor tiles path without speeding up or slowing down.
Camera: Static camera. No zoom. No pan. No glitches, noise, or artifacts.
But success is unreliable. Most random seeds produce failed trajectories, and performance degrades
rapidly with maze complexity. Standard best-of-N sampling helps, but it wastes massive compute by fully denoising every candidate. Can we understand how these models reason internally, and use that understanding to elicit better performance?
The Plan Is Decided Early
By decoding intermediate predictions during denoising, we discover that the model's motion trajectory
is already committed within the first few steps. Later steps refine
visual fidelity but almost never change the underlying route.
Decoded intermediate predictions at denoising steps 1, 2, 5, 10, 20, and 40 (final) of a 40-step schedule.
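Concretely, under a rectified-flow convention x_t = (1 - t) * x0 + t * eps, the velocity prediction at any step yields a clean-sample estimate x0_hat = x_t - t * v that can be decoded and inspected. A minimal sketch of this probe, with the model and VAE interfaces assumed rather than taken from our codebase:

    import torch

    @torch.no_grad()
    def denoise_with_snapshots(model, vae, noise, timesteps, cond):
        """Euler sampler for rectified flow that also decodes the model's
        clean-sample estimate at every step (interfaces are assumed)."""
        x, snapshots = noise, []
        for t, t_next in zip(timesteps[:-1], timesteps[1:]):  # e.g. 1.0 -> 0.0
            v = model(x, t, cond)                 # predicted velocity
            x0_hat = x - t * v                    # clean estimate under x_t = (1-t)*x0 + t*eps
            snapshots.append(vae.decode(x0_hat))  # early-plan video at this step
            x = x + (t_next - t) * v              # Euler update toward t = 0
        return x, snapshots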
We quantify this by measuring how similar the trajectory at each intermediate step is to the final output. On 4×4 mazes, trajectories are 93% converged by step 5 of 40: the route is essentially decided, and the remaining 35 steps merely refine rendering quality. This holds across all maze sizes.
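As a concrete illustration (a simplified stand-in, not the full metric), trajectory similarity can be computed as the Jaccard overlap between the sets of maze cells the agent visits in two decoded videos, with locate_agent as a hypothetical frame-to-cell detector:

    def visited_cells(frames, locate_agent):
        """Cells the agent occupies across a decoded video; `locate_agent`
        (hypothetical) maps a frame to the agent's (row, col) grid cell."""
        return {locate_agent(f) for f in frames}

    def trajectory_similarity(frames_a, frames_b, locate_agent):
        """Jaccard overlap of visited cells: 1.0 means identical routes."""
        a = visited_cells(frames_a, locate_agent)
        b = visited_cells(frames_b, locate_agent)
        return len(a & b) / len(a | b)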
Trajectory similarity to the final output at each denoising step, across maze sizes.
Trajectory diversity from refinement vs. different seeds. The dashed line shows cross-seed diversity.
A natural follow-up: can we get diverse trajectories by refining the same seed?
Refinement, i.e., re-noising an intermediate prediction and continuing denoising, is a common technique for increasing sample diversity in flow matching. We find it produces at most 25% trajectory diversity, compared to 68%
between entirely different seeds (dashed line). The trajectory is encoded in the initial noise,
and refinement cannot change it.
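Under the same flow convention as above, refinement amounts to blending the clean estimate with fresh noise at an intermediate time s and resuming the sampler; a sketch, with model(x, t, cond) again an assumed interface:

    import torch

    def refine(model, x0_hat, s, steps, cond):
        """Re-noise a clean estimate back to time s with fresh noise, then
        resume Euler denoising down to t = 0 (refinement baseline sketch)."""
        x = (1 - s) * x0_hat + s * torch.randn_like(x0_hat)  # re-noise to time s
        ts = torch.linspace(s, 0.0, steps + 1)
        for t, t_next in zip(ts[:-1], ts[1:]):
            x = x + (t_next - t) * model(x, t, cond)          # Euler update
        return x

Only the fresh-noise term varies between refinements of the same estimate, which is evidently too little to dislodge the committed route.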
If early trajectories predict final success, then inference-time compute is better spent screening many candidate plans cheaply than fully denoising a few. This insight motivates Early Planning Beam Search (EPBS): we partially denoise a large pool of candidates for just a few steps, decode their early trajectory plans, and score them with a lightweight verifier. Only the top-K candidates are fully denoised; the rest are discarded, saving the vast majority of compute. Under a fixed budget of function evaluations, EPBS screens far more candidate trajectories than standard best-of-N: for example, 73 candidates vs. 10 at 400 NFEs.
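A minimal sketch of the EPBS loop, with sample_step, decode_plan, and verifier standing in for the real sampler, decoder, and verifier (the defaults mirror the 400-NFE example, not a prescribed configuration):

    import torch

    def epbs(sample_step, decode_plan, verifier, latent_shape,
             pool_size=73, top_k=1, screen_steps=5, total_steps=40):
        """Early Planning Beam Search (sketch, assumed interfaces).
        sample_step(x, i) advances latents by one denoising step."""
        pool = [torch.randn(latent_shape) for _ in range(pool_size)]
        # 1) Cheap screening: only a few denoising steps per candidate.
        for i in range(screen_steps):
            pool = [sample_step(x, i) for x in pool]
        # 2) Rank candidates by the verifier's score on their decoded early plan.
        ranked = sorted(pool, key=lambda x: verifier(decode_plan(x)), reverse=True)
        # 3) Fully denoise only the top-K survivors.
        finished = []
        for x in ranked[:top_k]:
            for i in range(screen_steps, total_steps):
                x = sample_step(x, i)
            finished.append(x)
        return finished

One accounting consistent with the numbers above: at 400 NFEs on a 40-step schedule, best-of-N affords 10 full generations, while screening 73 seeds for 5 steps each costs 365 NFEs and leaves exactly 35 to finish a single survivor.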
Screening Diverse Candidate Plans
Different noise seeds produce strikingly different motion plans on the same maze.
Failed paths appear as gray silhouettes; the successful trajectory is shown in vivid color.
4×4 Maze
6×6 Maze
8×8 Maze
Each video above visualizes decoded early predictions from multiple noise seeds on the same maze.
This is exactly what EPBS does at inference time: it reveals early trajectory plans and uses a
lightweight verifier to promote the best one. Only the top-scoring candidates are fully denoised—the
rest are discarded after just a few steps.
EPBS consistently outperforms best-of-N sampling, matching its accuracy with 3.3× fewer function evaluations.
What Makes a Maze Hard?
Path length—not obstacle density—is the dominant predictor of difficulty.
Models reliably solve short paths but face a sharp cliff beyond 10–12 steps.
EPBS success rate drops sharply once the path exceeds the single-generation horizon.
8×8, Short Path (3 moves) Solved
6×6, Long Path (10 moves) Failed
The 8×8 maze on the left has a short 3-move solution and is solved cleanly. The 6×6 maze on the right
requires 10 moves—exceeding what the model can fit in a single generation. Rather than producing a
valid partial trajectory, it “cheats” by moving the gift closer.
This motivates decomposing long-horizon tasks into shorter sub-problems.
Chaining for Long-Horizon Reasoning
When a maze exceeds the model's horizon, we chain sequential generations—each
picking up where the last left off. Together with EPBS, this forms
ChEaP (Chaining with Early Planning).
Chaining reconditions on the last valid frame to extend reasoning beyond the single-generation horizon.
After each generation, we extract the agent's furthest valid position and use that frame
as the next starting condition. On mazes with path lengths of 10–13, chaining boosts
accuracy from 7% to 67%, demonstrating that the model possesses strong local planning
ability that was previously hidden by the generation length bottleneck.
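The outer loop is simple; a sketch with generate (one EPBS generation conditioned on an image), furthest_valid_frame, and is_solved as hypothetical helpers, and max_rounds chosen arbitrarily:

    def chain(generate, furthest_valid_frame, is_solved, maze_image, max_rounds=4):
        """ChEaP outer loop (sketch): re-condition each new generation on the
        furthest valid frame of the previous one. Helpers are hypothetical."""
        cond = maze_image
        for _ in range(max_rounds):
            video = generate(cond)              # one EPBS generation
            if is_solved(video):
                return video
            cond = furthest_valid_frame(video)  # resume from progress so far
        return video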
ChEaP provides the largest gains where EPBS is bottlenecked by the generation horizon.
Chained 6×6 Solved
Chained 8×8 Solved
VR-Bench Solved
Gallery
Successes, diverse trajectories, and failure cases across Frozen Lake and VR-Bench.
Successes
4×4 Solved
6×6 Solved
8×8 Solved
10×10 Solved
Trajectory Exploration
4×4 · Lake 20%
6×6 · Lake 65%
8×8 · Lake 80%
10×10 · Lake 65%
VR-Bench
Maze Skin 1 Solved
Maze Skin 3 Solved
Trapfield 2 Solved
Trapfield 4 Solved
Failure Cases
Lake Entry Failed
Wrong Path Failed
Out of Time Failed
10×10 Lake Entry Failed
References
[1] Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. "Video Models are Zero-Shot Learners and Reasoners." arXiv preprint arXiv:2509.20328, 2025.
BibTeX
@misc{newman2026videomodelsreasonearly,
  title={Video Models Reason Early: Exploiting Plan Commitment for Maze Solving},
  author={Kaleb Newman and Tyler Zhu and Olga Russakovsky},
  year={2026},
  eprint={2603.30043},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.30043},
}