Princeton University
Video diffusion models commit to a high-level motion plan within the first few denoising steps when solving mazes. We exploit this early plan commitment to build ChEaP, which screens diverse candidate trajectories early and chains them together to solve mazes far beyond the single-generation horizon, improving accuracy from 7% to 67% on long mazes and by 2.5× overall on hard tasks.
By decoding intermediate predictions during denoising, we find that the model's motion trajectory is already committed within the first few steps. The remaining steps refine visual details but almost never change the underlying route.
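One common way to obtain such an intermediate prediction is the one-step clean-sample estimate. The expression below is a sketch that assumes a DDPM-style noise-prediction parameterization, which the text above does not specify; the symbols $x_t$ (noisy latent at step $t$), $\epsilon_\theta$ (predicted noise), and $\bar{\alpha}_t$ (cumulative noise schedule) are the usual ones for that setting, not names taken from this work. Other parameterizations, such as velocity or flow-matching models, have an analogous estimate.

```latex
% Assumed forward process: x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
% Solving for x_0 with the model's noise estimate gives the intermediate prediction:
\hat{x}_0^{(t)} = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```

Decoding $\hat{x}_0^{(t)}$ at an early step gives a blurry but inspectable preview of the video, which is the kind of signal in which a committed route could be read off.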
Because early plans predict final outcomes, we screen many candidates cheaply and fully render only the most promising ones. Different noise seeds produce strikingly different trajectories on the same maze: failed paths appear as gray silhouettes, and the successful trajectory appears in vivid color.
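The screen-then-render idea can be illustrated with a minimal Python sketch. It is not the ChEaP implementation: `toy_early_score` and `toy_full_render` are hypothetical stand-ins for "score the plan decoded after a few denoising steps" and "run the full denoising schedule", and the top-`keep` selection rule is an assumption made for illustration.

```python
import random
from typing import Callable, List, Tuple


def toy_early_score(seed: int) -> float:
    """Hypothetical stand-in: score a plan decoded after a few denoising steps."""
    return random.Random(seed).random()


def screen_then_render(
    seeds: List[int],
    early_score: Callable[[int], float],
    full_render: Callable[[int], str],
    keep: int,
) -> List[Tuple[int, str]]:
    """Rank seeds by the cheap early score and fully render only the top `keep`."""
    ranked = sorted(seeds, key=early_score, reverse=True)
    return [(seed, full_render(seed)) for seed in ranked[:keep]]


# Record which seeds reach the expensive stage, to show the savings.
render_calls: List[int] = []


def toy_full_render(seed: int) -> str:
    """Hypothetical stand-in for the expensive full denoising run."""
    render_calls.append(seed)
    return f"video_from_seed_{seed}"


candidate_seeds = list(range(16))
results = screen_then_render(candidate_seeds, toy_early_score, toy_full_render, keep=2)
```

Here 16 candidate seeds are scored cheaply, but only 2 reach the expensive rendering stage. That is where the savings would come from, provided the early score is a reliable proxy for the final outcome.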
@article{newman2025videomodelsreason,
title = {Video Models Reason Early: Exploiting Plan
Commitment for Maze Solving},
author = {Newman, Kaleb and Zhu, Tyler and
Russakovsky, Olga},
journal = {arXiv preprint},
year = {2026}
}