Overview
(a) Existing visual navigation foundation models (NFMs) process monocular inputs and predict actions without any intermediate vision modules. (b) StereoWalker improves this paradigm by incorporating stereo inputs and mid-level vision modules such as depth estimation and dense point tracking. (c) Compared with CityWalker, the current state of the art, StereoWalker achieves higher training efficiency and improved navigation accuracy while using only 1.5% of the training data.
Abstract
The success of foundation models in language and vision has motivated research on fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (e.g., tracking, depth estimation) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, and the depth-scale ambiguity of monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support the training of StereoWalker and to facilitate future research. In our experiments, we find that mid-level vision enables StereoWalker to match state-of-the-art performance using only 1.5% of the training data and to surpass the state of the art when trained on the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
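For context, on rectified stereo pairs the disparity d of a pixel gives metric depth through the standard relation Z = f · B / d, where f is the focal length and B is the stereo baseline; this known baseline is what removes the scale ambiguity inherent to monocular depth.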
Method
While prior visual navigation models compress each frame into a single DINOv2 [CLS] token, StereoWalker retains all patch tokens to preserve the fine-grained spatial structure that is critical for control. The rationale is simple: accurate navigation demands richer visual perception than a global summary can provide.
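To make the distinction concrete, below is a minimal sketch (not the released StereoWalker code) of extracting dense patch tokens from a DINOv2 backbone rather than only its global [CLS] summary; the backbone variant and input size are illustrative choices.

import torch

# Minimal sketch: DINOv2 exposes both a global [CLS] token and dense patch
# tokens; StereoWalker-style models keep the latter. Backbone and image size
# here are illustrative, not the exact configuration used in the paper.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

frames = torch.randn(2, 3, 224, 224)  # toy batch of RGB frames
with torch.no_grad():
    feats = backbone.forward_features(frames)

cls_token = feats["x_norm_clstoken"]        # (B, C): one vector per frame
patch_tokens = feats["x_norm_patchtokens"]  # (B, N, C): one vector per 14x14 patch
print(cls_token.shape, patch_tokens.shape)  # (2, 768) and (2, 256, 768) for 224x224 input

For a 224×224 input, dinov2_vitb14 yields 256 patch tokens of dimension 768 instead of a single 768-dimensional vector, which is why the downstream decoder must also change (see Results).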
As shown in the figure above, given a short temporal window of rectified stereo (or monocular) frames 𝒱_{t-N+1:t} and their corresponding positions 𝒫_{t-N+1:t}, the model forms dense mid-level tokens that jointly encode appearance, geometry, and short-term motion cues. Tokens from all frames are then processed by three stages:
(i) tracking-guided attention to maintain temporal correspondence and reduce drift,
(ii) global attention to integrate scene context across views, and
(iii) target-token attention to focus prediction on goal-relevant regions.
StereoWalker supports both stereo and monocular inputs with the same architecture, differing only in tokenization.
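To make the three stages concrete, the following is a minimal PyTorch sketch of one way this token pipeline could be realized; the tensor layout, the use of standard multi-head attention for every stage, and the learned target token are our assumptions, not the exact released architecture.

import torch
import torch.nn as nn

class ThreeStageAttention(nn.Module):
    """Sketch of the three attention stages described above, assuming tokens
    of shape (B, T, N, C) for T frames with N mid-level tokens each."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.track_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.target_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.target_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, T, N, C = tokens.shape

        # (i) tracking-guided attention: attend along time for each token index,
        # assuming dense tracking has aligned token indices across frames.
        x = tokens.permute(0, 2, 1, 3).reshape(B * N, T, C)
        x = x + self.track_attn(x, x, x, need_weights=False)[0]
        x = x.reshape(B, N, T, C).permute(0, 2, 1, 3)

        # (ii) global attention: integrate scene context across all frames and views.
        x = x.reshape(B, T * N, C)
        x = x + self.global_attn(x, x, x, need_weights=False)[0]

        # (iii) target-token attention: a learned query pools goal-relevant
        # features that a downstream action head could consume.
        q = self.target_token.expand(B, -1, -1)
        goal_feat, _ = self.target_attn(q, x, x, need_weights=False)
        return goal_feat.squeeze(1)  # (B, C)

The residual connections around each stage are our choice for training stability; the actual model may organize these blocks differently.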
Results
Analysis of Mid-level Vision
Our ablation analysis evaluates different architectural configurations on the CityWalker teleoperation benchmark. All variants are trained on the same monocular dataset, with specific components selectively enabled or disabled for a fair comparison. Earlier baselines such as ViNT, GNM, NoMaD, and CityWalker represent each image with only a single [CLS] token. In contrast, we observe that using all patch tokens to capture finer spatial information leads to an immediate and significant reduction in Mean Angular Orientation Error (MAOE).
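For reference, the snippet below shows how a mean angular orientation error could be computed between predicted and ground-truth waypoint directions; this is our assumed reading of the metric, and the benchmark's exact definition may differ.

import torch

def mean_angular_orientation_error(pred_waypoints: torch.Tensor,
                                   gt_waypoints: torch.Tensor) -> torch.Tensor:
    # Assumed MAOE: mean absolute angle (in degrees) between predicted and
    # ground-truth 2D waypoint displacement vectors of shape (B, 2).
    pred_angle = torch.atan2(pred_waypoints[..., 1], pred_waypoints[..., 0])
    gt_angle = torch.atan2(gt_waypoints[..., 1], gt_waypoints[..., 0])
    diff = pred_angle - gt_angle
    # Wrap the difference to (-pi, pi] so headings near +/-180 degrees compare correctly.
    diff = torch.atan2(torch.sin(diff), torch.cos(diff))
    return torch.rad2deg(diff.abs().mean())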
Building upon this representation, we observe that incorporating depth and dense pixel tracking further enhances navigation accuracy, as these two mid-level cues provide complementary inductive signals. Depth captures the three-dimensional structure, yielding a substantial reduction in MAOE relative to the patch token model. Tracking encodes scene motion and temporal consistency, and incorporating tracking on top of patch tokens and depth provides additional performance gains.
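As one concrete illustration, the sketch below pools per-pixel depth and 2D track-displacement maps onto the patch grid and adds linear projections of them to the patch tokens; this particular fusion scheme is an assumption made for exposition, not necessarily the paper's exact design.

import torch
import torch.nn as nn

class MidLevelFusion(nn.Module):
    # Assumed fusion: average-pool dense depth and tracking maps onto the
    # DINOv2 patch grid, then add linear projections to the patch tokens.

    def __init__(self, dim: int = 768, patch: int = 14):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=patch, stride=patch)
        self.depth_proj = nn.Linear(1, dim)   # per-patch depth -> token space
        self.track_proj = nn.Linear(2, dim)   # per-patch (dx, dy) -> token space

    def forward(self, patch_tokens, depth, tracks):
        # patch_tokens: (B, N, C); depth: (B, 1, H, W); tracks: (B, 2, H, W)
        d = self.pool(depth).flatten(2).transpose(1, 2)   # (B, N, 1)
        t = self.pool(tracks).flatten(2).transpose(1, 2)  # (B, N, 2)
        return patch_tokens + self.depth_proj(d) + self.track_proj(t)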
Prior studies have demonstrated similar advantages of mid-level vision in controlled or static environments. Our experiments on large-scale dynamic urban navigation show that explicitly modeling depth and motion significantly improves robustness and effectiveness in real-world conditions. The fine-tuned StereoWalker shows substantial improvements across the Forward, Left turn, and Right turn scenarios.
Analysis of Training Efficiency
Beyond performance gains, we observe that incorporating mid-level vision significantly accelerates training. We train our model on monocular data using only a fraction of the original dataset and achieve comparable performance with merely 1.5% of CityWalker's training data. Across all training-data budgets, we use the same training settings for both CityWalker and our model.
Enabling patch tokens introduces richer visual representations but also necessitates architectural modifications to the decoder, as our model no longer relies on a single [CLS] token representation. Consequently, our model with patch tokens alone requires additional training time to match CityWalker's performance. However, once depth cues are injected, the model quickly outperforms CityWalker.
Further incorporating both depth and tracking information leads to faster convergence and superior performance, surpassing CityWalker trained with over 2,000 hours of monocular videos. This demonstrates that mid-level vision not only enhances representation quality but also provides strong inductive biases that make training more data- and time-efficient.
Visualization
BibTeX
@article{zhou2025empoweringdynamicurbannavigation,
title = {Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision},
author = {Zhou, Wentao and Chen, Xuweiyi and Rajagopal, Vignesh and Chen, Jeffrey and Chandra, Rohan and Cheng, Zezhou},
journal = {arXiv preprint arXiv:2512.10956},
year = {2025}
}