Research Abstract:
This project explores a potential approach to a challenge in autonomous driving: integrating predictive capabilities into perception-based driving systems. While existing end-to-end architectures appear effective for reactive decision-making, they may lack capabilities to anticipate future scene evolution. We propose a framework that aims to augment the SparseDrive architecture with a self-supervised Latent World Model (LWM) trained on DINOv2 embeddings, potentially enabling more temporally coherent scene understanding.
Our approach investigates learning action-conditioned dynamics in a latent space without requiring additional annotations. Preliminary evaluation on the NuScenes-Mini dataset suggests notable improvements: initial results indicate approximately a 62% reduction in collision rate, 40% fewer tracking ID switches, and 14% lower trajectory error compared to our baseline implementation. These early findings suggest that learned latent dynamics models can enhance safety-critical metrics in autonomous driving systems, a promising direction for bridging perception and control in embodied intelligence that warrants further investigation.
This repository aims to extend SparseDrive -- a sparse, end-to-end autonomous driving framework -- by exploring the integration of a Latent World Model (LWM) branch inspired by LAW (Enhancing End-to-End Autonomous Driving with Latent World Model, ICLR 2025).
Rather than training a visual encoder from scratch, we experiment with a frozen DINOv2 backbone with the goal of leveraging its potentially rich, geometry-aware features.
The model attempts to predict the evolution of scene representations in latent space, conditioned on ego actions, with the aim of potentially improving temporal consistency and planning robustness without requiring extra supervision.
Many traditional end-to-end driving pipelines appear to operate primarily in a reactive manner: processing each frame independently, with the planner responding primarily to the current state.
Our proposed Latent World Model approach aims to introduce more predictive reasoning -- attempting to learn how the latent scene state might change given the vehicle's motion.
Our proposed formulation:

$$\hat{z}_{t+1} = f_{\theta}(z_t, u_t)$$

where
- $z_t$: latent embedding extracted from DINOv2 features,
- $u_t$: ego action or egomotion (steering, throttle, brake, $\Delta$pose),
- $f_{\theta}$: latent dynamics network that attempts to predict the next latent state.

During training, we encourage the model to align its prediction $\hat{z}_{t+1}$ with the frozen target latent $z_{t+1}^{\text{tgt}}$ from the next frame.
We hypothesize that this predictive supervision helps the planner develop more temporally consistent representations. By learning the dynamics of scene evolution in latent space, the model may anticipate future states and make more informed planning decisions under uncertainty, addressing what we believe is a limitation of many current end-to-end driving systems.
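As an illustration, the action-conditioned dynamics $f_{\theta}$ can be sketched as a small residual MLP. This is a minimal sketch, not the repository's exact module: the 1024-dimensional latent matches DINOv2-large, but the 4-dimensional action vector and the hidden width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Sketch of f_theta: predicts z_{t+1} from (z_t, u_t)."""

    def __init__(self, latent_dim=1024, action_dim=4, hidden_dim=1024):
        super().__init__()
        self.inp = nn.Linear(latent_dim + action_dim, hidden_dim)
        self.block = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.out = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z_t, u_t):
        # Condition the latent on the ego action
        h = self.inp(torch.cat([z_t, u_t], dim=-1))
        h = h + self.block(h)  # residual connection with layer norm
        # Predict a delta so the identity transition is easy to learn
        return z_t + self.out(h)

z_t = torch.randn(2, 1024)  # batch of DINOv2 latents
u_t = torch.randn(2, 4)     # ego actions (steering, throttle, brake, delta-pose)
z_next = LatentDynamics()(z_t, u_t)
```

Predicting a residual on top of $z_t$ (rather than the next latent directly) is a common design choice for latent dynamics models, since consecutive frames are highly correlated.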
- Integration of a LAW-style latent world model into SparseDrive's Stage-2 planning pipeline.
- DINOv2-based latent encoder providing strong pretrained semantics.
- Action-conditioned latent dynamics that model how visual states evolve.
- Auxiliary world-model loss improving temporal consistency and robustness.
- Empirical gains in tracking stability and planning safety on mini-NuScenes.
- Stage-1: Standard SparseDrive perception pretraining with detection and segmentation objectives
- Stage-2: Joint training of the planner and latent world model with temporal consistency constraints
- Encoder: facebook/dinov2-large (frozen), providing 1024-dimensional features
- Dynamics Network: MLP layers with residual connections and layer normalization (2.4M parameters)
- Optimization: AdamW (β₁=0.9, β₂=0.999), cosine LR schedule (initial lr=1e-4), λ_wm=0.7
- Batch configuration: 4–8 sequential frames for temporal pairing, 16 scenes per batch
- Regularization: weight decay=0.01, gradient clipping at norm=1.0, dropout=0.1
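The optimization settings above can be wired up as follows. This is a sketch under stated assumptions: `model` is a stand-in for the Stage-2 network, and the epoch count for the cosine schedule is illustrative.

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the Stage-2 network
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01,
)
# Cosine learning-rate schedule (T_max of 100 epochs is illustrative)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping at norm=1.0, as in the regularization settings
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

x = torch.randn(16, 1024)
training_step(model(x).pow(2).mean())
scheduler.step()  # decay the learning rate along the cosine curve
```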
The auxiliary self-supervised objective combines cosine similarity and L2 distance between predicted and target latents:

$$\mathcal{L}_{\text{wm}} = \alpha \left( 1 - \cos\!\left( \hat{z}_{t+1}, \, z_{t+1}^{\text{tgt}} \right) \right) + \beta \left\lVert \hat{z}_{t+1} - z_{t+1}^{\text{tgt}} \right\rVert_2^2$$

The cosine term enforces structural similarity in the latent space, while the L2 term ensures metric accuracy. This dual objective helps balance semantic coherence with geometric precision in the predicted representations.

Typical weights: $\alpha = 1.0$, $\beta = 0.5$, $\lambda_{\text{wm}} \approx 0.7$.

The total training loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SparseDrive}} + \lambda_{\text{wm}} \, \mathcal{L}_{\text{wm}}$$

This encourages the planner's internal features to evolve smoothly in time, guided by realistic latent transitions. The weighting factor $\lambda_{\text{wm}}$ was determined through ablation studies to balance planning performance and world-model accuracy.
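A minimal sketch of this combined objective in PyTorch. Variable names are illustrative, and `loss_plan` is a stand-in for SparseDrive's existing task losses, which are computed elsewhere in the pipeline.

```python
import torch
import torch.nn.functional as F

def world_model_loss(z_pred, z_tgt, alpha=1.0, beta=0.5):
    """Cosine + L2 objective between predicted and target latents."""
    cos_term = 1.0 - F.cosine_similarity(z_pred, z_tgt, dim=-1).mean()
    l2_term = F.mse_loss(z_pred, z_tgt)
    return alpha * cos_term + beta * l2_term

z_pred = torch.randn(8, 1024, requires_grad=True)  # predicted latents
z_tgt = torch.randn(8, 1024)       # frozen target latents (no gradient)
loss_plan = torch.tensor(0.0)      # stand-in for SparseDrive task losses
lam_wm = 0.7
total = loss_plan + lam_wm * world_model_loss(z_pred, z_tgt)
total.backward()  # gradients flow only into the prediction branch
```

Because the targets come from a frozen encoder, the loss supervises only the dynamics branch; no extra annotations are needed.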
NuScenes-Mini: A compact version of the full dataset with 10 scenes (~3.5 GB)
| Category | Metric | Enhanced (WM) | Vanilla SparseDrive | Delta |
|---|---|---|---|---|
| Detection | mAP | 0.4021 | 0.4138 | −2.8 % |
| Detection | NDS | 0.4478 | 0.4512 | −0.8 % |
| Tracking | AMOTA | 0.5501 | 0.5253 | +4.7 % |
| Tracking | AMOTP ↓ | 1.0319 | 1.0510 | −1.8 % |
| Tracking | MOTA | 0.5495 | 0.5361 | +2.5 % |
| Tracking | ID Switches ↓ | 64 | 108 | −40 % |
| Mapping | Boundary | 0.3940 | 0.3787 | +4.0 % |
| Forecasting | Ped EPA | 0.4447 | 0.3918 | +13.5 % |
| Forecasting | Car minADE ↓ | 0.4591 | 0.4682 | −1.9 % |
| Planning | Collision Rate ↓ | 1.127 % | 2.979 % | −62 % |
| Planning | L2 Error ↓ | 3.20 m | 3.74 m | −14 % |
NuScenes trainval split: 139 training scenes and 31 validation scenes (~60 GB total)
| Category | Metric | Enhanced (WM) | Vanilla SparseDrive | Delta |
|---|---|---|---|---|
| Detection | mAP | 0.3650 | 0.3860 | −5.4 % |
| Detection | NDS | 0.4444 | 0.4441 | +0.1 % |
| Tracking | AMOTA | 0.2404 | 0.2760 | −12.9 % |
| Tracking | AMOTP ↓ | 1.4029 | 1.4086 | −0.4 % |
| Tracking | MOTA | 0.2537 | 0.2787 | −9.0 % |
| Tracking | ID Switches ↓ | 573 | 1466 | −60.9 % |
| Mapping | mAP_normal | 0.5543 | 0.5221 | +6.2 % |
| Forecasting | Ped EPA | 0.3381 | 0.3949 | −14.4 % |
| Forecasting | Car EPA | 0.3948 | 0.4303 | −8.2 % |
| Planning | Collision Rate ↓ | 0.556 % | 1.042 % | −46.6 % |
| Planning | L2 Error ↓ | 4.0295 m | 2.8978 m | +39.1 % |
Key Observations Across Both Datasets
- Temporal consistency: both datasets show substantially fewer ID switches (−40 % on mini, −61 % on trainval), suggesting the world model helps maintain consistent object identities over time.
- Safety-critical metrics: the enhanced model shows a 47–62 % reduction in collision rate across both test sets, suggesting substantial improvements in planning safety.
- Trade-offs: while detection performance remains competitive, the world-model approach appears to prioritize collision avoidance over absolute trajectory precision, most visibly on the trainval split, where L2 error increases.
- Spatial understanding: the enhanced model shows consistent improvements in mapping performance, indicating better scene-structure comprehension.
- Scaling behavior: the collision-rate and ID-switch improvements persist on the roughly 17× larger validation set, although tracking AMOTA and forecasting EPA degrade there, so scaling behavior is mixed and merits further study.

These results suggest meaningful improvements in safety-critical areas; the consistent reduction in collision rates across dataset scales is particularly encouraging. Our experiments continue as we refine both the baseline and enhanced models across more diverse driving scenarios.
- Python 3.8+
- CUDA 11.6+ and cuDNN
- PyTorch 1.13.0
- mmcv_full 1.7.1
- mmdet 2.28.2
- numpy 1.23.5
- transformers 4.46.3 (for DINOv2 integration)
- flash-attn 2.3.2
- nuscenes-devkit 1.1.10
Follow these steps to set up the development environment:

```shell
# Clone this repository
git clone https://github.com/ahmeddawy/SparseDrive_WorldModel.git
cd SparseDrive_WorldModel

# Create a conda environment (recommended for isolation)
conda create -n sparsedrive_lwm python=3.8 -y
conda activate sparsedrive_lwm

# Set path and install PyTorch with CUDA 11.6 support
sparsedrive_path="$(pwd)"  # Get absolute path to repository
cd ${sparsedrive_path}
pip3 install --upgrade pip
pip3 install torch==1.13.0+cu116 torchvision==0.14.0+cu116 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu116
pip3 install -r requirement.txt

# Compile the custom CUDA operators
# This step is essential for the deformable attention mechanism
cd projects/mmdet3d_plugin/ops
python3 setup.py develop
cd ../../../

# Verify installation by running a sanity check
python -c "import torch; import mmdet; import mmdet3d; print('Installation successful!')"
```

Note: CUDA 11.6 is required. If you have a different CUDA version, adjust the PyTorch installation URLs accordingly.
- Download the NuScenes dataset from the official website
- Link or place the dataset in `data/nuscenes/`
- Prepare the info files:

```shell
bash scripts/create_data.sh
```

- Generate K-means clusters for anchors:

```shell
bash scripts/kmeans.sh
```

The training process is divided into two stages:
Stage 1: Perception pre-training (can be skipped if using provided checkpoint)

```shell
bash ./tools/dist_train.sh \
    projects/configs/sparsedrive_small_stage1.py \
    1 \
    --deterministic
```

Stage 2: Joint training with Latent World Model

```shell
bash ./tools/dist_train.sh \
    projects/configs/sparsedrive_small_stage2.py \
    1 \
    --deterministic
```

Evaluate the model on the NuScenes mini dataset:
```shell
# Set the environment variable to use mini dataset
export NUSCENES_VERSION=mini

bash ./tools/dist_test.sh \
    projects/configs/sparsedrive_small_stage2.py \
    work_dirs/sparsedrive_small_stage2/latest.pth \
    1 \
    --deterministic \
    --eval bbox
```

Visualize detection, tracking, mapping, and planning results:
```shell
export PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
python tools/visualization/visualize.py \
    projects/configs/sparsedrive_small_stage2.py \
    --result-path work_dirs/sparsedrive_small_stage2/results.pkl
```

The visualization outputs will be saved in the `vis/` directory.