The official repository for CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information.
Vision-and-language navigation (VLN) aims to guide autonomous agents through complex environments by integrating visual and linguistic inputs. While various methods have been proposed to combine these modalities, most existing VLN datasets focus on ground-level navigation. This leaves aerial navigation largely unexplored due to the lack of suitable training and evaluation resources. To address this gap, we introduce CityNav, a new dataset for language-goal aerial navigation using a 3D point cloud representation of real cities. It includes 32,637 natural language descriptions matched with demonstration trajectories, collected from human participants through our newly developed web-based 3D simulator. Each description specifies a navigation goal, leveraging the names and locations of landmarks within the real-world city. We also provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the instructions. We benchmark the latest aerial navigation baselines and our proposed model on the CityNav dataset. The results reveal the following key findings: (i) The proposed models trained on human demonstration trajectories outperform those trained on shortest path trajectories, highlighting the importance of human-driven navigation strategies. (ii) Integrating a 2D spatial map significantly enhances navigation efficiency at city scale.
Please check out the project website at https://water-cookie.github.io/city-nav-proj/ .
This code was developed with Python 3.10, PyTorch 2.2.2, and CUDA 11.8 on Ubuntu 22.04.
To set up the environment, create the conda environment and install PyTorch.
conda create -n mgp python=3.10 &&
conda activate mgp &&
conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
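After the install, an optional one-liner confirms that PyTorch sees CUDA and the GPU:

```python
# Optional sanity check of the PyTorch/CUDA install.
import torch

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
```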
Then install Set-of-Marks and its dependencies.
conda install mpi4py
pip install git+https://github.com/water-cookie/Segment-Everything-Everywhere-All-At-Once.git@package
pip install git+https://github.com/water-cookie/Semantic-SAM.git@package
pip install git+https://github.com/facebookresearch/segment-anything.git
git clone https://github.com/water-cookie/SoM.git &&
cd SoM/ops && ./make.sh && cd .. &&
pip install --editable . && cd ..
Next, install LLaVA and Grounding DINO.
pip install git+https://github.com/water-cookie/LLaVA.git
pip install git+https://github.com/IDEA-Research/GroundingDINO.git
pip install git+https://github.com/ChaoningZhang/MobileSAM.git
Once LLaVA and Grounding DINO are installed, install the dependencies for CityNav.
pip install -r requirements.txt
Finally, the weights can be downloaded by running the following script.
sh scripts/download_weights.sh
The downloaded weight files should be organized in the following hierarchy.
CityNav/
├─ weights/
│ ├─ groundingdino/
│ │ ├─ groundingdino_swinb_cogcoor.pth
│ │ ├─ groundingdino_swint_ogc.pth
│ ├─ mobile_sam/
│ │ ├─ mobile_sam.pt
│ ├─ som/
│ │ ├─ sam_vit_h_4b8939.pth
│ │ ├─ seem_focall_v1.pt
│ │ ├─ swinl_only_sam_many2many.pth
│ ├─ vlnce/
│ │ ├─ ddppo-models/
│ │ │ ├─ gibson-2plus-resnet50.pth
│ │ │ ├─ ...
│ │ ├─ R2R_VLNCE_v1-3_preprocessed/
│ │ │ ├─ embeddings.json.gz
│ │ │ ...
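A small path check can confirm the layout; the list below simply mirrors the tree above (the abbreviated vlnce/ entries are omitted, so adjust it if your paths differ):

```python
# Sanity check that the main checkpoint files are where the tree above expects them.
from pathlib import Path

expected = [
    "weights/groundingdino/groundingdino_swinb_cogcoor.pth",
    "weights/groundingdino/groundingdino_swint_ogc.pth",
    "weights/mobile_sam/mobile_sam.pt",
    "weights/som/sam_vit_h_4b8939.pth",
    "weights/som/seem_focall_v1.pt",
    "weights/som/swinl_only_sam_many2many.pth",
]
for path in expected:
    print(("ok      " if Path(path).exists() else "missing ") + path)
```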
To run the CMA/Seq2Seq baselines, set up a separate environment with the following commands instead.
conda create -n vlnce python=3.10 &&
conda activate vlnce &&
conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia &&
pip install gymnasium opencv-python pillow rasterio shapely tqdm transformers wandb &&
pip install msgpack-rpc-python &&
pip install airsim
The dataset can be downloaded with the following script.
sh scripts/download_data.sh
Download the SensatUrban dataset and run the following script to rasterize the point clouds.
sh scripts/rasterize.sh path_to_ply_dir/train
sh scripts/rasterize.sh path_to_ply_dir/test
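For reference, the sketch below illustrates the general idea of top-down rasterization of a colored point cloud. It is not the implementation in scripts/rasterize.sh; the PLY field names (x/y/z, red/green/blue), the 1 m/pixel resolution, and the PNG + GeoTIFF output layout are assumptions.

```python
# Minimal top-down rasterization sketch: highest point per pixel wins (simple z-buffer).
import numpy as np
import rasterio
from PIL import Image
from plyfile import PlyData

def rasterize_ply(ply_path, rgb_png, height_tiff, meters_per_pixel=1.0):
    v = PlyData.read(ply_path)["vertex"]
    x, y, z = np.asarray(v["x"]), np.asarray(v["y"]), np.asarray(v["z"])
    rgb = np.stack([v["red"], v["green"], v["blue"]], axis=-1).astype(np.uint8)

    # World coordinates -> pixel grid indices.
    col = ((x - x.min()) / meters_per_pixel).astype(int)
    row = ((y.max() - y) / meters_per_pixel).astype(int)
    h, w = row.max() + 1, col.max() + 1

    # Sort by height so the highest point in each pixel is written last.
    order = np.argsort(z)
    flat = row[order] * w + col[order]
    rgb_img = np.zeros((h * w, 3), np.uint8)
    z_img = np.full(h * w, np.nan, np.float32)
    rgb_img[flat] = rgb[order]
    z_img[flat] = z[order]

    # Write the top-down RGB image and the height raster.
    Image.fromarray(rgb_img.reshape(h, w, 3)).save(rgb_png)
    with rasterio.open(height_tiff, "w", driver="GTiff",
                       height=h, width=w, count=1, dtype="float32") as dst:
        dst.write(z_img.reshape(h, w), 1)
```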
The dataset and images should be placed in the directories presented below.
citynav/
├─ data/
│ ├─ cityrefer/
│ │ ├─ objects.json
│ │ ├─ processed_descriptions.json
│ ├─ citynav/
│ │ ├─ citynav_train_seen.json
│ │ ├─ ...
│ ├─ rgbd/
│ │ ├─ birmingham_block_0.png
│ │ ├─ birmingham_block_0.tiff
│ │ ├─ ...
│ ├─ gsam/
│ │ ├─ full_scan_(100, 240, 410).npz
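To sanity-check the rasterized inputs, a quick inspection like the one below should work; that the .png holds the top-down RGB image and the .tiff the height/depth raster is an assumption based on the rgbd/ naming.

```python
# Peek at one rasterized city block.
import numpy as np
import rasterio
from PIL import Image

rgb = np.array(Image.open("data/rgbd/birmingham_block_0.png"))
with rasterio.open("data/rgbd/birmingham_block_0.tiff") as src:
    height_map = src.read(1)  # first band

print(rgb.shape, height_map.shape, float(np.nanmin(height_map)), float(np.nanmax(height_map)))
```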
Run the following script to train the MGP model on the human-collected trajectories.
python main_goal_predictor.py \
--mode train \
--altitude 50 \
--gsam_use_segmentation_mask \
--gsam_box_threshold 0.20 \
--learning_rate 0.0015 \
--train_batch_size 12 \
--train_trajectory_type mturk
Once the checkpoint has been saved, evaluate it with the following script.
python main_goal_predictor.py \
--mode eval \
--altitude 50 \
--gsam_use_segmentation_mask \
--gsam_box_threshold 0.20 \
--eval_batch_size 200 \
--eval_max_timestep 20 \
--checkpoint path/to/checkpoint
The target and surrounding maps used for training are cached in data/gsam/full_scan_(100, 240, 410).npz. To use the cached maps, add the --gsam_use_map_cache argument to the training script.
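If you want to inspect the cache, it is a regular NumPy archive; the key names inside are not documented here, so the snippet below just enumerates them.

```python
# List the arrays stored in the cached map file.
import numpy as np

cache = np.load("data/gsam/full_scan_(100, 240, 410).npz", allow_pickle=True)
for key in cache.files:
    arr = cache[key]
    print(key, getattr(arr, "shape", type(arr)))
```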
In the table below, NE is the navigation error in meters, SR the success rate, OSR the oracle success rate, and SPL success weighted by path length; "w/ SP" and "w/ HD" denote models trained on shortest-path and human demonstration trajectories, respectively.

Baselines | NE (m) | SR (%) | OSR (%) | SPL (%) | Checkpoints |
---|---|---|---|---|---|
Seq2Seq w/ SP | 174.5 | 1.73 | 8.57 | 1.69 | 💾 |
Seq2Seq w/ HD | 245.3 | 1.50 | 8.34 | 1.30 | 💾 |
CMA w/ SP | 179.1 | 1.61 | 10.07 | 1.57 | 💾 |
CMA w/ HD | 252.6 | 0.82 | 9.70 | 0.79 | 💾 |
MGP w/ SP | 109.0 | 4.73 | 17.47 | 4.62 | 💾 |
MGP w/ HD | 93.8 | 6.38 | 26.04 | 6.08 | 💾 |
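SPL is assumed here to follow the standard definition from the VLN literature (success weighted by normalized inverse path length):

$$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i,\ \ell_i)}$$

where $S_i$ indicates success on episode $i$, $\ell_i$ is the shortest-path distance to the goal, and $p_i$ is the length of the path the agent actually took.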
@misc{lee2024citynavlanguagegoalaerialnavigation,
title={CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information},
author={Jungdae Lee and Taiki Miyanishi and Shuhei Kurita and Koya Sakamoto and Daichi Azuma and Yutaka Matsuo and Nakamasa Inoue},
year={2024},
eprint={2406.14240},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.14240},
}
- CityNav Dataset : CC BY 4.0
- Codebase : MIT License
We would like to express our gratitude to the authors of the following codebases.