Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to produce coherent image captions under changing camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement than classical baselines. Our pseudo-captioning method, in combination with all policies, achieves higher semantic similarity than other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.

This is the official repository of the preprint. [arXiv] [webpage]

Table of Contents

  1. Setup
  2. Data
  3. Demo
  4. Contributing
  5. Credits
  6. Enquiries, Questions, and Comments
  7. License

Setup

  • Clone repository: git clone https://github.com/hsp-iit/embodied-captioning.git
  • Clone submodules: git submodule update --init --recursive
  • Create and activate conda env:
conda env create -f env.yml
conda activate SImCa
  • Install habitat:
conda install habitat-sim withbullet -c conda-forge -c aihabitat
cd third_parties/habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
cd -
  • Install detectron2:
cd third_parties
python -m pip install -e detectron2
# Rename folder datasets due to conflicts with transformers.datasets
mv detectron2/datasets detectron2/datasets_det 
cd -
  • Install pyntcloud:
cd third_parties/pyntcloud
python setup.py develop
cd -
  • Install other dependencies: pip install -r requirements.txt
  • Install module: python setup.py develop
  • Set base_dir inside confs/train_policy.yaml to the absolute path of this project (see the sketch below)
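
The configuration step can also be scripted, together with a quick sanity check of the installation. A minimal sketch, assuming base_dir is a top-level key in the YAML and that the commands are run from the repository root:

# Point base_dir at this project's absolute path
sed -i "s|^base_dir:.*|base_dir: $(pwd)|" confs/train_policy.yaml
# Sanity check that the main dependencies import correctly
python -c "import habitat, habitat_sim, detectron2, pyntcloud; print('ok')"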

Data

All the data are contained inside the data directory.

Example: habitat test scenes

  1. Download the test scenes from http://dl.fbaipublicfiles.com/habitat/habitat-test-scenes.zip
  2. Unzip the archive inside the project directory
  • Suggestion: you can keep data separate and use soft links (ln -s /path/to/dataset /path/to/project/data)
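
End to end, the example looks like this (a sketch: the archive is assumed to unpack its contents under data/ inside the current directory):

wget http://dl.fbaipublicfiles.com/habitat/habitat-test-scenes.zip
unzip habitat-test-scenes.zip
# Alternatively, keep datasets elsewhere and symlink them into the project
ln -s /path/to/dataset data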

Scenes datasets

Scene models | Extract path | Archive size
Gibson | data/scene_datasets/gibson/{scene}.glb | 1.5 GB
HM3D | data/scene_datasets/hm3d/{split}/00\d\d\d-{scene}/{scene}.basis.glb | 130 GB

MatterPort

  • Follow the instructions in the main Habitat-lab repository

Gibson

  • Fill in the form at https://stanfordvl.github.io/iGibson/dataset.html to get the Gibson download links
  • Download the Gibson scenes compatible with Habitat (the train_val folder) containing the .glb files
  • Download Gibson tiny with wget https://storage.googleapis.com/gibson_scenes/gibson_tiny.tar.gz
  • Follow the instructions at Habitat-sim to generate the Gibson semantics
  • Place the Gibson semantic .ids and .scn files next to the .glb and .navmesh files in the folder data/scene_datasets/gibson (see the sketch below)
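
Putting the Gibson tiny download together (a sketch: the internal layout of gibson_tiny.tar.gz is an assumption, so check the extracted folders before moving anything):

mkdir -p data/scene_datasets/gibson
wget https://storage.googleapis.com/gibson_scenes/gibson_tiny.tar.gz
tar -xzf gibson_tiny.tar.gz
# Each scene should end up with its .glb, .navmesh, .ids and .scn files
# side by side in data/scene_datasets/gibson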

Tasks

Gibson object navigation (used for the main results of the paper)

You can download the task at the following link {ADD LINK}; unzip it and put it in data/datasets/objectnav/gibson/v1.1 (see the sketch below).
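
Once the link is available, the placement looks like this (a sketch; the archive name is illustrative):

mkdir -p data/datasets/objectnav/gibson/v1.1
unzip objectnav_gibson.zip -d data/datasets/objectnav/gibson/v1.1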

Habitat's pointnav

Task | Scenes | Link | Extract path | Config to use | Archive size
Point goal navigation | Gibson | pointnav_gibson_v1.zip | data/datasets/pointnav/gibson/v1/ | datasets/pointnav/gibson.yaml | 385 MB
Point goal navigation (Sim2LoCoBot experiment configuration) | Gibson | pointnav_gibson_v2.zip | data/datasets/pointnav/gibson/v2/ | datasets/pointnav/gibson_v2.yaml | 274 MB
Point goal navigation | MatterPort3D | pointnav_mp3d_v1.zip | data/datasets/pointnav/mp3d/v1/ | datasets/pointnav/mp3d.yaml | 400 MB

Demo

Launch an experiment

python scripts/run_exp.py runs a policy in train or generate mode; more information below.
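
For example (a sketch: the sections below set mode inside the experiment config rather than on the command line, and which config run_exp.py reads is an assumption to verify in the script itself):

# Set mode: "train" or mode: "generate" in confs/habitat/gibson_goal_exploration.yaml, then:
python scripts/run_exp.py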

Replay experiment

To replay an experiment, use the following command:

python scripts/visualize_exp.py replay.episode_id={EPISODE ID} replay.exp_name={PATH TO EPISODE} replay.modalities="['rgb', 'depth', 'semantic']"

Baselines and environments

The following RL baselines are implemented:

  • goalexplorationbaseline-v0 (state: disagreement_t, map_t, agent pose)

The following classical baselines are implemented:

  • randomgoalsbaseline
  • frontierbaseline-v1 (frontierbaseline-v0 is deprecated)
  • bouncebaseline
  • rotatebaseline
  • randombaseline
  • observeobjectdiscreteactionsbaseline

The following environments are also implemented:

  • SemanticCuriosity-v0 (reward: sum(disagreement_t))
  • SemanticDisagreement-kl (reward: sum(disagreement_t))

Train goalexploration policy

Start from confs/habitat/gibson_goal_exploration.yaml

  • CHECKPOINT_FOLDER: folder in which checkpoints are saved
  • TOTAL_NUM_STEPS: maximum number of training steps
  • under ppo:
    • replanning_steps: how often to run the policy
    • num_global_steps: how often to train the policy
    • save_periodic: how often to save a checkpoint
    • load_checkpoint_path: full path to a checkpoint to load at start
    • load_checkpoint: set to True to load load_checkpoint_path
    • visualize: if True, debug images are shown
    • mode: set to "train"
  • under captioner:
    • gpu_ids: identifiers of the GPUs to use
    • arch_name: name of the architecture to load
    • model_name: name of the model to load
    • checkpoint_name: name of the checkpoint to load
    • debug: if True, print the predicted captions

Generate from goalexploration policy

Start from confs/habitat/gibson_goal_exploration.yaml

  • mode: set to "generate" (the remaining keys are as for training; see the sketch below)
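
To switch modes from the shell, a minimal sketch (assumes the key literally reads mode: "train" in the config; otherwise edit the file by hand):

sed -i 's/mode: *"train"/mode: "generate"/' confs/habitat/gibson_goal_exploration.yaml
python scripts/run_exp.py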

Contributing

If you find an error or want to suggest a new feature or change, use the issues tab to raise an issue with the appropriate label.

Credits

@inproceedings{galliena2025embodied,
  title={Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions},
  author={Galliena, Tommaso and Apicella, Tommaso and Rosa, Stefano and Morerio, Pietro and Del Bue, Alessio and Natale, Lorenzo},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

Enquiries, Questions, and Comments

If you have any further enquiries, questions, or comments, or you would like to file a bug report or a feature request, please use the GitHub issue tracker.

License

This work is licensed under the MIT License. To view a copy of this license, see LICENSE.
