Multiple Languages and Modalities (MLM) is a dataset consisting of text in three languages (EN, FR, DE), images, location data, and triple classes; a hypothetical sample layout is sketched below the task list. The resource is designed to evaluate how well multitask learning systems generalise over diverse data. The paper defines a benchmark evaluation consisting of the following tasks:
- Cross-modal retrieval
- Location estimation
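Purely as a rough illustration, each MLM entity groups these modalities together. The field names and values below are assumptions for illustration only and do not reflect the actual HDF5 schema shipped with the dataset.

```python
# Hypothetical layout of one MLM sample; all field names and values are
# illustrative assumptions, not the dataset's real schema.
sample = {
    "summaries": {          # entity descriptions in the three languages
        "en": "...",
        "fr": "...",
        "de": "...",
    },
    "image": "images/0001.jpg",         # image associated with the entity
    "coordinates": (52.5200, 13.4050),  # latitude, longitude
    "triple_class": "...",              # class label derived from KG triples
}
```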
IR+LE is an architecture for a multitask learning system designed as a baseline for the above benchmark. The pipeline for cross-modal retrieval extends an approach proposed by Marin et al. (http://im2recipe.csail.mit.edu/im2recipe-journal.pdf).
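The actual IR branch is defined in the repository code; the snippet below is only a minimal sketch of the joint-embedding idea behind the Marin et al. approach (project each modality into a shared space and rank by cosine similarity), assuming placeholder feature dimensions and an in-batch triplet ranking loss rather than the repo's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Sketch only: project text and image features into one shared space.
    Feature dimensions are illustrative assumptions, not the repo's values."""
    def __init__(self, text_dim=768, image_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)

    def forward(self, text_feats, image_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        return t, v

def retrieval_loss(t, v, margin=0.3):
    """Bidirectional triplet-style ranking loss over in-batch negatives."""
    sim = t @ v.t()                          # cosine similarities (B x B)
    pos = sim.diag().unsqueeze(1)            # similarity of matching pairs
    cost_t2v = (margin + sim - pos).clamp(min=0)      # text -> image ranking
    cost_v2t = (margin + sim - pos.t()).clamp(min=0)  # image -> text ranking
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_t2v.masked_fill(mask, 0).mean() + cost_v2t.masked_fill(mask, 0).mean()
```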
Multitask IR+LE Framework
- Python version >= 3.7
- PyTorch version >= 1.4.0
# clone the repository
git clone https://github.com/GOALCLEOPATRA/MLM.git
cd MLM
# install dependencies
pip install -r requirements.txt
Download the dataset hdf5 files from here and place them under the data folder.
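After downloading, a quick sanity check is to open the files with h5py and list their contents. The file name below is a placeholder, so substitute the names of the downloaded files.

```python
import h5py

# Placeholder file name: replace with one of the downloaded MLM hdf5 files.
with h5py.File("data/mlm_dataset.h5", "r") as f:
    # Print each dataset's path, shape, and dtype to verify the download.
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```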
Multitask Learning (IR + LE)
python train.py --task mtl
Cross-modal retrieval task
python train.py --task ir
Location estimation task
python train.py --task le
For setting other arguments (e.g. epochs, batch size, dropout), please check args.py.
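The exact training objective lives in train.py; the snippet below is only a minimal sketch of how a multitask IR + LE objective is commonly combined, assuming the ranking loss sketched earlier, location estimation treated as classification over location cells, and an illustrative 0.5 task weight (none of these are confirmed to match the repo's implementation).

```python
import torch.nn as nn

# Sketch of combining the two task losses. `retrieval_loss` is the ranking
# loss sketched earlier; the CrossEntropyLoss framing and the 0.5 weight are
# illustrative assumptions, not the values used in train.py.
le_criterion = nn.CrossEntropyLoss()

def multitask_loss(text_emb, image_emb, le_logits, le_targets, le_weight=0.5):
    ir = retrieval_loss(text_emb, image_emb)   # cross-modal retrieval term
    le = le_criterion(le_logits, le_targets)   # location estimation term
    return ir + le_weight * le
```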
Multitask Learning (IR + LE)
python test.py --task mtl
Cross-modal retrieval task
python test.py --task ir
Location estimation task
python test.py --task le
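test.py computes the reported scores; as a rough sketch of standard cross-modal retrieval metrics (median rank and recall@K over a text-to-image similarity matrix), assuming L2-normalised embeddings like those produced by the joint-embedding sketch above:

```python
import torch

def retrieval_metrics(t_emb, v_emb, ks=(1, 5, 10)):
    """Median rank and recall@K for text-to-image retrieval (illustrative sketch)."""
    sim = t_emb @ v_emb.t()                            # similarity of every text to every image
    ranking = sim.argsort(dim=1, descending=True)      # images sorted per text query
    targets = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    ranks = (ranking == targets).nonzero()[:, 1] + 1   # 1-based rank of the true image
    metrics = {"median_rank": ranks.float().median().item()}
    for k in ks:
        metrics[f"recall@{k}"] = (ranks <= k).float().mean().item()
    return metrics
```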
All logs and checkpoints will be saved under the experiments folder.
The repository is under the MIT License.
@INPROCEEDINGS{10030783,
author={Armitage, Jason and Impett, Leonardo and Sennrich, Rico},
booktitle={2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
title={A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues},
year={2023},
volume={},
number={},
pages={1094-1103},
doi={10.1109/WACV56688.2023.00115}}