Repository for the image-to-recipe retrieval project in the course Advanced Deep Learning in Computer Vision (02501) at DTU. You can find our poster here!
The goal of image-to-recipe retrieval is to retrieve a recipe (from a list of known recipes) given an image query. We learn multi-modal representations of food from images and recipes by projecting both the image and the text recipe into a high-dimensional latent (vector) space using the encoder part of Transformer models. For any given image, the "correct" recipe can then be retrieved by choosing the recipe vector with the lowest distance to the image vector in this latent space.
The image and text Transformers are trained simultaneously with the triplet loss to embed images and texts into the same latent space, as visualised below. Training is self-supervised: no explicit labels are given to the model; instead, the pairing of each image with its recipe provides the supervision signal through the triplet loss.
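As a rough illustration of the objective (the function name, margin value, and distance choice below are assumptions, not the exact implementation in this repository), the triplet loss pulls an image embedding towards its matching recipe embedding and pushes it away from a non-matching one:

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss on L2-normalised embeddings.

    anchor:   image embeddings,               shape (B, D)
    positive: matching recipe embeddings,     shape (B, D)
    negative: non-matching recipe embeddings, shape (B, D)
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)

    d_pos = (anchor - positive).pow(2).sum(dim=-1)  # distance to the correct recipe
    d_neg = (anchor - negative).pow(2).sum(dim=-1)  # distance to a wrong recipe
    return F.relu(d_pos - d_neg + margin).mean()
```

PyTorch also provides this objective as torch.nn.TripletMarginLoss, which can be used in place of a hand-rolled version.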
Learning a common embedding space means we can also perform the opposite task: recipe-to-image retrieval.
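In either direction, retrieval then amounts to a nearest-neighbour search in the shared space. A minimal sketch, assuming Euclidean distance over precomputed embeddings (the helper below is illustrative, not part of this repository):

```python
import torch

@torch.no_grad()
def retrieve(query_emb, candidate_embs, k=5):
    """Return the indices of the k candidates closest to the query.

    query_emb:      (D,)   embedding of one image (or recipe)
    candidate_embs: (N, D) embeddings of all recipes (or images)
    """
    dists = torch.cdist(query_emb.unsqueeze(0), candidate_embs).squeeze(0)  # (N,)
    return torch.topk(dists, k, largest=False).indices
```

Image-to-recipe retrieval queries with an image embedding over all recipe embeddings; recipe-to-image retrieval simply swaps the roles of the two sets.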
Create an environment with Python 3.10 and install the dependencies:
pip install -r requirements.txt
The Food Ingredients and Recipes dataset can be downloaded here. To train a model, run the following:
python src/models/train_model.py --image_size 224 --model_name "models/best_model_ever_3.pt" --mode 3 --pretrained
The --mode flag controls the text modality: 1 uses only the title, 2 uses title + ingredients, and 3 uses title + ingredients + instructions.
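Conceptually, the flag selects which recipe fields are concatenated into the text seen by the text encoder, roughly as in the sketch below (the function and field names are assumptions about the dataset, not the actual code in train_model.py):

```python
def build_recipe_text(recipe, mode):
    """Assemble the recipe string fed to the text encoder.

    recipe: dict with (assumed) keys "title", "ingredients", "instructions"
    mode:   1 = title, 2 = title + ingredients, 3 = title + ingredients + instructions
    """
    parts = [recipe["title"]]
    if mode >= 2:
        parts.append(recipe["ingredients"])
    if mode >= 3:
        parts.append(recipe["instructions"])
    return " ".join(parts)
```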
Get results by running
python src/models/predict_model.py
followed by
python src/visualization/visualize.py
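Retrieval performance in this setting is commonly summarised by the median rank of the correct recipe and Recall@K over the test set. The sketch below shows how such metrics can be computed from paired embeddings; it illustrates the standard metrics and is not necessarily the exact output of predict_model.py:

```python
import torch

@torch.no_grad()
def retrieval_metrics(image_embs, recipe_embs, ks=(1, 5, 10)):
    """Median rank and Recall@K for image-to-recipe retrieval.

    image_embs, recipe_embs: (N, D) tensors where row i of each forms a matching pair.
    """
    dists = torch.cdist(image_embs, recipe_embs)     # (N, N) pairwise distances
    ranking = dists.argsort(dim=1)                   # recipes sorted by distance per image
    target = torch.arange(len(image_embs)).unsqueeze(1)
    ranks = (ranking == target).nonzero()[:, 1] + 1  # 1-based rank of the correct recipe
    metrics = {"median_rank": ranks.float().median().item()}
    for k in ks:
        metrics[f"recall@{k}"] = (ranks <= k).float().mean().item()
    return metrics
```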