This is the repository associated with the NLP project for the MVA course Algorithms for Speech and NLP.
This is a group project realized by:
- Gabriel Watkinson
- Josselin Dubois
- Marine Astruc
- Javier Ramos Gutiérrez
Each task was carried out separately, and due to time constraints the code is not fully harmonized.
Don't hesitate to contact any of us if you have any questions.
- Clone the repository.
git clone https://github.com/gwatkinson/mva_snlp_canine
- Install the project and its dependencies, creating a virtual environment with poetry (you need to install Poetry beforehand). If you prefer to use an existing environment, just activate it and run the same command:
poetry install
- Activate the created environment if needed.
source $(poetry env info --path)/bin/activate # for linux
# & ((poetry env info --path) + "\Scripts\activate.ps1") # for windows powershell
# poetry shell # or this spawns a new shell
- Install pre-commit, if you are planning to add code.
pre-commit install
- Use PyTorch with GPU support (optional). Run this if PyTorch doesn't detect your GPU. It reinstalls PyTorch in the virtual environment and needs to be rerun after each modification of the environment.
poe torch_cuda
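The setup steps above can be chained into a single script. A minimal sketch for Linux, assuming Poetry is already installed (the pre-commit and GPU steps are optional):

```shell
# Clone the repository and enter it
git clone https://github.com/gwatkinson/mva_snlp_canine
cd mva_snlp_canine

# Create the virtual environment and install the dependencies
poetry install

# Activate the created environment
source "$(poetry env info --path)/bin/activate"

# Optional: hooks for contributors, and GPU-enabled PyTorch
pre-commit install
poe torch_cuda
```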
In this section, we will describe how to reproduce the experiments for the NLI task.
All the functions and configs used for those experiments are in the mva_snlp_canine/nli
folder.
To run all the experiments without changing anything (this takes a very long time), run:
source scripts/nli_run_all.sh
Otherwise, follow the steps below to create or run another experiment.
The experiments can be configured from config files.
To generate a generic config file, with prompts for the main options, run:
nli_create_config [OPTIONS] EXPERIMENT_NAME
Usage:
nli_create_config [OPTIONS] EXPERIMENT_NAME
Command that creates a config file to run an experiment.
Options:
--train_languages_subset TEXT Languages to use for training [default: en]
--save_local BOOLEAN Save the processed dataset locally
[default: True]
--push_to_hub BOOLEAN Push the processed dataset to the
HuggingFace Hub [default: False]
--huggingface_username TEXT HuggingFace username [default: Gwatk]
--num_train_samples INTEGER Number of samples to use for training
[default: 300000]
--num_val_samples INTEGER Number of samples to use for validation
[default: 2490]
--num_test_samples INTEGER Number of samples to use for testing
[default: 5000]
--num_train_epochs INTEGER Number of training epochs [default: 5]
--learning_rate FLOAT Learning rate [default: 0.0001]
--batch_size INTEGER Batch size [default: 8]
--gradient_accumulation_steps INTEGER
Number of gradient accumulation steps
[default: 4]
--fp16 BOOLEAN Whether to use mixed-precision training
[default: True]
--help Show this message and exit.
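As a concrete illustration, a hypothetical experiment trained on English and French with a smaller sample budget could be configured as follows (the experiment name and values are illustrative, and we assume the languages flag accepts a comma-separated list, matching the defaults shown by the other commands):

```shell
nli_create_config en_fr_small \
    --train_languages_subset en,fr \
    --num_train_samples 100000 \
    --batch_size 16 \
    --gradient_accumulation_steps 2 \
    --fp16 True
```

This writes a config file under the mva_snlp_canine/nli/configs folder, which you can then edit before launching the run.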
Then, you should look into the newly created file in the mva_snlp_canine/nli/configs
folder and change additional options if needed (especially the training arguments).
From a config file, you just need to run:
nli_run_experiment EXPERIMENT_NAME
This will train the models, and can be quite long.
A bash script in the scripts folder also reproduces all the experiments mentioned in our report:
source scripts/run_nli_exps.sh
Once the model is trained, to evaluate it on the test set and on all languages, run:
nli_evaluate_experiment EXPERIMENT_NAME
The associated script is:
source scripts/evaluate_nli_exps.sh
The evaluation step returns a dataframe. To visualize the results, run:
nli_visualise_results [OPTIONS] EXP_NAME
Options:
--num TEXT Number of samples used in the training set, optional.
--languages TEXT Languages used in the training set, optional.
--attacked         Whether to visualise attacked metrics [default: False]
--help Show this message and exit.
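For instance, visualising a hypothetical experiment trained on 100000 English and French samples might look like this (the experiment name and values are illustrative):

```shell
nli_visualise_results en_fr_small \
    --num 100000 \
    --languages en,fr
```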
The associated script is:
source scripts/visualise_nli_results.sh
Lastly, we also used nlpaug
to generate some perturbed inputs and then look at how the model reacts.
To generate these perturbed datasets and evaluate the model on them, run:
nli_augmented_dataset [OPTIONS] EXP_NAME
Evaluate the experiment in the given directory.
Options:
--language_subset TEXT The languages to evaluate the model on. Options are
["ar", "bg", "de", "el", "en", "es", "fr", "hi",
"ru", "sw", "th", "tr", "ur", "vi", "zh"] [default:
en,fr]
--help Show this message and exit.
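For example, evaluating a hypothetical experiment on perturbed German and English inputs (the experiment name is illustrative; the flag and language codes are those listed above):

```shell
nli_augmented_dataset en_fr_small --language_subset de,en
```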
Then you can use the previous command to generate plots, using the --attacked
flag.
The associated script is:
source scripts/evaluate_nli_attacks.sh