This project implements a multilingual Natural Language Inference (NLI) pipeline using the XLM-Roberta model. It predicts the relationship between a premise and a hypothesis (contradiction, neutral, entailment) across multiple languages. The workflow leverages PyTorch and HuggingFace Transformers for model training, evaluation, and prediction.
- Achieved >91% accuracy on the test submission data.
- Kaggle Data: Contradictory, My Dear Watson
- My Code on Kaggle: Hypothesis Prediction Using XLM-Roberta
- Multilingual NLI using XLM-Roberta (XNLI)
- Stratified train/eval split by language and label
- Weighted loss for handling class imbalance
- Custom PyTorch Dataset and Sampler for balanced multilingual batches
- Training loop with accuracy tracking and model checkpointing
- Evaluation with classification report and confusion matrix
- Test set prediction and CSV submission generation
- Data visualization for language distribution
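The balanced multilingual batching listed above can be sketched as follows. This is an illustrative implementation, not the notebook's actual Dataset/Sampler classes: it interleaves indices drawn round-robin from each (language, label) group so every batch mixes languages and classes.

```python
import random
from collections import defaultdict

def balanced_batches(languages, labels, batch_size, seed=0):
    """Yield batches of dataset indices drawn round-robin from
    (language, label) groups so each batch mixes languages and labels.
    `languages` and `labels` are parallel lists, one entry per example."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, (lang, lab) in enumerate(zip(languages, labels)):
        groups[(lang, lab)].append(idx)
    for group in groups.values():
        rng.shuffle(group)
    # Interleave the groups, then chunk the interleaved stream into batches.
    interleaved = []
    keys = list(groups)
    while any(groups[k] for k in keys):
        for k in keys:
            if groups[k]:
                interleaved.append(groups[k].pop())
    for i in range(0, len(interleaved), batch_size):
        yield interleaved[i : i + batch_size]
```

A PyTorch version would wrap the same logic in a `torch.utils.data.Sampler` subclass and yield the indices from `__iter__`.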
- `hypothesis-prediction-using-xlm-roberta.ipynb`: Main Jupyter notebook containing all code and workflow
- `LICENSE`: Project license
- `README.md`: Project documentation
- Python 3.7+
- PyTorch
- Transformers (HuggingFace)
- pandas, numpy, seaborn, matplotlib, scikit-learn
Install dependencies (drop the leading `!` when running in a terminal instead of a notebook):

```bash
!pip install transformers torch pandas numpy seaborn matplotlib scikit-learn
```
- Train/Eval Data: `train.csv` from Kaggle's "Contradictory, My Dear Watson" competition
- Test Data: `test.csv` from the same source
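The stratified train/eval split by language and label can be done with scikit-learn by stratifying on a combined key. The column names below (`language`, `label`) are assumed to match the competition CSV; adjust if your copy differs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(df, test_size=0.2, seed=42):
    """Split df into train/eval sets while preserving the joint
    distribution of language and label in both halves."""
    strata = df["language"] + "_" + df["label"].astype(str)
    return train_test_split(
        df, test_size=test_size, stratify=strata, random_state=seed
    )
```

Stratifying on the combined key (rather than label alone) keeps rare language/label combinations represented in the evaluation set.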
- Open the notebook: `hypothesis-prediction-using-xlm-roberta.ipynb`
- Run all cells: The notebook will install dependencies, load data, preprocess, train, evaluate, and generate predictions.
- Model Training: The notebook trains XLM-Roberta on the NLI task, saving the best model based on evaluation accuracy.
- Evaluation: Prints a classification report and confusion matrix for the validation set.
- Test Prediction: Loads test data, makes predictions, and saves results to `submission.csv`.
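Writing the submission file is a small step; a minimal sketch, assuming Kaggle's two-column format (`id`, `prediction`) for this competition:

```python
import pandas as pd

def write_submission(test_ids, predictions, path="submission.csv"):
    """Write predicted class ids alongside test example ids in the
    two-column CSV format expected for the Kaggle submission."""
    pd.DataFrame({"id": test_ids, "prediction": predictions}).to_csv(
        path, index=False
    )
```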
- Imports & Setup: Installs and imports required libraries
- Data Loading & Visualization: Loads CSVs, analyzes and visualizes language distribution
- Model & Tokenizer: Loads XLM-Roberta model and tokenizer
- Custom Dataset & Sampler: Defines classes for balanced multilingual batching
- Training Loop: Trains model, tracks metrics, saves best checkpoint
- Evaluation & Reporting: Evaluates model, prints metrics
- Test Prediction: Generates predictions for test set
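The weighted loss used in the training loop is typically built from inverse class frequencies. A sketch with NumPy (the exact weighting scheme in the notebook may differ); the resulting array can be passed to `torch.nn.CrossEntropyLoss(weight=...)`:

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes=3):
    """Return per-class weights inversely proportional to class
    frequency, normalized so the weights average to 1. Rare classes
    get larger weights, countering class imbalance in the loss."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = counts.sum() / (num_classes * counts)
    return weights / weights.mean()
```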
- Adjust `MODEL_NAME` to use different transformer models
- Change `max_length`, `batch_size`, or `num_epochs` for experimentation
- Modify data paths for local or cloud environments
- The notebook provides detailed metrics and visualizations for model performance across languages.
- Final predictions are saved in `submission.csv` for Kaggle submission.
This project is licensed under the terms of the LICENSE file.