NLP project on Argument Retrieval for comparative questions, trained on ClueWeb12 corpus and stance detection dataset.

Valendrew/argument-retrieval-comparative-questions

 
 

Argument-retrieval-for-comparative-questions

This repository contains the work done for the Natural Language Processing course A.Y. 2022-2023 at the University of Bologna, Master's Degree in Artificial Intelligence.

We chose the project from the "Touché shared tasks at CLEF 2022" website. The synopsis for the challenge was:

Given a comparative topic and a collection of documents, the task is to retrieve relevant argumentative passages for either compared object or for both and to detect their respective stances with respect to the object they talk about.

The project consisted of two tasks: the first is closer to Information Retrieval, while the second is a text classification problem.

1. Document retrieval for comparative questions

As explained above, we need to retrieve the most relevant text passages, with respect to a query, from a corpus of ~850k elements taken from the ClueWeb12 dataset. You can find all the available datasets for the task at this link.

Description of a basic pipeline

Implementation

To perform document retrieval efficiently, we had to create an index. We built our indexes with two different libraries: Pyserini for the sparse indexes and autofaiss for the dense indexes.

We built several indexes over variants of the corpus (e.g. expanded, pre-processed) to get the best results for our pipelines.
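To illustrate what a sparse index does, here is a minimal pure-Python sketch of an inverted index with BM25 scoring. It is not the Pyserini code used in the project: the function names, the toy passages, and the `k1`/`b` values are assumptions chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, doc_len


def bm25_search(query, postings, doc_len, k1=0.9, b=0.4, top_k=3):
    """Rank documents for a query with the BM25 scoring function."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        matches = postings.get(term, {})
        if not matches:
            continue  # term does not occur in the corpus
        idf = math.log(1 + (n_docs - len(matches) + 0.5) / (len(matches) + 0.5))
        for doc_id, tf in matches.items():
            # Length normalisation: longer passages need more hits to score well.
            norm = k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical passages, not taken from the real corpus.
passages = {
    "p1": "python is easier to learn than java",
    "p2": "java runs on the jvm",
    "p3": "the weather is nice today",
}
postings, doc_len = build_index(passages)
results = bm25_search("python or java", postings, doc_len)
print(results)
```

Only passages that share at least one term with the query receive a score, which is why the lookup stays cheap even on a large corpus. A dense index instead compares embedding vectors, so it can also match passages that share no words with the query.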

The main goal was to find models that perform well on both quality and relevance (the Touché team provided files to evaluate these metrics using nDCG). Rather than building separate models for each goal, we built compact models that aim to optimize both scores at once.
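As a reminder of how nDCG rewards a ranking, here is a small self-contained sketch. It is not the official Touché evaluation script: the cutoff `k=5`, the judgment dictionaries, and the passage ids are assumptions made up for the example.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, judgments, k=5):
    """nDCG@k of a ranking, given graded judgments {doc_id: grade}."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one topic; the same ranking is scored twice.
relevance = {"p1": 2, "p2": 0, "p3": 1}
quality = {"p1": 1, "p2": 2, "p3": 0}
ranking = ["p1", "p3", "p2"]

relevance_score = ndcg_at_k(ranking, relevance)
quality_score = ndcg_at_k(ranking, quality)
print(relevance_score, quality_score)
```

In this toy case the ranking is ideal for relevance but not for quality, which shows why a single pipeline has to balance the two scores.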

The details of the implemented pipelines are in the src/ directory, where the classes are located. Everything needed to reproduce the results is in the document_retrieval.ipynb notebook.

We suggest running the notebook in Colab, which makes it easier to perform the heaviest operations and to import files from Google Drive. The notebook and the report refer to a shared Drive folder, but it was only available to the professors so that they could test the project. However, the notebook contains all the instructions needed to reproduce our experiments.

2. Stance detection

In this task we had to classify each text into one of four classes:

  • NO, if the text doesn't express a stance;
  • NEUTRAL, if the text doesn't favour either of the two objects of the query;
  • FIRST, when the text favours the first object;
  • SECOND, when the text favours the second object.

Description of a basic pipeline

Implementation

Our first idea was to build a single model to classify all four classes. We imported a pre-trained version of DistilBERT from Hugging Face, added a classifier layer on top of it, and fine-tuned it on our data.

Unique model for stance detection

Unfortunately, this approach did not work very well, probably because of the small amount of data available for fine-tuning and the unbalanced classes.

To improve performance, we followed an idea from a paper and split the pipeline into two models:

  • The first model had to detect if the text was favouring an object or not.
  • The second model determined whether the text was favouring the first or the second object.
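The routing between the two models can be sketched as below. This is only an illustration: the README does not say how the first model separates NO from NEUTRAL, so the sketch assumes it returns one of three hypothetical labels ("NO", "NEUTRAL", "STANCE"). The keyword-based stand-in classifiers are made up and replace the fine-tuned models.

```python
def two_stage_predict(text, stage_one, stage_two):
    """Combine two classifiers into a single four-class prediction."""
    first = stage_one(text)
    if first != "STANCE":
        # The text does not favour either object: keep the stage-one label.
        return first
    # The text favours one object: ask the second model which one.
    return stage_two(text)


# Hypothetical stand-ins for the fine-tuned models (assumption).
def toy_stage_one(text):
    lowered = text.lower()
    if "better" in lowered:
        return "STANCE"
    if "both" in lowered:
        return "NEUTRAL"
    return "NO"


def toy_stage_two(text):
    # Pretend the first object of the query is "python".
    return "FIRST" if text.lower().startswith("python") else "SECOND"


examples = [
    "Python is better for beginners",
    "Java is better for large teams",
    "Both languages are fine choices",
    "I had pasta for lunch",
]
predictions = [two_stage_predict(t, toy_stage_one, toy_stage_two) for t in examples]
print(predictions)
```

One appeal of this split is that the second model only ever sees texts that take a side, so it can be trained on a narrower decision.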

System with two models

This system worked better. We also dealt with class imbalance by assigning different weights to the four classes.
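One common way to pick such weights is inverse class frequency, sketched below. The README does not state how the weights were actually chosen, so the formula, the function names, and the label counts here are assumptions for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


# Made-up, unbalanced label distribution (not the real dataset).
train_labels = ["NO"] * 5 + ["NEUTRAL"] * 2 + ["FIRST"] * 2 + ["SECOND"] * 1
weights = balanced_class_weights(train_labels)

# The same predicted probability costs more on a rare class than on a frequent one.
uniform = {"NO": 0.25, "NEUTRAL": 0.25, "FIRST": 0.25, "SECOND": 0.25}
loss_rare = weighted_cross_entropy(uniform, "SECOND", weights)
loss_frequent = weighted_cross_entropy(uniform, "NO", weights)
print(weights, loss_rare, loss_frequent)
```

With these weights, mistakes on rare classes contribute more to the loss, which pushes the model away from always predicting the majority class.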

You can find the whole implementation and a detailed explanation in the stance_detection.ipynb notebook.

Project structure

.
├── document_retrieval.ipynb -> Notebook to run the document retrieval task on our models.
│
├── stance_detection.ipynb   -> Notebook to run the stance detection models.
│
├── images/                  -> Directory that contains images for the README.
│
├── src/                     -> Directory that contains the classes for our pipelines and evaluation scripts.
│
├── utils/                   -> Directory that contains some files to manage the download of the files and other useful functions.
│
├── README.md
├── LICENSE
└── requirements.txt

Authors

The project has been implemented by:
