Benjamin Koh . Chew Jun Heng . Cheng Lin
This repository introduces a new training paradigm that first learns effective representations and then applies them to downstream tasks, such as classification and information retrieval.
- Install the required dependencies. This can be done in a venv or conda environment:

    pip install -r requirements.txt

- Install CuPy:

    # CUDA v11.2 ~ 11.8
    pip install cupy-cuda11x
    # CUDA v12.x
    pip install cupy-cuda12x

  More information can be found in the installation instructions for CuPy.
- Download the IMDB1M Dataset
- Run the preprocessing script. This downloads the remaining datasets and preprocesses them for the later tasks:

    python preprocess_data.py

We first need to train the model embeddings before we can test on the downstream tasks.
To train an embedding model, run the following command:

    python train_embeddings.py -c path/to/config.yaml

Pre-defined configs can be found in the embeddings folder, where bert and gte refer to the respective encoder backbones.
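To train one model per pre-defined config, the training command can be generated for each config file. The following is a minimal sketch and not part of the repository: the helper name `build_train_commands` is hypothetical, and it assumes the configs are `.yaml` files located directly inside a folder named `embeddings`. Adjust the folder and pattern to the actual layout.

```python
from pathlib import Path


def build_train_commands(config_dir="embeddings",
                         script="train_embeddings.py",
                         python="python"):
    """Return one training command (as an argv list) per YAML config.

    Assumption: configs are *.yaml files directly inside `config_dir`.
    """
    # Sort the paths so the commands come out in a reproducible order.
    configs = sorted(Path(config_dir).glob("*.yaml"))
    return [[python, script, "-c", str(cfg)] for cfg in configs]


# Example usage (launches the training runs one after another):
#   import subprocess
#   for cmd in build_train_commands():
#       subprocess.run(cmd, check=True)  # stops at the first failing run
```

Returning argv lists instead of shell strings keeps the commands safe to pass to `subprocess.run` even when a config path contains spaces.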
In the retrieval task, we attempt to retrieve documents written by the same author. To test a model on this task, run the following command:

    python ir.py -c path/to/config.yaml -m path/to/model.pt

In the closed-set classification task, we perform classification on a closed set of authors. To test a model on this task, run the following command:

    python closed_classification.py -c path/to/config.yaml -m path/to/model.pt

In the open-set classification task, we perform classification on an open set of authors, i.e. a zero-shot classification problem. To test a model on this task, run the following command:

    python open_classification.py -c path/to/config.yaml -m path/to/model.pt
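Since the three downstream scripts above take the same `-c` and `-m` arguments, the commands for evaluating one trained model on several tasks can be assembled in one place. The following is a minimal sketch and not part of the repository: the helper `build_eval_commands` and the task labels (`retrieval`, `closed_set`, `open_set`) are hypothetical, and whether the tasks can share a config file is not stated here, so each task gets its own config path.

```python
# Script names are taken from the commands above; the dictionary keys are
# illustrative labels only.
DOWNSTREAM_SCRIPTS = {
    "retrieval": "ir.py",
    "closed_set": "closed_classification.py",
    "open_set": "open_classification.py",
}


def build_eval_commands(model_path, configs, python="python"):
    """Build one evaluation command (argv list) per downstream task.

    `configs` maps a task label from DOWNSTREAM_SCRIPTS to its config path.
    Tasks without a config entry are skipped.
    """
    commands = {}
    for task, script in DOWNSTREAM_SCRIPTS.items():
        if task in configs:
            commands[task] = [python, script, "-c", configs[task], "-m", model_path]
    return commands


# Example usage (runs each configured evaluation in turn):
#   import subprocess
#   cmds = build_eval_commands("path/to/model.pt",
#                              {"retrieval": "path/to/config.yaml"})
#   for task, cmd in cmds.items():
#       subprocess.run(cmd, check=True)
```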