
B-enguin/authorship-representations


Embedding Representations for Efficient Authorship Attribution

Benjamin Koh . Chew Jun Heng . Cheng Lin


This repository introduces a training paradigm that first learns effective representations, which are then used for downstream tasks such as classification and information retrieval.
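A common objective for learning such representations is a contrastive/triplet loss, which pulls same-author embeddings together and pushes different-author embeddings apart. The sketch below is illustrative only; it is not necessarily the loss used in this repository.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-author embeddings together, push different-author ones apart.

    Loss is zero once the positive is closer than the negative by `margin`.
    (Illustrative; the repo's actual training objective may differ.)
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # same author, nearby
n = np.array([0.0, 1.0])   # different author, far away
print(triplet_loss(a, p, n))  # → 0.0 (positive already much closer)
```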

Installation

  1. Install the required dependencies (this can be done in a venv or conda environment):
pip install -r requirements.txt
  2. Install CuPy:
# CUDA v11.2 ~ 11.8
pip install cupy-cuda11x
# CUDA v12.x
pip install cupy-cuda12x

More information can be found in the installation instructions for CuPy.

  3. Download the IMDB1M Dataset.
  4. Run the preprocessing script; this will download the remaining datasets and preprocess them for later tasks:
python preprocess_data.py
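As a rough illustration of the kind of preprocessing involved in authorship tasks, the sketch below groups raw documents under their author IDs. The records and the helper are hypothetical; the actual logic in preprocess_data.py may differ.

```python
from collections import defaultdict

# Hypothetical (author_id, text) pairs standing in for rows from the
# downloaded datasets; the real preprocess_data.py may work differently.
records = [
    ("author_1", "A review of the film."),
    ("author_2", "Another piece of text."),
    ("author_1", "A second review by the same writer."),
]

def group_by_author(rows):
    """Group document texts under their author IDs."""
    grouped = defaultdict(list)
    for author, text in rows:
        grouped[author].append(text)
    return dict(grouped)

grouped = group_by_author(records)
print(len(grouped["author_1"]))  # → 2
```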

Running

We first need to train model embeddings before we can test on the downstream tasks.

Training Embeddings

To train an embedding model, run the following command:

python train_embeddings.py -c path/to/config.yaml

Pre-defined configs can be found in the embeddings folder, where bert and gte refer to the respective encoder backbones.
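The exact schema is defined by the configs shipped in the embeddings folder; as a purely hypothetical sketch of the kind of fields such a YAML config might contain (field names are assumptions, not taken from the repo):

```yaml
# Illustrative only -- field names are assumptions, not the repo's actual schema
model:
  backbone: bert        # or: gte
training:
  batch_size: 32
  epochs: 10
  lr: 2.0e-5
data:
  dataset_dir: data/
```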

Information Retrieval

In this task, we attempt to retrieve documents written by the same author. To test a model on this task, run the following command:

python ir.py -c path/to/config.yaml -m path/to/model.pt
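Conceptually, retrieval over learned embeddings amounts to ranking documents by cosine similarity to a query embedding. A minimal NumPy sketch of that idea (illustrative; not the repo's actual ir.py implementation):

```python
import numpy as np

def rank_by_cosine(query, docs):
    """Return document indices sorted by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = d @ q                 # cosine similarity per document
    return np.argsort(-sims)     # highest similarity first

# Toy embeddings: doc 0 points the same way as the query, doc 1 is orthogonal.
query = np.array([1.0, 0.0])
docs = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(rank_by_cosine(query, docs))  # → [0 2 1]
```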

Closed Set Classification

In this task, we perform classification on a closed set of authors. To test a model on this task, run the following command:

python closed_classification.py -c path/to/config.yaml -m path/to/model.pt
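One common way to classify over a closed author set with embeddings is nearest-centroid: average each known author's document embeddings, then assign a new document to the closest centroid. The sketch below is illustrative, not the repo's closed_classification.py.

```python
import numpy as np

def nearest_centroid(train_embs, train_labels, test_emb):
    """Assign test_emb to the author whose mean embedding is closest."""
    labels = sorted(set(train_labels))
    centroids = np.stack(
        [train_embs[np.array(train_labels) == a].mean(axis=0) for a in labels]
    )
    dists = np.linalg.norm(centroids - test_emb, axis=1)
    return labels[int(np.argmin(dists))]

# Toy data: two documents by "alice", one by "bob".
train_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
train_labels = ["alice", "alice", "bob"]
print(nearest_centroid(train_embs, train_labels, np.array([0.8, 0.0])))  # → alice
```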

Open Set / Zero-shot Classification

In this task, we perform classification on an open set of authors, or in a zero-shot classification setting. To test a model on this task, run the following command:

python open_classification.py -c path/to/config.yaml -m path/to/model.pt
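Open-set classification typically adds a rejection option: if a document's embedding is not similar enough to any known author, it is labeled unknown. A hedged sketch of that pattern (the threshold value and logic are illustrative, not taken from open_classification.py):

```python
import numpy as np

def open_set_classify(known_embs, known_authors, test_emb, threshold=0.8):
    """Return the most similar known author, or None if below the threshold."""
    k = known_embs / np.linalg.norm(known_embs, axis=1, keepdims=True)
    t = test_emb / np.linalg.norm(test_emb)
    sims = k @ t
    best = int(np.argmax(sims))
    return known_authors[best] if sims[best] >= threshold else None

known = np.array([[1.0, 0.0], [0.0, 1.0]])
authors = ["alice", "bob"]
print(open_set_classify(known, authors, np.array([0.99, 0.1])))  # → alice
print(open_set_classify(known, authors, np.array([1.0, 1.0])))   # → None (rejected)
```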
