Embedding Representations for Efficient Authorship Attribution

Benjamin Koh, Chew Jun Heng, Cheng Lin

This repository introduces a training paradigm that first learns effective representations, which are then used for downstream tasks such as classification and information retrieval.

Installation

Install the required dependencies. This can be done in a venv or conda environment:

pip install -r requirements.txt

Install CuPy, choosing the package that matches your CUDA version:

# CUDA v11.2 ~ 11.8
pip install cupy-cuda11x
# CUDA v12.x
pip install cupy-cuda12x

More information can be found in the installation instructions for CuPy.
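
The choice between the two packages depends only on the CUDA version. As a minimal sketch, the hypothetical helper below (`cupy_wheel_for` is not part of this repository) maps a CUDA version string to one of the two package names listed above. The version string itself has to be looked up separately, for example from the output of `nvcc --version`.

```python
def cupy_wheel_for(cuda_version: str) -> str:
    """Return the CuPy package name for a CUDA version string such as "11.8".

    Hypothetical helper for illustration; it only covers the two cases
    listed in this README (CUDA 11.2 to 11.8, and CUDA 12.x).
    """
    parts = cuda_version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0

    if major == 11 and 2 <= minor <= 8:
        return "cupy-cuda11x"
    if major == 12:
        return "cupy-cuda12x"
    raise ValueError(f"No CuPy package listed in this README for CUDA {cuda_version}")


if __name__ == "__main__":
    # Example: build the install command for a CUDA 12.1 machine.
    print("pip install", cupy_wheel_for("12.1"))
```

Versions outside the two listed ranges raise an error rather than guessing, since this README does not say which package they need.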

Download the IMDB1M dataset.
Run the preprocessing script. This downloads the remaining datasets and preprocesses them for the later tasks:

python preprocess_data.py

Running

Model embeddings must be trained first, before a model can be tested on the downstream tasks.

Training Embeddings

To train an embedding model, run the following command:

python train_embeddings.py -c path/to/config.yaml

Pre-defined configs can be found in the embeddings folder, where bert and gte refer to the respective encoder backbones.
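
To train one model per pre-defined config, the training command can be generated for every YAML file in a folder. The sketch below is a convenience wrapper, not part of the repository: `training_commands` and `run_all` are hypothetical names, and the config folder is passed in by the caller because its exact path depends on your checkout. Only the `-c` flag shown above is used.

```python
import subprocess
import sys
from pathlib import Path


def training_commands(config_dir):
    """Build one `train_embeddings.py -c <config>` command per YAML file.

    `config_dir` is supplied by the caller; point it at the folder that
    holds the pre-defined configs in your checkout.
    """
    configs = sorted(Path(config_dir).glob("*.yaml"))
    return [[sys.executable, "train_embeddings.py", "-c", str(cfg)] for cfg in configs]


def run_all(config_dir):
    """Run the training commands one after another, stopping on the first failure."""
    for cmd in training_commands(config_dir):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_all("path/to/configs")` from the repository root would then train the models sequentially. Building the commands separately from running them makes it easy to inspect what will be executed first.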

Information Retrieval

In this task, we attempt to retrieve documents written by the same author. To test a model on this task, run the following command:

python ir.py -c path/to/config.yaml -m path/to/model.pt
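
To illustrate what same-author retrieval over embeddings looks like, here is a minimal sketch that ranks documents by cosine similarity to a query embedding. It is an assumption-laden toy, not the logic of `ir.py`: the similarity measure, the function names `cosine_similarity` and `retrieve`, and the 2-d vectors are all chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, k=3):
    """Return the indices of the k corpus embeddings most similar to the query."""
    scored = [(cosine_similarity(query, doc), idx) for idx, doc in enumerate(corpus)]
    # Sort by descending similarity; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    # Toy 2-d embeddings with made-up author labels, for illustration only.
    corpus = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]]
    authors = ["A", "A", "B", "B"]
    query = [1.0, 0.0]  # embedding of a document assumed to be by author A
    top = retrieve(query, corpus, k=2)
    print([authors[i] for i in top])  # prints ['A', 'A']
```

If the embeddings separate authors well, the top-ranked documents for a query share its author, which is what a retrieval metric over these rankings would measure.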

Closed Set Classification

In this task, we perform classification on a closed set of authors. To test a model on this task, run the following command:

python closed_classification.py -c path/to/config.yaml -m path/to/model.pt

Open Set / Zero-shot Classification

In this task, we perform classification on an open set of authors, that is, a zero-shot classification problem. To test a model on this task, run the following command:

python open_classification.py -c path/to/config.yaml -m path/to/model.pt
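
One common way to classify authors that were not seen during training is to compare a document embedding against per-author prototypes built from a few reference documents. The sketch below shows that idea under stated assumptions; it is not necessarily how `open_classification.py` works. The names `author_prototypes` and `predict_author`, the cosine measure, and the optional rejection threshold are all hypothetical choices made for illustration.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def author_prototypes(reference_docs):
    """Average each author's reference embeddings into a single prototype.

    `reference_docs` maps an author name to a non-empty list of embeddings.
    """
    prototypes = {}
    for author, embeddings in reference_docs.items():
        dim = len(embeddings[0])
        prototypes[author] = [
            sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)
        ]
    return prototypes


def predict_author(embedding, prototypes, threshold=None):
    """Return the author with the most similar prototype.

    If `threshold` is given and the best similarity falls below it,
    return None to signal "none of the known authors".
    """
    best_author, best_score = None, -math.inf
    for author, proto in prototypes.items():
        score = _cosine(embedding, proto)
        if score > best_score:
            best_author, best_score = author, score
    if threshold is not None and best_score < threshold:
        return None
    return best_author


if __name__ == "__main__":
    # Toy reference embeddings for two authors, for illustration only.
    protos = author_prototypes({
        "A": [[1.0, 0.0], [0.8, 0.2]],
        "B": [[0.0, 1.0], [0.2, 0.8]],
    })
    print(predict_author([1.0, 0.1], protos))
    print(predict_author([-1.0, -1.0], protos, threshold=0.5))
```

Because no classifier weights are tied to specific authors, new authors can be added by supplying reference embeddings alone. With `threshold=None` the function always picks one of the known authors, which corresponds to the closed-set case.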

B-enguin/authorship-representations
