SynthDetectives@ALTA2023 Stacking the Odds: Transformer-based Ensemble for AI-generated Text Detection
Repository for the paper "Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text Detection"
Stacking ensemble of Transformers trained to detect AI-generated text for the ALTA Shared Task 2023.
Abstract: This paper reports our submission under the team name 'SynthDetectives' to the ALTA 2023 Shared Task. We use a stacking ensemble of Transformers for the task of AI-generated text detection. Our approach is novel in terms of its choice of models in that we use accessible and lightweight models in the ensemble. We show that ensembling the models results in an improved accuracy in comparison with using them individually. Our approach achieves an accuracy score of 0.9555 on the official test data provided by the shared task organisers.
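For readers new to stacking, the toy sketch below (hypothetical code, not taken from this repository) shows the core idea: each weak learner scores a text, and a logistic-regression meta-learner is trained on the stacked weak-learner outputs to make the final call.

```python
# Toy illustration of stacking (hypothetical, not the repository's actual code).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are per-text scores from the four weak learners
# (ALBERT, ELECTRA, RoBERTa, XLNet) on 1,000 training texts.
weak_outputs = rng.random((1000, 4))        # shape: (num_texts, num_weak_learners)
labels = rng.integers(0, 2, size=1000)      # 1 = AI-generated, 0 = human-written

# The meta-learner learns how much weight to give each weak learner.
meta_learner = LogisticRegression()
meta_learner.fit(weak_outputs, labels)

new_scores = rng.random((1, 4))             # weak-learner scores for a new text
print(meta_learner.predict(new_scores))     # final ensemble decision
```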
The repository is organised into two folders: `dataset` (the data) and `src` (the code).

The dataset is provided by the ALTA Shared Task 2023 on CodaLab and contains:
- `training.json` - 18k labelled training examples, evenly split between human-written and machine-generated text
- `validation_data.json` - 2k validation set without labels
- `validation_sample_output.json` - 2k dummy validation output for output-formatting reference
- `test_data.json` - 2k test set used for leaderboard scoring on CodaLab
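A quick way to inspect the data before training (a generic sketch; the field names inside each record are defined by the shared-task files themselves, so check the first record rather than assuming a schema):

```python
# Generic inspection sketch; assumes the JSON files are arrays of records.
import json

with open("dataset/training.json") as f:
    records = json.load(f)

print(len(records), "training records")
print(records[0])  # inspect the actual field names provided by the organisers
```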
The `src` folder contains:

- `helper.py` - helpers for the EDA and model files
- `model.py` - model architecture and data loading
- `eda.ipynb` - EDA notebook (all cells are preloaded)
- `build_embeddings.py` - build and save embeddings for each Transformer model on the training set
- `train_weak_learners.py` - train the weak learners on the embeddings
- `train_meta_learner.py` - train the meta-learner on the weak-learner predictions over the dataset embeddings
- `inference.py` - perform inference using the ensemble
Training was done with Python >= 3.8.10 on Google Cloud Platform's Vertex Colab (GCE usage) with an NVIDIA A100 (40 GB) GPU. It was also previously tested with a GeForce RTX 3060 on WSL2 Ubuntu. The configuration detailed below works out of the box on an NVIDIA A100 (40 GB); for less performant GPUs, `BATCH_SIZE` will need to be decreased. All adjustable parameters are recorded as constants at the top of the model files; specifically, you can change `BATCH_SIZE` and `NUM_EPOCH` in `train_weak_learners.py` and `train_meta_learner.py`.
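As an illustration only (the defaults shipped in the repository may differ), the constants you would adjust at the top of `train_weak_learners.py` look something like:

```python
# Hypothetical values; the repository's actual defaults may differ.
# Lower BATCH_SIZE first if you hit CUDA out-of-memory errors on a smaller GPU.
BATCH_SIZE = 32   # batch size used by the dataloaders
NUM_EPOCH = 10    # number of training epochs
```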
To train the ensemble:

- Run `pip install -r requirements.txt`.
- Ensure `training.json` exists in the `dataset` folder.
- Run `python build_embeddings.py` to build the `[CLS]` embedding of the last hidden layer for the dataset using all Transformers (ALBERT, ELECTRA, RoBERTa, XLNet); a hedged sketch of this kind of embedding extraction is shown after these steps. If your GPU is not great, reduce `load_batch_size` in `src/model.py:TransformerModel.dataset`. This will produce an embeddings `.pt` file, `pretrained--dev=False--model=MODEL.pt`, for each of the Transformer `MODEL` variants above.
- Run `python train_weak_learners.py` to train each of the Transformer weak learners using the previously produced embeddings. This will save the best weights for each weak learner at `lightning_logs/version_VERSION/checkpoints/model=MODEL--dev=False--epoch=EPOCH-step=STEP--val_loss=VAL_LOSS.ckpt`.
- Update the `checkpoints` array in `train_meta_learner.py` with the best weight path of each weak learner produced in the previous step. Note that the checkpoints have to follow this order: ALBERT, ELECTRA, RoBERTa, XLNet.
- Run `python train_meta_learner.py` to train the meta-learner logistic regression classifier using the best weights of the weak learners. This will save the best weight of the meta-learner.
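For reference, extracting a `[CLS]`-style embedding from the last hidden layer with the Hugging Face `transformers` library looks roughly like the sketch below; `build_embeddings.py` is the authoritative implementation, and details such as the backbone checkpoints, truncation length and which token position is pooled (XLNet, for instance, places its classification token last) may differ.

```python
# Hedged sketch of last-hidden-layer embedding extraction; build_embeddings.py is authoritative.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"  # assumption: one plausible backbone; the repository uses four variants
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

texts = ["An example sentence to embed."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# First token of the last hidden layer (RoBERTa's <s> token plays the [CLS] role).
cls_embeddings = outputs.last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # (batch_size, hidden_size)
```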
To run inference:

- Ensure `test_data.json` exists in the `dataset` folder.
- Ensure you have the weights for each of the weak learners and the meta-learner from the training step.
- Update the `checkpoints` array (for the weak learners) and `lr_checkpoint_path` (for the meta-learner) in `inference.py`.
- Run `python inference.py`. This will produce `answer.json`, which contains the inference output. A quick sanity check of the output is sketched below.
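Before uploading to CodaLab, a simple sanity check (not part of the repository's scripts, and assuming both files are JSON arrays with one entry per example; see `validation_sample_output.json` for the exact expected schema) is:

```python
# Sanity-check sketch: confirm answer.json parses and covers every test record.
import json

with open("dataset/test_data.json") as f:
    test_records = json.load(f)
with open("answer.json") as f:
    answers = json.load(f)

assert len(answers) == len(test_records), "expected one prediction per test record"
print("answer.json contains", len(answers), "predictions")
```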
If our work is useful for your own, you can cite us with the following BibTeX entry:
@misc{nguyen2023stacking,
title={Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text Detection},
author={Duke Nguyen and Khaing Myat Noe Naing and Aditya Joshi},
year={2023},
eprint={2310.18906},
archivePrefix={arXiv},
primaryClass={cs.CL}
}