SynthDetectives@ALTA2023 Stacking the Odds: Transformer-based Ensemble for AI-generated Text Detection

Stacking ensemble of Transformers trained to detect AI-generated text for the ALTA Shared Task 2023.

Abstract: This paper reports our submission under the team name 'SynthDetectives' to the ALTA 2023 Shared Task. We use a stacking ensemble of Transformers for the task of AI-generated text detection. Our approach is novel in terms of its choice of models in that we use accessible and lightweight models in the ensemble. We show that ensembling the models results in an improved accuracy in comparison with using them individually. Our approach achieves an accuracy score of 0.9555 on the official test data provided by the shared task organisers.
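The stacking setup, at a high level: each weak learner scores a text from its own Transformer's [CLS] embedding, and a logistic regression meta-learner combines the four outputs into the final prediction. Below is a minimal PyTorch sketch of that idea; the class names, dimensions, and plain linear heads are illustrative assumptions, not the repository's actual implementation.

    # Minimal sketch of the stacking idea only; names and sizes are illustrative.
    import torch
    import torch.nn as nn

    class WeakLearner(nn.Module):
        """Classification head over one Transformer's [CLS] embedding."""
        def __init__(self, hidden_size=768, num_classes=2):
            super().__init__()
            self.head = nn.Linear(hidden_size, num_classes)

        def forward(self, cls_embedding):
            return self.head(cls_embedding)  # (batch, num_classes) logits

    class MetaLearner(nn.Module):
        """Logistic-regression-style layer over the concatenated weak-learner outputs."""
        def __init__(self, num_learners=4, num_classes=2):
            super().__init__()
            self.lr = nn.Linear(num_learners * num_classes, num_classes)

        def forward(self, weak_outputs):
            stacked = torch.cat(weak_outputs, dim=-1)  # concatenate per-learner logits
            return self.lr(stacked)

    # One batch of [CLS] embeddings from ALBERT, ELECTRA, RoBERTa and XLNet.
    embeddings = [torch.randn(8, 768) for _ in range(4)]
    weak_learners = [WeakLearner() for _ in range(4)]
    meta = MetaLearner()
    logits = meta([wl(e) for wl, e in zip(weak_learners, embeddings)])
    print(logits.shape)  # torch.Size([8, 2])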

Directory structure

  • dataset: the ALTA 2023 Shared Task data files
  • src: source code for building embeddings, training, and inference

Dataset

The dataset is provided by the ALTA Shared Task 2023 on CodaLab.

Software

  • helper.py - helpers for EDA and model files
  • model.py - model architecture and dataloading
  • eda.ipynb - EDA notebook (all cells come pre-executed with their outputs)

System Requirements

Training was done with Python >= 3.8.10 on Google Cloud Platform (Vertex Colab on GCE) with an NVIDIA A100 (40 GB) GPU; it was also previously tested on a GeForce RTX 3060 under WSL2 Ubuntu. The configurations detailed below work out of the box on an NVIDIA A100 (40 GB); for less capable GPUs, BATCH_SIZE will need to be decreased. All adjustable parameters are recorded as constants at the top of the model files (sketched below); specifically, you can change BATCH_SIZE and NUM_EPOCH in train_weak_learners.py and train_meta_learner.py.
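For orientation, the tunable constants at the top of the training scripts look roughly like this (the default values shown are illustrative assumptions; check the files themselves for the values used in the paper):

    # Top of train_weak_learners.py / train_meta_learner.py (illustrative values).
    BATCH_SIZE = 32  # fits an NVIDIA A100 (40 GB); reduce on smaller GPUs
    NUM_EPOCH = 10   # number of training epochs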

Installation

  • Run pip install -r requirements.txt

Training

  • Ensure training.json exists in the dataset folder.
  • Run python build_embeddings.py to build the [CLS] embedding of the last hidden layer for every example in the dataset using all four Transformers (ALBERT, ELECTRA, RoBERTa, XLNet); a minimal sketch of this step is shown after this list. If your GPU has limited memory, reduce load_batch_size in src/model.py:TransformerModel.dataset. This produces an embeddings .pt file, pretrained--dev=False--model=MODEL.pt, for each of the Transformer MODEL variants above.
  • Run python train_weak_learners.py to train each Transformer weak learner on the previously produced embeddings. This saves the best weights for each weak learner at lightning_logs/version_VERSION/checkpoints/model=MODEL--dev=False--epoch=EPOCH-step=STEP--val_loss=VAL_LOSS.ckpt.
  • Update the checkpoints array in train_meta_learner.py with the best-weight path of each weak learner produced in the previous step. Note that the checkpoints must follow this order: ALBERT, ELECTRA, RoBERTa, XLNet.
  • Run python train_meta_learner.py to train the Logistic Regression meta-learner on top of the weak learners' best weights. This saves the best weights of the meta-learner.
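As referenced in the first step above, the core of build_embeddings.py is taking the [CLS] token of the last hidden layer from each Transformer. A minimal Hugging Face transformers sketch of that single step follows; the checkpoint name, example texts, and output filename are placeholders, and the repository's src/model.py handles this for all four models and the full dataset.

    import torch
    from transformers import AutoModel, AutoTokenizer

    checkpoint = "roberta-base"  # placeholder; the ensemble also uses ALBERT, ELECTRA, XLNet
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint).eval()

    texts = ["An example document.", "Another example document."]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**batch)

    # [CLS] embedding = first token of the last hidden layer, one vector per text.
    cls_embeddings = outputs.last_hidden_state[:, 0, :]  # shape: (len(texts), hidden_size)
    torch.save(cls_embeddings, "example_embeddings.pt")  # the real script names files pretrained--dev=False--model=MODEL.pt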

Inference

  • Ensure test_data.json exists in the dataset folder.
  • Ensure you have the weights for each of the weak learners and the meta-learner from the training step.
  • Update the checkpoints array (for the weak learners) and lr_checkpoint_path (for the meta-learner) in inference.py; see the sketch after this list.
  • Run python inference.py. This produces answer.json, which contains the inference output.
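The paths to edit in inference.py look roughly like the following; the paths are placeholders for the checkpoint files produced during training, and the weak-learner list presumably follows the same ALBERT, ELECTRA, RoBERTa, XLNet order as in train_meta_learner.py.

    # In inference.py (placeholder paths; substitute your own checkpoint files).
    checkpoints = [
        "lightning_logs/version_0/checkpoints/model=albert--dev=False--....ckpt",
        "lightning_logs/version_1/checkpoints/model=electra--dev=False--....ckpt",
        "lightning_logs/version_2/checkpoints/model=roberta--dev=False--....ckpt",
        "lightning_logs/version_3/checkpoints/model=xlnet--dev=False--....ckpt",
    ]
    lr_checkpoint_path = "path/to/best_meta_learner.ckpt"  # best meta-learner weights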

Authors

License

Citation

If our work is useful for your own, you can cite us with the following BibTeX entry:

@misc{nguyen2023stacking,
      title={Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text Detection}, 
      author={Duke Nguyen and Khaing Myat Noe Naing and Aditya Joshi},
      year={2023},
      eprint={2310.18906},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
