
Detect-AI-Generated-Text

A Top 5% (Silver Medal) Solution to a Kaggle Competition

Competition Overview

In recent years, Large Language Models (LLMs) have matured rapidly, making the text they generate increasingly difficult to distinguish from human writing. The competition required participants to develop a machine learning model capable of accurately detecting whether an essay was written by a student or generated by an LLM. The competition dataset included essays written by students and texts generated by various LLMs. This was a typical binary classification problem, with AUC as the evaluation metric.

Evaluation Metric

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

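Because AUC depends only on the ordering of predictions, it can be computed directly from rank statistics. A minimal pure-Python sketch (the helper name `roc_auc` is ours, not from the solution code):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # average 1-based ranks over groups of tied scores
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos_ranks = [ranks[i] for i in range(n) if labels[i] == 1]
    n_pos = len(pos_ranks)
    n_neg = n - n_pos
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.75
```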

Dataset Usage

Algorithm Descriptions

  • First, a linear model trained on an argumentative-essay dataset (the DAIGT V2 Train Dataset) whose distribution is similar to the competition data:

    • a) Filtering out near-duplicate data based on similarity;
    • b) Pre-training a tokenizer on the test-set texts, then tokenizing both training and test texts so that statistical features share a consistent vocabulary;
    • c) After tokenization, using TF-IDF to obtain n-gram (3, 5) text statistical feature vectors;
    • d) Feeding the above features into an ensemble classifier composed of MultinomialNB and SGDClassifier for training, then predicting the results.
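Steps c) and d) can be sketched with scikit-learn. This is a minimal illustration, not the competition code: the toy corpus and variable names are ours, and the real pipeline tokenized with a custom-trained tokenizer first (here `TfidfVectorizer`'s default word analyzer stands in).

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy corpus; the real model trained on DAIGT V2.
train_texts = [
    "my summer vacation was fun because we went to the lake with my family",
    "students often write essays with small spelling mistakes and personal stories",
    "in conclusion the aforementioned considerations demonstrate the multifaceted nature of the issue",
    "furthermore it is imperative to acknowledge the nuanced implications of this topic",
]
train_labels = [0, 0, 1, 1]  # 0 = human-written, 1 = LLM-generated

model = make_pipeline(
    # n-gram range (3, 5) as in step c)
    TfidfVectorizer(ngram_range=(3, 5), sublinear_tf=True),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            # modified_huber loss gives SGDClassifier a predict_proba method,
            # which soft voting requires
            ("sgd", SGDClassifier(loss="modified_huber", random_state=0)),
        ],
        voting="soft",
    ),
)
model.fit(train_texts, train_labels)
probs = model.predict_proba(train_texts)[:, 1]  # P(LLM-generated)
```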
  • Second, an LLM fine-tuned on large datasets (Pile and Ultra, Human vs. LLM Text Corpus):

    • a) Collecting open-source data from the internet, drawn from both human writing and LLM dialogues;
    • b) After light preprocessing of the large-scale data, fine-tuning the binary text classifier deberta-v3-small on it, saving the trained weights, and then running inference on Kaggle.
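The fine-tuning step can be sketched with the Hugging Face `transformers` Trainer API. This is an assumed reconstruction, not the solution's actual training script: the function names, hyperparameters, and checkpoint directory are illustrative, and the heavy imports sit inside the function so the sketch loads without `transformers` installed.

```python
def label_corpus(human_texts, llm_texts):
    """Pair raw texts with binary labels: 0 = human-written, 1 = LLM-generated."""
    return [(t, 0) for t in human_texts] + [(t, 1) for t in llm_texts]

def finetune(pairs, model_name="microsoft/deberta-v3-small", out_dir="ckpt"):
    # requires: pip install transformers datasets torch
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained(model_name)
    ds = Dataset.from_dict({"text": [t for t, _ in pairs],
                            "label": [y for _, y in pairs]})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)
    args = TrainingArguments(output_dir=out_dir,
                             num_train_epochs=1,          # illustrative values
                             per_device_train_batch_size=8,
                             learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=ds).train()
    # save weights and tokenizer for offline inference on Kaggle
    model.save_pretrained(out_dir)
    tok.save_pretrained(out_dir)
```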
  • Third, open-source language models (open-source results from other participants, mainly used for the ensemble): fine-tuning language models on third-party datasets, then predicting the results on Kaggle.

  • Fourth, ensemble prediction: the predictions of the three modeling approaches are rank-scaled and then fused with a weighted average (weights tuned according to the leaderboard) to obtain the final prediction.
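The rank-scaled weighted fusion can be sketched in pure Python. Rank scaling maps each model's raw scores to tie-averaged ranks in (0, 1), making differently calibrated models comparable before weighting; the helper names and weights below are illustrative, not from the solution code.

```python
def rank_scale(scores):
    """Map scores to tie-averaged ranks, normalized into (0, 1)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # average 1-based ranks over groups of tied scores
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / (n + 1) for r in ranks]

def blend(model_preds, weights):
    """Weighted average of rank-scaled predictions from several models."""
    scaled = [rank_scale(p) for p in model_preds]
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, scaled)) / total
        for i in range(len(model_preds[0]))
    ]

# Two models on different score scales but with the same ordering
# blend to identical rank-scaled values.
final = blend([[0.2, 0.9, 0.5], [10.0, 30.0, 20.0]], weights=[2.0, 1.0])
# → [0.25, 0.75, 0.5]
```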