
Detect-AI-Generated-Text

A Top 5% (Silver Medal) Solution to a Kaggle Competition

Competition Overview

In recent years, Large Language Models (LLMs) have matured rapidly, making the text they generate increasingly difficult to distinguish from human writing. The competition required participants to develop a machine learning model capable of accurately detecting whether an essay was written by a student or generated by an LLM. The competition dataset included essays written by students and texts generated by various LLMs. This was a typical binary classification problem, with AUC as the evaluation metric.

Evaluation Metric

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

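Because AUC depends only on the ordering of predictions, it can be computed directly from rank statistics. A minimal pure-Python sketch (the helper name `roc_auc` is ours, not from the solution code):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # average 1-based ranks over groups of tied scores
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos_ranks = [ranks[i] for i in range(n) if labels[i] == 1]
    n_pos = len(pos_ranks)
    n_neg = n - n_pos
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.75
```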

Dataset Usage

Algorithm Descriptions

  • First, a linear model trained on an argumentative-essay dataset (the DAIGT V2 Train Dataset) whose distribution is similar to the competition data:

    • a) Filtering out near-duplicate data based on similarity;
    • b) Pre-training a tokenizer on the test-set texts, then tokenizing both training and test texts so that statistical features share a consistent vocabulary;
    • c) After tokenization, using TF-IDF to obtain n-gram (3, 5) text statistical feature vectors;
    • d) Feeding the above features into an ensemble classifier composed of MultinomialNB and SGDClassifier for training, then predicting the results.
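Steps c) and d) can be sketched with scikit-learn. This is a minimal illustration, not the competition code: the toy corpus and variable names are ours, and the real pipeline tokenized with a custom-trained tokenizer first (here `TfidfVectorizer`'s default word analyzer stands in).

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy corpus; the real model trained on DAIGT V2.
train_texts = [
    "my summer vacation was fun because we went to the lake with my family",
    "students often write essays with small spelling mistakes and personal stories",
    "in conclusion the aforementioned considerations demonstrate the multifaceted nature of the issue",
    "furthermore it is imperative to acknowledge the nuanced implications of this topic",
]
train_labels = [0, 0, 1, 1]  # 0 = human-written, 1 = LLM-generated

model = make_pipeline(
    # n-gram range (3, 5) as in step c)
    TfidfVectorizer(ngram_range=(3, 5), sublinear_tf=True),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            # modified_huber loss gives SGDClassifier a predict_proba method,
            # which soft voting requires
            ("sgd", SGDClassifier(loss="modified_huber", random_state=0)),
        ],
        voting="soft",
    ),
)
model.fit(train_texts, train_labels)
probs = model.predict_proba(train_texts)[:, 1]  # P(LLM-generated)
```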
  • Second, an LLM fine-tuned on large datasets (Pile and Ultra, Human vs. LLM Text Corpus):

    • a) Collecting open-source data from the internet, drawn from both human writing and LLM dialogues;
    • b) After light preprocessing of the large-scale data, fine-tuning the binary text classifier deberta-v3-small on it, saving the trained weights, and then running inference on Kaggle.
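The fine-tuning step can be sketched with the Hugging Face `transformers` Trainer API. This is an assumed reconstruction, not the solution's actual training script: the function names, hyperparameters, and checkpoint directory are illustrative, and the heavy imports sit inside the function so the sketch loads without `transformers` installed.

```python
def label_corpus(human_texts, llm_texts):
    """Pair raw texts with binary labels: 0 = human-written, 1 = LLM-generated."""
    return [(t, 0) for t in human_texts] + [(t, 1) for t in llm_texts]

def finetune(pairs, model_name="microsoft/deberta-v3-small", out_dir="ckpt"):
    # requires: pip install transformers datasets torch
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained(model_name)
    ds = Dataset.from_dict({"text": [t for t, _ in pairs],
                            "label": [y for _, y in pairs]})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)
    args = TrainingArguments(output_dir=out_dir,
                             num_train_epochs=1,          # illustrative values
                             per_device_train_batch_size=8,
                             learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=ds).train()
    # save weights and tokenizer for offline inference on Kaggle
    model.save_pretrained(out_dir)
    tok.save_pretrained(out_dir)
```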
  • Third, open-source language models (open-source results from other participants, mainly used for the ensemble): fine-tuning language models on third-party datasets, then predicting the results on Kaggle.

  • Fourth, ensemble prediction: the predictions of the three modeling approaches are rank-scaled and then fused with a weighted average (weights tuned according to the leaderboard) to obtain the final prediction.
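The rank-scaled weighted fusion can be sketched in pure Python. Rank scaling maps each model's raw scores to tie-averaged ranks in (0, 1), making differently calibrated models comparable before weighting; the helper names and weights below are illustrative, not from the solution code.

```python
def rank_scale(scores):
    """Map scores to tie-averaged ranks, normalized into (0, 1)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # average 1-based ranks over groups of tied scores
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / (n + 1) for r in ranks]

def blend(model_preds, weights):
    """Weighted average of rank-scaled predictions from several models."""
    scaled = [rank_scale(p) for p in model_preds]
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, scaled)) / total
        for i in range(len(model_preds[0]))
    ]

# Two models on different score scales but with the same ordering
# blend to identical rank-scaled values.
final = blend([[0.2, 0.9, 0.5], [10.0, 30.0, 20.0]], weights=[2.0, 1.0])
# → [0.25, 0.75, 0.5]
```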