This project detects AI-generated text in essays using deep learning models. It is built on Keras Core with TensorFlow and JAX backends, and uses pre-trained models such as DeBERTaV3 for text classification.
To set up the environment and install the necessary libraries, run:
!pip install -q keras_nlp==0.6.3 keras-core==0.1.7

The project uses the following libraries:
import os
import keras_nlp
import keras_core as keras
import keras_core.backend as K
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.model_selection import StratifiedKFold

The configuration class CFG defines the main parameters:
- Verbose: verbosity level for the output
- Device: whether to use TPU, GPU, or CPU
- Epochs: number of training epochs
- Batch Size: size of each training batch
- Learning Rate Scheduler: cosine
- Class Names: labels for the classes (e.g., "real" and "fake")
- Seed: random seed for reproducibility
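A minimal sketch of such a configuration class (every value below is an illustrative assumption, not necessarily the project's actual setting):

class CFG:
    verbose = 0                      # verbosity level
    device = "TPU"                   # overwritten later by get_device()
    seed = 42                        # random seed for reproducibility
    preset = "deberta_v3_base_en"    # assumed KerasNLP preset name
    sequence_length = 200            # max tokens per essay (assumed)
    epochs = 3
    batch_size = 32
    scheduler = "cosine"             # learning-rate schedule
    class_names = ["real", "fake"]
    num_classes = len(class_names)
    num_folds = 5                    # folds for StratifiedKFold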
Set the random seed for consistent results:
keras.utils.set_random_seed(CFG.seed)

The hardware (TPU, GPU, or CPU) is detected and initialized automatically:
def get_device():
    # Code to detect TPU, GPU, or CPU
    ...
strategy, CFG.device = get_device()
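One common way to implement this detection in TensorFlow (a sketch of the usual pattern, not necessarily the notebook's exact code):

def get_device():
    try:
        # Try to connect to a TPU cluster first
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
        return tf.distribute.TPUStrategy(tpu), "TPU"
    except ValueError:
        if tf.config.list_physical_devices("GPU"):
            # Mirror the model across all available GPUs
            return tf.distribute.MirroredStrategy(), "GPU"
        # Fall back to the default single-device strategy
        return tf.distribute.get_strategy(), "CPU"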
The dataset contains essays labeled as either AI-generated ("fake") or human-written ("real"), split into train and test sets; external datasets are also used to enhance the model's performance. The data is divided into stratified folds using StratifiedKFold for cross-validation:
skf = StratifiedKFold(n_splits=CFG.num_folds, shuffle=True, random_state=CFG.seed)
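For example, each row of the training dataframe can be tagged with its validation fold (df and its "label" column are assumed names here):

df = df.reset_index(drop=True)
df["fold"] = -1
for fold, (_, valid_idx) in enumerate(skf.split(df, df["label"])):
    df.loc[valid_idx, "fold"] = fold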
The preprocessing step tokenizes and converts the raw text into a format suitable for model input. It uses the DebertaV3Preprocessor from KerasNLP:

preprocessor = keras_nlp.models.DebertaV3Preprocessor.from_preset(
preset=CFG.preset,
sequence_length=CFG.sequence_length,
)
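Calling the preprocessor on raw text produces the tensors the DeBERTaV3 backbone expects (the sample sentence is illustrative):

outs = preprocessor("An essay about deep learning.")
# outs should be a dict with "token_ids" and "padding_mask",
# each of length CFG.sequence_length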
The data is loaded and processed using TensorFlow's tf.data.Dataset API, which provides efficient data handling and pipeline construction.

def build_dataset(texts, labels=None, batch_size=32, cache=False, drop_remainder=True, repeat=False, shuffle=1024):
    AUTO = tf.data.AUTOTUNE
    # Code to build the dataset pipeline
    ...
The class distribution of the dataset is visualized using Matplotlib:

plt.figure(figsize=(8, 4))
cmap = plt.get_cmap("coolwarm")  # colormap for the bars; the palette choice is an assumption
df.name.value_counts().plot.bar(color=[cmap(0.0), cmap(0.25), cmap(0.65), cmap(0.9), cmap(1.0)])
plt.xlabel("Class")
plt.ylabel("Count")
plt.title("Class distribution for Train Data")
plt.show()

The model is trained on the preprocessed data, and the training process is logged with Weights & Biases (WandB) for tracking and visualization.
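A sketch of the training step under these assumptions (the learning rate is illustrative; train_ds and valid_ds would come from build_dataset, and a WandB callback could be passed to fit for logging):

with strategy.scope():
    # Classifier head on the pre-trained DeBERTaV3 backbone; preprocessing
    # already happens inside the tf.data pipeline, hence preprocessor=None
    model = keras_nlp.models.DebertaV3Classifier.from_preset(
        CFG.preset,
        preprocessor=None,
        num_classes=1,
    )
    model.compile(
        optimizer=keras.optimizers.Adam(5e-6),
        loss=keras.losses.BinaryCrossentropy(from_logits=True),
        metrics=[keras.metrics.AUC(name="auc")],
    )

history = model.fit(train_ds, validation_data=valid_ds, epochs=CFG.epochs)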
The trained model is used to make predictions on the test set. The results are then compared to the ground truth for evaluation.
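For instance (test_ds comes from build_dataset without labels; the submission column names follow the usual Kaggle format and are assumptions):

logits = model.predict(test_ds, verbose=0)
probs = tf.sigmoid(logits).numpy().squeeze()  # logits -> probabilities

submission = pd.DataFrame({"id": test_df["id"], "generated": probs})
submission.to_csv("submission.csv", index=False)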
Acknowledgments:
- TensorFlow and Keras for providing powerful deep learning tools
- Kaggle for the dataset
- Weights & Biases for experiment tracking