LLM-Detect-AI-Generated-Text

This project detects AI-generated text in essays using deep learning. It is built on Keras Core, which can run on TensorFlow or JAX backends, and fine-tunes pre-trained models such as DeBERTaV3 (via KerasNLP) for text classification.

🛠 Installation

To set up the environment and install the necessary libraries, run:

!pip install -q keras_nlp==0.6.3 keras-core==0.1.7

📚 Import Libraries

The project uses the following libraries:

import os
import keras_nlp
import keras_core as keras
import keras_core.backend as K
import tensorflow as tf
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.model_selection import StratifiedKFold
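
Keras Core selects its backend through the KERAS_BACKEND environment variable, which must be set before keras_core is imported. A minimal example (the choice of "jax" here is illustrative):

# Set before importing keras_core; "tensorflow", "jax", and "torch" are supported values.
os.environ["KERAS_BACKEND"] = "jax"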

⚙️ Configuration

The configuration class CFG is used to define various parameters such as:

- Verbose: verbosity level for the output
- Device: whether to use TPU, GPU, or CPU
- Epochs: number of training epochs
- Batch Size: size of each training batch
- Learning Rate Scheduler: cosine schedule
- Class Names: labels for the classes (e.g., "real" and "fake")
- Seed: random seed for reproducibility
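
A minimal sketch of such a class; the values are illustrative, and fields like preset and sequence_length are assumptions inferred from the snippets below:

class CFG:
    verbose = 0                      # verbosity level for the output
    device = "TPU"                   # overwritten by get_device() below
    seed = 42                        # random seed for reproducibility
    num_folds = 5                    # folds for StratifiedKFold
    preset = "deberta_v3_base_en"    # KerasNLP preset name (assumed)
    sequence_length = 200            # tokens per essay (assumed)
    epochs = 3
    batch_size = 32
    scheduler = "cosine"             # learning rate scheduler
    class_names = ["real", "fake"]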

♻️ Reproducibility

Set the random seed for consistent results:

keras.utils.set_random_seed(CFG.seed)

💾 Hardware Detection

Automatically detect and initialize hardware (TPU, GPU, or CPU); a minimal sketch of the detection logic:

def get_device():
    # Try to connect to a TPU; otherwise fall back to the default (GPU/CPU) strategy.
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
        return tf.distribute.TPUStrategy(tpu), "TPU"
    except ValueError:  # no TPU found
        return tf.distribute.get_strategy(), "GPU" if tf.config.list_logical_devices("GPU") else "CPU"
strategy, CFG.device = get_device()

📁 Dataset

The dataset contains essays labeled as either AI-generated (fake) or human-written (real). The data is divided into train and test sets. Additionally, external datasets are used to enhance the model's performance.

🔪 Data Split

The data is divided into stratified folds using StratifiedKFold for cross-validation.

skf = StratifiedKFold(n_splits=CFG.num_folds, shuffle=True, random_state=CFG.seed)
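
A hedged sketch of recording the fold assignments on the training dataframe (the "label" and "fold" column names are assumptions):

df = df.reset_index(drop=True)
df["fold"] = -1
for fold, (_, valid_idx) in enumerate(skf.split(df, df["label"])):
    df.loc[valid_idx, "fold"] = fold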

🍽️ Preprocessing

The preprocessing step tokenizes and converts the raw text into a format suitable for model input. It uses the DebertaV3Preprocessor from KerasNLP.

preprocessor = keras_nlp.models.DebertaV3Preprocessor.from_preset(
    preset=CFG.preset, 
    sequence_length=CFG.sequence_length,
)
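
Calling the preprocessor on a raw string returns the tensors the model expects; in KerasNLP 0.6.x the output is a dict with "token_ids" and "padding_mask":

# Example: tokenize one essay (the sample text is illustrative)
outputs = preprocessor("This essay was written by a human.")
print(outputs["token_ids"].shape)  # (sequence_length,)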

🍚 DataLoader

The data is loaded and processed using TensorFlow's tf.data.Dataset API, which provides efficient data handling and pipeline construction. A sketch of the pipeline (the exact order of the cache, shuffle, and batch steps is an assumption):

def build_dataset(texts, labels=None, batch_size=32, cache=False, drop_remainder=True, repeat=False, shuffle=1024):
    AUTO = tf.data.AUTOTUNE
    ds = tf.data.Dataset.from_tensor_slices((texts,) if labels is None else (texts, labels))
    ds = ds.map(preprocessor, num_parallel_calls=AUTO)  # tokenize with the DeBERTaV3 preprocessor
    ds = ds.cache() if cache else ds
    ds = ds.repeat() if repeat else ds
    ds = ds.shuffle(shuffle, seed=CFG.seed) if shuffle else ds
    return ds.batch(batch_size, drop_remainder=drop_remainder).prefetch(AUTO)
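
Example usage, assuming the "text", "label", and "fold" columns from the split above:

train_df = df[df["fold"] != 0]  # train on all folds but one (illustrative)
train_ds = build_dataset(train_df["text"].tolist(), train_df["label"].tolist(),
                         batch_size=CFG.batch_size, shuffle=1024, repeat=True)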

📊 Visualization

The class distribution of the dataset is visualized using Matplotlib.

cmap = plt.get_cmap("coolwarm")  # the original colormap choice isn't shown; any Matplotlib colormap works
plt.figure(figsize=(8, 4))
df["name"].value_counts().plot.bar(color=[cmap(0.0), cmap(0.25), cmap(0.65), cmap(0.9), cmap(1.0)])
plt.xlabel("Class")
plt.ylabel("Count")
plt.title("Class distribution for Train Data")
plt.show()

📝 Training

The model is trained using the preprocessed data, and the training process is logged using Weights & Biases (WandB) for better tracking and visualization.
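
A minimal training sketch (train_ds and valid_ds are assumed to come from build_dataset above; the optimizer settings and the WandB project name are illustrative):

import wandb
from wandb.keras import WandbMetricsLogger

with strategy.scope():
    classifier = keras_nlp.models.DebertaV3Classifier.from_preset(
        CFG.preset,
        preprocessor=None,  # tokenization already happens in the tf.data pipeline
        num_classes=2,
    )
    classifier.compile(
        optimizer=keras.optimizers.Adam(5e-6),  # learning rate is an assumption
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

wandb.init(project="llm-detect-ai-text")  # illustrative project name
classifier.fit(train_ds, validation_data=valid_ds, epochs=CFG.epochs,
               callbacks=[WandbMetricsLogger()])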

🤖 Inference

The trained model is used to make predictions on the test set. The results are then compared to the ground truth for evaluation.
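
A hedged sketch of producing predictions in the Kaggle submission format (test_df and the "id"/"generated" column names are assumptions based on the competition):

test_ds = build_dataset(test_df["text"].tolist(), batch_size=CFG.batch_size,
                        drop_remainder=False, shuffle=False)
logits = classifier.predict(test_ds)
probs = tf.nn.softmax(logits, axis=-1).numpy()[:, 1]  # probability of the "fake" class
sub = pd.DataFrame({"id": test_df["id"], "generated": probs})
sub.to_csv("submission.csv", index=False)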

🙌 Acknowledgements

- TensorFlow and Keras for providing powerful deep learning tools
- Kaggle for the dataset
- Weights & Biases for experiment tracking
