This project is a web-based application that predicts whether a piece of text was generated by AI or written by a human. Users enter text, choose one of two models (unigram or bigram), and receive a prediction. Both models are machine learning classifiers that work on TF-IDF features extracted from the words of the text.
• Unigram: A unigram is a single word, and unigram models analyze text by considering each word individually. For example, in the sentence "AI is powerful," the unigrams are "AI," "is," and "powerful." Unigram models capture word frequency, but they don't account for word order or relationships between words.
• Bigram: A bigram consists of two consecutive words. A bigram model analyzes text by considering pairs of adjacent words. For the same sentence "AI is powerful," the bigrams are "AI is" and "is powerful." Because bigrams preserve local word order, a bigram model captures more context than a unigram model, which can improve accuracy on tasks such as AI text detection.
• Unigram Model: Effective when individual word frequencies are important and sufficient for classification. It's a simpler model that performs well in tasks where word context isn't as crucial. In this project, the unigram model is stored in the uni model pkl folder and utilizes both Logistic Regression and Naive Bayes models for AI text detection.
• Bigram Model: Captures relationships between words, providing more context. This model is especially useful for complex text analysis, where understanding the pairing of words improves prediction accuracy. The bigram model is stored in the bi model pkl folder and combines LightGBM and Random Forest models for more nuanced AI text detection.
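To make the unigram/bigram distinction concrete, here is a minimal sketch in plain Python. The `ngrams` helper and its whitespace tokenization are assumptions made for illustration; they are not taken from this project's code.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-grams of a text as strings, using naive whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "AI is powerful"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
print(unigrams)  # ['AI', 'is', 'powerful']
print(bigrams)   # ['AI is', 'is powerful']

# Unigram counts ignore word order, so these two sentences look identical...
same_unigrams = Counter(ngrams("dog bites man", 1)) == Counter(ngrams("man bites dog", 1))
# ...while their bigram counts differ, because bigrams keep adjacent-word order.
same_bigrams = Counter(ngrams("dog bites man", 2)) == Counter(ngrams("man bites dog", 2))
print(same_unigrams, same_bigrams)  # True False
```

The last two lines show why a unigram model cannot tell reordered sentences apart while a bigram model can.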
The original code for training these models can be found in my other repository: AI-Text-Detector-Model. In this repository, the models have been converted into Python scripts and serialized into pickle files for use in the backend of this application.
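As a rough illustration of the pickle workflow, the sketch below fits a tiny scikit-learn vectorizer and classifier, saves them to a file, and loads them back. The toy data, the bundle layout, and the file name `unigram_logreg.pkl` are all hypothetical; the actual files in this repository may be organized differently.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data invented for illustration only (1 = AI-generated, 0 = human-written).
texts = [
    "the model generates fluent text",
    "i wrote this on the train home",
    "as an ai language model i can help",
    "my cat knocked over the coffee",
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))
classifier = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Hypothetical file name, written to a temporary directory.
model_path = Path(tempfile.mkdtemp()) / "unigram_logreg.pkl"
with open(model_path, "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "classifier": classifier}, f)

# Later, for example at server start-up, load the fitted objects back.
with open(model_path, "rb") as f:
    bundle = pickle.load(f)

sample = ["i can help as an ai"]
original = classifier.predict(vectorizer.transform(sample))
restored = bundle["classifier"].predict(bundle["vectorizer"].transform(sample))
print((original == restored).all())
```

Two general cautions apply to any pickle-based setup: only unpickle files from a source you trust, and load them with the same scikit-learn version that was used to create them where possible.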
• Unigram Model: Combines Logistic Regression and Naive Bayes models.
• Bigram Model: Combines LightGBM and Random Forest models.
The app takes the input text, transforms it with the TF-IDF vectorizer for the selected model (unigram or bigram), and returns a combined prediction from that model's two classifiers.
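The pipeline described above can be sketched as follows for the unigram case. This is an illustration under stated assumptions, not the project's actual code: the toy data is invented, `predict_combined` is a hypothetical helper, and averaging the two classifiers' probabilities is only one plausible way to combine them, since the README does not say how the combination is done.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for illustration only (1 = AI-generated, 0 = human-written).
texts = [
    "the model generates fluent text",
    "i wrote this on the train home",
    "as an ai language model i can help",
    "my cat knocked over the coffee",
]
labels = [1, 0, 1, 0]

# ngram_range=(1, 1) gives unigram features; a bigram variant would use a
# range that includes 2 (the exact range used in this project is not stated).
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

log_reg = LogisticRegression().fit(X, labels)
naive_bayes = MultinomialNB().fit(X, labels)


def predict_combined(text):
    """Average both classifiers' probabilities (assumed combination rule)."""
    features = vectorizer.transform([text])
    probs = (log_reg.predict_proba(features) + naive_bayes.predict_proba(features)) / 2
    ai_prob = float(probs[0, 1])  # column 1 corresponds to label 1 (AI-generated)
    label = "AI-generated" if ai_prob >= 0.5 else "Human-written"
    return label, ai_prob


label, ai_prob = predict_combined("as an ai language model")
print(label, round(ai_prob, 3))
```

Averaging probabilities (soft voting) lets a confident classifier outweigh an uncertain one; a majority vote over hard labels would be another reasonable choice.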
The models were trained on a dataset consisting of 4.5 lakh (450,000+) text samples, including both AI-generated and human-written content. The dataset covers a variety of topics and text lengths to ensure robustness. The dataset used for training can be found here on Kaggle. However, due to the growing complexity of AI-generated content, the model requires further training to enhance its accuracy across a wider range of texts.
To run the project locally, follow these steps:
- Clone the repository:
    git clone https://github.com/shashankrxj/AI-Text-Detector.git
- Navigate to the project directory:
    cd AI-Text-Detector
- Install dependencies:
    pip install -r requirements.txt
- Start the Flask server:
    python app.py
- Open your browser and visit:
    http://127.0.0.1:5000 or http://localhost:5000

- Open the app in your browser.
- Choose between the Unigram or Bigram model.
- Enter the text you want to analyze.
- Click the "Submit" button.
- The model will provide a prediction for the input text.
The model is trained on a dataset of 4.5 lakh text samples, but it is still evolving and may need more data to achieve higher accuracy in predicting whether text is AI-generated or human-written. As such, predictions might be incorrect in some cases. Please use the results with caution, especially in critical applications.
This project is licensed under the MIT License. See the LICENSE file for details.
