A lightweight plagiarism checker using TF-IDF and cosine similarity
This Plagiarism Checker compares text documents and scores their similarity using TF-IDF (Term Frequency-Inverse Document Frequency) vectors and cosine similarity. The application provides:
✔ Text input comparison
✔ File upload support (.txt, .pdf, .docx)
✔ Visual similarity percentage & warnings
✔ User-friendly UI powered by Streamlit
🔹 TF-IDF + cosine similarity to score word-level overlap between documents
🔹 Preprocessing: Lowercasing, stopword removal, and lemmatization
🔹 File support: Extracts text from TXT, PDF, and DOCX files
🔹 Real-time similarity detection with interactive progress bar
🔹 Custom warning messages based on similarity scores
git clone https://github.com/rohitkshirsagar19/plagiarism-checker-basic.git
cd plagiarism-checker-basic
pip install -r requirements.txt
streamlit run app.py

The `preprocessing.py` script processes text by:
✅ Converting to lowercase
✅ Removing special characters
✅ Removing stopwords
✅ Lemmatizing words
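
The four steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the actual contents of `preprocessing.py`: the function name `preprocess`, the tiny `DEMO_STOPWORDS` set, and the pluggable `lemmatize` parameter are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable

# Tiny illustrative stopword list (assumption); the project uses NLTK's list.
DEMO_STOPWORDS = frozenset({"a", "an", "and", "is", "of", "on", "the", "to"})


def preprocess(
    text: str,
    stopwords: Iterable[str] = DEMO_STOPWORDS,
    lemmatize: Callable[[str], str] = lambda word: word,
) -> str:
    """Lowercase, strip special characters, drop stopwords, lemmatize."""
    stopword_set = set(stopwords)
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop stopwords, then pass each remaining token through the lemmatizer.
    tokens = [lemmatize(tok) for tok in text.split() if tok not in stopword_set]
    return " ".join(tokens)


print(preprocess("The Cat, sat ON the mat!"))
```

With NLTK installed and its `stopwords` and `wordnet` corpora downloaded, you could pass `set(nltk.corpus.stopwords.words("english"))` as `stopwords` and `nltk.stem.WordNetLemmatizer().lemmatize` as `lemmatize`. The default `lemmatize` here leaves tokens unchanged.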
The `similarity.py` script:
1️⃣ Converts text into TF-IDF vectors
2️⃣ Calculates cosine similarity
3️⃣ Returns similarity percentage
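
As a rough sketch of those three steps with scikit-learn: the function name `similarity_percentage` and the rounding to two decimals are assumptions for illustration, and the real `similarity.py` may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_percentage(text_a: str, text_b: str) -> float:
    """Return the TF-IDF cosine similarity of two texts as a percentage."""
    # Fit one vocabulary over both texts so the vectors share dimensions.
    # Note: TfidfVectorizer raises ValueError if both texts have no usable tokens.
    vectors = TfidfVectorizer().fit_transform([text_a, text_b])
    # Slicing keeps each row two-dimensional; the result is a 1x1 matrix.
    score = cosine_similarity(vectors[0:1], vectors[1:2])[0, 0]
    return round(float(score) * 100, 2)


print(similarity_percentage("the cat sat on the mat", "the cat lay on the rug"))
```

Identical texts score 100 and texts with no shared terms score 0. Because TF-IDF works on word counts, heavily paraphrased text can score low even when the ideas are copied.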
The `app.py` script provides an interactive UI where users can:
📌 Enter text manually OR upload files
📌 View similarity results in real time
📌 Get warnings for potential plagiarism
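
One way the warning logic could look is sketched below. The function name `warning_for` and the 80% and 50% thresholds are hypothetical values chosen for the example, not the ones `app.py` uses.

```python
def warning_for(score: float) -> tuple[str, str]:
    """Map a similarity percentage to a (level, message) pair."""
    # Thresholds are illustrative assumptions, not the project's values.
    if score >= 80:
        return "error", f"High similarity ({score:.1f}%): likely plagiarism."
    if score >= 50:
        return "warning", f"Moderate similarity ({score:.1f}%): review the overlap."
    return "success", f"Low similarity ({score:.1f}%): no obvious overlap."


level, message = warning_for(92.3)
print(level, message)
```

In a Streamlit app, the returned level could select between `st.error`, `st.warning`, and `st.success`, with `st.progress(int(score))` drawing the bar.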
📝 Python – Core language
📖 NLTK – Text preprocessing
📊 Scikit-learn – TF-IDF & similarity calculations
📄 PyPDF2 / python-docx – File extraction
🌐 Streamlit – Web-based UI
- Fork the repo
- Create a new branch (`feature-xyz`)
- Commit changes
- Push and create a PR
🔹 Open-source under the MIT License
Made with 💖 by [rohitkshirsagar19](https://github.com/rohitkshirsagar19) | ⭐ Star this repo if you like it!

