This project implements a Document Retrieval System that integrates GPT-3.5-turbo for query expansion and answer generation. It fetches and ranks documents based on user queries, using MongoDB for document storage, Redis for caching, and web scraping to keep documents updated. The system is designed to return fast, relevant search results by combining sentence embeddings, TF-IDF, and LLM-based query expansion.
- Endpoints:
  - /health: Verifies that the API is running.
  - /search: Searches for documents based on user queries, expands the queries using GPT-3.5-turbo, and generates context-based answers from the retrieved documents.
- Efficiently stores documents and their embeddings in MongoDB for fast retrieval and ranked search results.
- Caches query results using Redis to boost performance for repeated queries. Cache entries expire after one hour, ensuring fresh content.
- Automatically expands user queries using OpenAI's GPT-3.5-turbo model, improving the accuracy of document retrieval.
- Provides human-like, context-aware answers based on retrieved documents using GPT-3.5-turbo, enhancing user interactions.
- Calculates cosine similarity between the query embeddings and document embeddings stored in MongoDB, ranking the results based on similarity scores.
- Each user can make up to 5 search requests. After that, the system returns an HTTP 429 (Too Many Requests) error to prevent abuse.
- News articles from BBC, CNN, and The New York Times are scraped using newspaper3k and stored in MongoDB so that recent articles are searchable.
- A user-friendly Streamlit interface allows users to:
  - Input search queries.
  - View expanded queries and similarity scores.
  - Scrape news articles.
  - Check API health.
- A background thread automatically scrapes news articles when the FastAPI server starts, keeping the document database updated.
- Fully containerized using Docker with:
  - Python 3.10-slim base image for efficiency.
  - All dependencies pre-installed via requirements.txt.
  - Exposed ports for Streamlit (8501).
  - Runs both FastAPI and Streamlit concurrently in the same container for seamless service.
- The /health endpoint reports whether the FastAPI service is up and running, providing an operational check.
- Both FastAPI and Streamlit run concurrently using threading, making the system accessible from a single Docker container.
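
The per-user request limit in the feature list above can be sketched as follows. This is a minimal illustration, not the project's actual code: the README does not say how request counts are stored or whether they ever reset, so the in-memory counter, the class name `RequestLimiter`, and the exception `RateLimitExceeded` are all assumptions. Only the limit of 5 and the 429 status come from the README.

```python
from collections import defaultdict


class RateLimitExceeded(Exception):
    """Raised when a user exceeds the allowed number of requests."""

    status_code = 429  # HTTP 429 Too Many Requests


class RequestLimiter:
    """Counts requests per user and rejects any beyond the limit.

    Assumption: counts live in process memory and are never reset.
    """

    def __init__(self, max_requests: int = 5) -> None:
        self.max_requests = max_requests
        self._counts = defaultdict(int)

    def check(self, user_id: str) -> int:
        """Record one request and return how many the user has made so far."""
        if self._counts[user_id] >= self.max_requests:
            raise RateLimitExceeded(
                f"user {user_id!r} exceeded {self.max_requests} requests"
            )
        self._counts[user_id] += 1
        return self._counts[user_id]


if __name__ == "__main__":
    limiter = RequestLimiter(max_requests=5)
    for _ in range(5):
        limiter.check("alice")
    try:
        limiter.check("alice")  # sixth request is rejected
    except RateLimitExceeded as exc:
        print(exc.status_code, exc)
```

In a FastAPI handler, an exception like this would typically be translated into an `HTTPException` with `status_code=429`.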
https://drive.google.com/drive/folders/1L7LG9xm2iWcN8gHYf3tqBO1brAo7lt32?usp=sharing
- Frontend (UI): Built with Streamlit to offer a user-friendly interface for inputting queries and viewing results.
- MongoDB: Stores document data, including content and embeddings, for efficient search and retrieval.
- Redis: A caching layer that stores search results temporarily to speed up future queries.
- OpenAI GPT-3.5-turbo: Expands user queries and generates answers based on the retrieved documents.
- Document Ranking: Combines cosine similarity (via embeddings) and TF-IDF to rank documents.
- Web Scraper: Periodically scrapes and updates articles from news websites, storing them in MongoDB.
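
The Document Ranking component above combines embedding-based cosine similarity with TF-IDF, but the README does not state how the two scores are merged. The sketch below assumes a simple weighted sum; the weight `alpha`, the function names, and the whitespace tokenizer are hypothetical. It is written in plain Python to stay self-contained, whereas the project itself lists scikit-learn and Sentence Transformers for these steps.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tfidf_vectors(texts):
    """Return one TF-IDF vector per text over a shared, sorted vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(texts)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, so terms that appear in every text keep a positive weight.
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vectors


def rank(query, query_emb, docs, alpha=0.5):
    """Rank docs by alpha * embedding cosine + (1 - alpha) * TF-IDF cosine.

    `docs` is a list of (text, embedding) pairs; `alpha` is an assumed weight.
    """
    vectors = tfidf_vectors([query] + [text for text, _ in docs])
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = []
    for (text, emb), vec in zip(docs, doc_vecs):
        score = alpha * cosine(query_emb, emb) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        ("stock markets fall sharply", [0.9, 0.1]),
        ("local team wins football final", [0.1, 0.9]),
    ]
    for score, text in rank("stock markets", [1.0, 0.0], docs):
        print(f"{score:.3f}  {text}")
```

The toy two-dimensional embeddings stand in for real Sentence Transformers output, which would be stored alongside each document in MongoDB.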
- Streamlit: Provides the user interface.
- MongoDB: A NoSQL database to store and retrieve documents.
- Redis: Caching service to store search results temporarily.
- OpenAI GPT-3.5-turbo: Expands user queries and generates natural language answers.
- Sentence Transformers: Used for generating document and query embeddings.
- scikit-learn: Utilized for TF-IDF computation and cosine similarity calculation.
- newspaper3k & BeautifulSoup: For scraping and parsing HTML content from news websites.
- Python 3.8 or above
- MongoDB (for document storage)
- Redis (for caching)
- OpenAI API Key
- Clone the repository:
  git clone https://github.com/yourusername/21BAI10367_ML.git
- Navigate to the project directory:
  cd 21BAI10367_ML
- Install the required dependencies:
  pip install -r requirements.txt
- Set up your environment variables:
  OPENAI_API_KEY=your_openai_api_key
  MONGO_URI=your_mongodb_uri
- Start MongoDB and Redis services:
  - Ensure MongoDB is running locally or on a remote server.
  - Start Redis using the following command:
    redis-server
- Run the FastAPI server:
  uvicorn main:app --reload
- Run the Streamlit app:
  streamlit run app.py
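
The setup steps above rely on two environment variables, `OPENAI_API_KEY` and `MONGO_URI`. The README does not show how the application reads them, so the following is only a sketch of one way to fail early when either is missing; the helper name `load_settings` is hypothetical.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "MONGO_URI")


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        settings = load_settings()
        print("Loaded settings for:", ", ".join(settings))
    except RuntimeError as exc:
        print(exc)
```

Calling a check like this at startup surfaces a misconfigured environment before the first search request reaches OpenAI or MongoDB.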
- Input a query in the Streamlit app.
- GPT-3.5-turbo expands the query and retrieves relevant documents from MongoDB.
- Documents are ranked based on cosine similarity and TF-IDF scores.
- The web scraper fetches the latest news articles from various sources and stores them in MongoDB.
- Click the "Scrape News" button in the UI to trigger scraping manually.
- Search results are cached in Redis to speed up repeated queries within a set time frame.
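
The caching behavior described above (results stored in Redis and expiring after one hour) roughly follows a cache-aside pattern. The sketch below assumes a client exposing `get` and `setex` as in redis-py; the key scheme, the function name `cached_search`, and the JSON serialization are assumptions. `FakeRedis` is an in-memory stand-in so the example runs without a Redis server.

```python
import json
import time

CACHE_TTL_SECONDS = 3600  # one hour, the expiry stated in this README


def cached_search(client, query, search_fn, ttl=CACHE_TTL_SECONDS):
    """Return cached results for `query`, or compute, store, and return them."""
    key = "search:" + query.strip().lower()  # hypothetical key scheme
    hit = client.get(key)
    if hit is not None:
        return json.loads(hit)
    results = search_fn(query)
    client.setex(key, ttl, json.dumps(results))
    return results


class FakeRedis:
    """In-memory stand-in exposing the two client methods used above."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        value, expires_at = self._data.get(name, (None, 0.0))
        return value if time.monotonic() < expires_at else None

    def setex(self, name, seconds, value):
        self._data[name] = (value, time.monotonic() + seconds)


if __name__ == "__main__":
    def slow_search(query):
        print("running full search for", repr(query))
        return [{"title": "example", "score": 0.9}]

    client = FakeRedis()
    cached_search(client, "climate news", slow_search)  # miss: runs the search
    cached_search(client, "climate news", slow_search)  # hit: served from cache
```

With a real server, `FakeRedis()` would be replaced by a `redis.Redis(...)` client; the rest of the function stays the same.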
- User Query: Input query through the Streamlit UI.
- Query Expansion: GPT-3.5-turbo expands the query for more comprehensive results.
- Document Retrieval: Documents are fetched from MongoDB and ranked using embeddings and TF-IDF.
- Re-ranking and Answer Generation: Refined results are presented, and an answer is generated using GPT-3.5-turbo.
- Results Display: Final results are shown in the UI, and the response is cached for future queries.
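
The five workflow steps above can be sketched as a single function that wires the stages together. Every name here (`run_search` and its parameters) is hypothetical, and each stage is passed in as a plain callable, so the sketch makes no claim about how the project actually calls GPT-3.5-turbo, MongoDB, or Redis. A dict stands in for the cache.

```python
def run_search(query, expand, retrieve, rerank, answer, cache):
    """Run the workflow stages in order, using injected callables for each one."""
    if query in cache:  # repeated query: reuse the stored response
        return cache[query]
    expanded = expand(query)               # query expansion
    candidates = retrieve(expanded)        # document retrieval
    ranked = rerank(expanded, candidates)  # re-ranking
    response = {
        "expanded_query": expanded,
        "results": ranked,
        "answer": answer(query, ranked),   # answer generation
    }
    cache[query] = response                # cache for future queries
    return response


if __name__ == "__main__":
    # Stub stages standing in for the real services.
    corpus = ["inflation and interest rates", "football results"]
    response = run_search(
        "inflation",
        expand=lambda q: q + " prices economy",
        retrieve=lambda q: [d for d in corpus if set(q.split()) & set(d.split())],
        rerank=lambda q, docs: sorted(docs),
        answer=lambda q, docs: f"Found {len(docs)} document(s) about {q}.",
        cache={},
    )
    print(response)
```

Passing the stages as arguments keeps each one replaceable, which also makes the flow easy to test with stubs.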
- PDF and Word Document Support: Extend document ingestion to additional formats such as PDF and Word files.
- Summarization: Add document summarization for quick insights.
- Authentication: Implement user authentication for personalized document retrieval.
- Scalability: Optimize the system for larger datasets and more concurrent users.