GitHub - Chhavimohitkar65/Document-Retrieval_QueryMorphAi: This project implements a Document Retrieval System that integrates GPT-3.5-turbo for query expansion and answer generation. It fetches and ranks documents based on user queries, leveraging MongoDB for document storage, Redis for caching, and web scraping to keep documents updated. The system is designed to provide fast and accurate search results

🌟 QueryMorphAi

Document Retrieval System with GPT-3.5-turbo

This project implements a Document Retrieval System that integrates GPT-3.5-turbo for query expansion and answer generation. It fetches and ranks documents based on user queries, leveraging MongoDB for document storage, Redis for caching, and web scraping to keep documents updated. The system is designed to provide fast and accurate search results using a combination of modern NLP techniques.

##MongoDB

#user

🌟 Key Features of QueryMorphAi

⚡ FastAPI Setup

Endpoints:
- /health: Verify if the API is running.
- /search: Search for documents based on user queries, expands them using GPT-3.5-turbo, and generates context-based answers from the retrieved documents.

📚 Document Storage in MongoDB

Efficiently stores documents and their embeddings in MongoDB for fast retrieval and ranked search results.

🚀 Redis Cache for Performance

Caches query results using Redis to boost performance for repeated queries. Cache entries expire after one hour, ensuring fresh content.

🧠 Query Expansion with GPT-3.5

Automatically expands user queries using OpenAI's GPT-3.5-turbo model, improving the accuracy of document retrieval.

✨ Answer Generation with GPT-3.5

Provides human-like, context-aware answers based on retrieved documents using GPT-3.5-turbo, enhancing user interactions.

🔍 Advanced Document Similarity Search

Calculates cosine similarity between the query embeddings and document embeddings stored in MongoDB, ranking the results based on similarity scores.

⚠️ User Request Limiting

Each user can make up to 5 search requests. After that, the system triggers an HTTP 429 - Too Many Requests error to prevent abuse.

📰 Automated News Scraping

News articles from BBC, CNN, and The New York Times are scraped using newspaper3k and stored in MongoDB for enhanced, real-time searchability.

🌐 Streamlit Frontend

A user-friendly Streamlit interface allows users to:
- Input search queries.
- View expanded queries and similarity scores.
- Scrape news articles.
- Check API health.

🔄 Background News Scraping

A background thread automatically scrapes news articles when the FastAPI server starts, keeping the document database updated.

🐳 Dockerized for Easy Deployment

Fully containerized using Docker with:
- Python 3.10-slim base image for efficiency.
- All dependencies pre-installed via requirements.txt.
- Exposed ports for Streamlit (8501).
- Runs both FastAPI and Streamlit concurrently in the same container for seamless service.

✅ Health Check Endpoint

The /health endpoint ensures the FastAPI service is up and running, providing an operational check.

🔄 Concurrent FastAPI & Streamlit Execution

Both FastAPI and Streamlit run concurrently using threading, making the system accessible from a single Docker container.

Architecture Diagram And Demonstration video

https://drive.google.com/drive/folders/1L7LG9xm2iWcN8gHYf3tqBO1brAo7lt32?usp=sharing

System Components:

Frontend (UI): Built with Streamlit to offer a user-friendly interface for inputting queries and viewing results.
MongoDB: Stores document data, including content and embeddings, for efficient search and retrieval.
Redis: A caching layer that stores search results temporarily to speed up future queries.
OpenAI GPT-3.5-turbo: Expands user queries and generates answers based on the retrieved documents.
Document Ranking: Combines cosine similarity (via embeddings) and TF-IDF to rank documents.
Web Scraper: Periodically scrapes and updates articles from news websites, storing them in MongoDB.

Technologies Used

Streamlit: Provides the user interface.
MongoDB: A NoSQL database to store and retrieve documents.
Redis: Caching service to store search results temporarily.
OpenAI GPT-3.5-turbo: Expands user queries and generates natural language answers.
Sentence Transformers: Used for generating document and query embeddings.
scikit-learn: Utilized for TF-IDF computation and cosine similarity calculation.
newspaper3k & BeautifulSoup: For scraping and parsing HTML content from news websites.

Prerequisites

Python 3.8 or above
MongoDB (for document storage)
Redis (for caching)
OpenAI API Key

Installation

Clone the repository:

git clone https://github.com/yourusername/21BAI10367_ML.git

Navigate to the project directory:
```
cd 21BAI10367_ML
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Set up your environment variables:

OPENAI_API_KEY=your_openai_api_key
MONGO_URI=your_mongodb_uri

Start MongoDB and Redis services:
- Ensure MongoDB is running locally or on a remote server.
- Start Redis using the following command:
```
redis-server
```
Run the FastAPI server
```
uvicorn main:app --reload
```
Run the Streamlit app
```
streamlit run app.py
```

Usage

1. Search for Documents:

Input a query in the Streamlit app.
GPT-3.5-turbo expands the query and retrieves relevant documents from MongoDB.
Documents are ranked based on cosine similarity and TF-IDF scores.

2. Scrape News Articles:

The web scraper fetches the latest news articles from various sources and stores them in MongoDB.
Click the "Scrape News" button in the UI to trigger scraping manually.

3. Cached Results:

Search results are cached in Redis to speed up repeated queries within a set time frame.

Query Flow Overview

User Query: Input query through the Streamlit UI.
Query Expansion: GPT-3.5-turbo expands the query for more comprehensive results.
Document Retrieval: Documents are fetched from MongoDB and ranked using embeddings and TF-IDF.
Re-ranking and Answer Generation: Refined results are presented, and an answer is generated using GPT-3.5-turbo.
Results Display: Final results are shown in the UI, and the response is cached for future queries.

Future Enhancements

PDF and Word Document Support: Extend to support formats like PDF and Word documents.
Summarization: Add document summarization for quick insights.
Authentication: Implement user authentication for personalized document retrieval.
Scalability: Optimize the system for larger datasets and more concurrent users.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🌟 QueryMorphAi

Document Retrieval System with GPT-3.5-turbo

#user

🌟 Key Features of QueryMorphAi

⚡ FastAPI Setup

📚 Document Storage in MongoDB

🚀 Redis Cache for Performance

🧠 Query Expansion with GPT-3.5

✨ Answer Generation with GPT-3.5

🔍 Advanced Document Similarity Search

⚠️ User Request Limiting

📰 Automated News Scraping

🌐 Streamlit Frontend

🔄 Background News Scraping

🐳 Dockerized for Easy Deployment

✅ Health Check Endpoint

🔄 Concurrent FastAPI & Streamlit Execution

Architecture Diagram And Demonstration video

System Components:

Technologies Used

Prerequisites

Installation

Usage

1. Search for Documents:

2. Scrape News Articles:

3. Cached Results:

Query Flow Overview

Future Enhancements

About

Releases

Packages

Languages

License

Chhavimohitkar65/Document-Retrieval_QueryMorphAi

Folders and files

Latest commit

History

Repository files navigation

🌟 QueryMorphAi

Document Retrieval System with GPT-3.5-turbo

#user

🌟 Key Features of QueryMorphAi

⚡ FastAPI Setup

📚 Document Storage in MongoDB

🚀 Redis Cache for Performance

🧠 Query Expansion with GPT-3.5

✨ Answer Generation with GPT-3.5

🔍 Advanced Document Similarity Search

⚠️ User Request Limiting

📰 Automated News Scraping

🌐 Streamlit Frontend

🔄 Background News Scraping

🐳 Dockerized for Easy Deployment

✅ Health Check Endpoint

🔄 Concurrent FastAPI & Streamlit Execution

Architecture Diagram And Demonstration video

System Components:

Technologies Used

Prerequisites

Installation

Usage

1. Search for Documents:

2. Scrape News Articles:

3. Cached Results:

Query Flow Overview

Future Enhancements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages