This repository contains a modularized document retrieval system with two interfaces:
- Command-Line Interface (CLI)
- Backend API
Both interfaces support fetching documents from local files, Confluence pages, and MantisBT issues, and use a vector database (Chroma, PostgreSQL, or Elasticsearch) for efficient querying.
- **Document Sources:**
  - Local files (`*.pdf`, `*.txt`, `*.html`)
  - Confluence pages
  - MantisBT issues
  - Chat history from previous sessions
- **Vector Databases:**
  - Chroma
  - PostgreSQL with `pgvector`
  - Elasticsearch
- **Embeddings:**
  - Hugging Face
  - OpenAI
  - Ollama
- **Query Options:**
  - Single-query retrieval
  - Multi-query retrieval
- **Session Management:**
  - Chat history stored as JSON files (for CLI)
  - Shared documents for multiple sessions
**Requirements:**

- Python 3.12+
- PostgreSQL (if using `pgvector`)
- Confluence API (optional)
- MantisBT API (optional)
- Elasticsearch (optional)
The system reads configuration values from environment variables, typically stored in a `.env` file. Below are the configuration settings you can customize:
| Variable | Description | Default |
|---|---|---|
| `DB_TYPE` | Type of the vector store (`chroma`, `postgres`, `elasticsearch`) | `chroma` |
| **PostgreSQL Configuration** | | |
| `POSTGRES_HOST` | Host for PostgreSQL | `localhost` |
| `POSTGRES_PORT` | Port for PostgreSQL | `5432` |
| `POSTGRES_DB` | PostgreSQL database name | `mydatabase` |
| `POSTGRES_USER` | PostgreSQL user name | `postgres` |
| `POSTGRES_PASSWORD` | PostgreSQL password | `password` |
| `POSTGRES_CONNECTION_STRING` | Dynamically generated connection string for PostgreSQL | `postgresql://postgres:password@localhost:5432/mydatabase` |
| **Chroma Configuration** | | |
| `CHROMA_COLLECTION_NAME` | Name for the Chroma collection | `my-collection` |
| **Elasticsearch Configuration** | | |
| `ELASTICSEARCH_URL` | URL for the Elasticsearch instance | `http://elasticsearch:9200` |
| `ELASTICSEARCH_INDEX` | Elasticsearch index name | `my_index` |
| `ELASTICSEARCH_USERNAME` | Username for Elasticsearch | `user` |
| `ELASTICSEARCH_PASSWORD` | Password for Elasticsearch | `password` |
| **Embedding Model Configuration** | | |
| `EMBEDDING_MODEL` | Embedding model to use (e.g., `huggingface`, `openai`) | `huggingface` |
| `EMBEDDING_MODEL_NAME` | Name of the embedding model (HuggingFace, OpenAI) | `all-MiniLM-L6-v2` |
| **LLM Model Configuration** | | |
| `LLM_MODEL` | The LLM model name (e.g., `llama3.2`) | `llama3.2` |
| `LLM_PROVIDER` | The LLM provider name (e.g., `ollama`) | `ollama` |
| **Data Storage Configuration** | | |
| `DATA_DIR` | Directory for storing documents and chat history | `./data/` |
| `SESSION_FILE` | Path for saving chat history (JSON format) | `./data/chat_history.json` |
| **Confluence Configuration** | | |
| `CONFLUENCE_API_URL` | Base URL for the Confluence API | None |
| `CONFLUENCE_API_KEY` | API key for Confluence | None |
| `CONFLUENCE_API_USER` | API user for Confluence | None |
| `CONFLUENCE_PAGE_IDS` | List of Confluence page IDs (comma-separated) | None |
| **MantisBT Configuration** | | |
| `MANTIS_API_URL` | Base URL for the MantisBT API | None |
| `MANTIS_API_KEY` | API key for MantisBT | None |
| **Document Retrieval Configuration** | | |
| `USE_HISTORY` | Enable chat history for continuity in sessions | `False` |
| `USE_MANTIS` | Retrieve data from MantisBT | `False` |
| `USE_CONFLUENCE` | Retrieve data from Confluence | `False` |
| `USE_MULTIQUERY` | Enable multi-query retrieval for better results | `True` |
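The table lists `POSTGRES_CONNECTION_STRING` as dynamically generated. Here is a minimal sketch of how it could be assembled from the individual `POSTGRES_*` settings (the actual construction in the codebase may differ):

```python
import os

# Fall back to assembling the string from its parts when it is not set
# explicitly; the defaults mirror the configuration table above.
connection_string = os.getenv("POSTGRES_CONNECTION_STRING") or (
    "postgresql://{user}:{password}@{host}:{port}/{db}".format(
        user=os.getenv("POSTGRES_USER", "postgres"),
        password=os.getenv("POSTGRES_PASSWORD", "password"),
        host=os.getenv("POSTGRES_HOST", "localhost"),
        port=os.getenv("POSTGRES_PORT", "5432"),
        db=os.getenv("POSTGRES_DB", "mydatabase"),
    )
)
```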
The following critical environment variables must be set for their corresponding features to function properly. Missing values will trigger warnings in the logs.
| Variable | Description |
|---|---|
| `POSTGRES_CONNECTION_STRING` | Connection string for PostgreSQL |
| `MANTIS_API_URL` | Base URL for the MantisBT API |
| `MANTIS_API_KEY` | API key for MantisBT |
| `CONFLUENCE_API_URL` | Base URL for the Confluence API |
| `CONFLUENCE_API_KEY` | API key for Confluence |
The system uses the `python-dotenv` package to load environment variables from a `.env` file. Additionally, helper functions manage these environment variables with proper logging, validation, and fallback defaults.
- **Helper Functions** (a minimal sketch follows this list):
  - `get_env_str()`: Fetches a string environment variable with an optional default.
  - `get_env_int()`: Fetches an integer environment variable, with error handling for invalid values.
  - `get_env_bool()`: Fetches a boolean environment variable (supports `true`, `1`, `yes`).
  - `get_env_list()`: Fetches a list from a comma-separated string.
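The exact implementations live in the codebase; the following is a minimal sketch of how such helpers might look, assuming standard `python-dotenv` usage (signatures and defaults beyond the descriptions above are assumptions):

```python
import logging
import os

from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file into the process environment
logger = logging.getLogger(__name__)

def get_env_str(name: str, default: str | None = None) -> str | None:
    """Fetch a string variable, warning when neither value nor default exists."""
    value = os.getenv(name, default)
    if value is None:
        logger.warning("Environment variable %s is not set", name)
    return value

def get_env_int(name: str, default: int = 0) -> int:
    """Fetch an integer variable, falling back to the default on bad input."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        logger.error("Invalid integer for %s: %r, using %d", name, raw, default)
        return default

def get_env_bool(name: str, default: bool = False) -> bool:
    """Treat 'true', '1', and 'yes' (case-insensitive) as True."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("true", "1", "yes")

def get_env_list(name: str, default: list[str] | None = None) -> list[str]:
    """Split a comma-separated variable into a list of trimmed items."""
    raw = os.getenv(name)
    if raw is None:
        return default or []
    return [item.strip() for item in raw.split(",") if item.strip()]
```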
Here is a sample `.env` file for configuring the system:
```env
DB_TYPE=chroma
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=mydatabase
POSTGRES_USER=postgres
POSTGRES_PASSWORD=password
CHROMA_COLLECTION_NAME=my-collection
ELASTICSEARCH_URL=http://elasticsearch:9200
ELASTICSEARCH_INDEX=my_index
ELASTICSEARCH_USERNAME=user
ELASTICSEARCH_PASSWORD=password
EMBEDDING_MODEL=huggingface
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
LLM_MODEL=llama3.2
LLM_PROVIDER=ollama
DATA_DIR=./data/
SESSION_FILE=./data/chat_history.json
CONFLUENCE_API_URL=
CONFLUENCE_API_KEY=
CONFLUENCE_API_USER=
CONFLUENCE_PAGE_IDS=
MANTIS_API_URL=
MANTIS_API_KEY=
USE_HISTORY=False
USE_MANTIS=False
USE_CONFLUENCE=False
USE_MULTIQUERY=True
```
This configuration setup ensures flexibility and ease of integration with various data sources, databases, and machine learning models.
You can configure which document sources and features you want to use via environment variables. These settings apply to both the CLI and API versions.
- **Start the CLI app:**

  ```bash
  python cli_app.py
  ```

- **Chat History:**
  - Chat history is saved to `./data/chat_history.json` (a sketch of one possible persistence scheme follows this list).
  - This file is included in subsequent sessions, allowing you to maintain continuity in conversations.
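The exact on-disk layout of the history file is not documented here; below is a minimal sketch of one plausible persistence scheme (the record shape and helper names are assumptions):

```python
import json
from pathlib import Path

# Matches the SESSION_FILE default from the configuration table.
SESSION_FILE = Path("./data/chat_history.json")

def load_history() -> list[dict]:
    """Return previous turns, or an empty list on the first run."""
    if SESSION_FILE.exists():
        return json.loads(SESSION_FILE.read_text(encoding="utf-8"))
    return []

def append_turn(question: str, answer: str) -> None:
    """Append one question/answer pair and write the file back to disk."""
    history = load_history()
    history.append({"question": question, "answer": answer})
    SESSION_FILE.parent.mkdir(parents=True, exist_ok=True)
    SESSION_FILE.write_text(json.dumps(history, indent=2), encoding="utf-8")
```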
- **Start the API server:**

  ```bash
  python api_app.py
  ```

- **API Endpoints:**

  - **Health Check:** `GET /health`

    Example request:

    ```bash
    curl http://localhost:5000/health
    ```

    Example response:

    ```json
    { "status": "ok" }
    ```

  - **Ask a Question:** `POST /ask`

    Example request:

    ```bash
    curl -X POST -H "Content-Type: application/json" -d '{"question": "What is the project about?"}' http://localhost:5000/ask
    ```

    Example response:

    ```json
    { "question": "What is the project about?", "answer": "This project is about..." }
    ```
**Use Cases:**

- **Interactive CLI:** Quickly load documents locally or via APIs and ask questions in a terminal.
- **Backend API:** Integrate document-based QA capabilities into a web or mobile app.
**Contributing:**

- Fork the repository.
- Create a new branch.
- Submit a pull request.
**Extending the System:**

- **Add New Data Sources:**
  - Implement a new loader function in `common/document_loader.py` (a hypothetical example follows this list).
  - Update the CLI and API to include the new source.
- **Support Additional Embedding Models:**
  - Add the integration in `common/vectorstore.py`.
- **Custom Retrieval Strategies:**
  - Modify or extend the logic in `common/prompt.py`.
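To illustrate the first extension point, here is a hypothetical loader for Markdown files. The returned record shape is an assumption; mirror whatever the existing loaders in `common/document_loader.py` produce:

```python
from pathlib import Path

def load_markdown_files(directory: str) -> list[dict]:
    """Hypothetical loader: collect *.md files as text-plus-metadata records."""
    documents = []
    for path in Path(directory).rglob("*.md"):
        documents.append({
            "text": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},  # assumed metadata shape
        })
    return documents
```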
The project uses Python's logging module for better traceability and debugging. Ensure that logging is appropriately configured to capture warnings, errors, and important events. Logs are written to the console by default, but you can configure logging to write to files as needed.
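For example, one possible setup that keeps console output and adds a file handler (the handler choices and format string are illustrative, not the project's actual configuration):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    handlers=[
        logging.StreamHandler(),         # console output (the default)
        logging.FileHandler("app.log"),  # optional: also write to a file
    ],
)
```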
The provided Docker Compose configuration includes services for PostgreSQL, Elasticsearch, and Ollama. However, you can choose which services to keep or modify according to your needs.
- Navigate to the `dockercompose` folder in the repository.
- Run the following command to start the services defined in the Docker Compose configuration:

  ```bash
  docker-compose up -d
  ```

- Initialize the `ollama` container with a model. Once Docker Compose is running, execute:

  ```bash
  docker exec -it ollama ollama run llama3.2
  ```

  This ensures that the `llama3.2` model is available in your system.
After these steps, the system will be fully operational, and you can use the `ollama` model for document retrieval and querying as part of the Dockerized environment.
In the future, this system will evolve into a multimodal retrieval system that can handle not only text-based documents but also images, audio, tables, figures, and other types of media. This will allow users to query across a diverse set of document types, improving the richness and depth of answers provided by the system.
Key features to be added include:
- **Image Retrieval:** Integration of models like CLIP that can generate embeddings for images and link them with text. This will allow image-based searches alongside traditional text-based searches. Additionally, OCR (Optical Character Recognition) will be incorporated to extract text from images.
- **Audio Retrieval:** Implementation of speech-to-text models (e.g., Google Speech-to-Text or DeepSpeech) to transcribe audio files into text, which can then be used in the same way as textual data for querying.
- **Table and Figure Extraction:** Enhancements to document loaders to support extracting and processing data from tables and figures. This will allow the system to retrieve structured data (e.g., tables from PDFs) and images (e.g., charts and graphs) based on textual queries.
- **Multimodal Embeddings:** Utilization of models like CLIP or similar to create unified embeddings for text, images, and audio. This will enable searching across all modalities using a single query, such as combining image and text-based information to retrieve the most relevant documents and media.
These additions will significantly expand the system's capabilities, enabling more dynamic and comprehensive query handling.
To make interactions with the system more personalized and context-aware, the memory capabilities will be enhanced to provide a richer user experience. Current plans for memory enhancements include:
- **Contextual Memory:** Implementing a memory buffer that stores past interactions, allowing the system to maintain the context of the conversation across multiple queries. This will help to provide more consistent and relevant answers based on previous conversations and queries.
- **Long-Term Memory:** Integrating a long-term memory system that can store important information across sessions, enabling the system to "remember" critical data and adapt responses based on past interactions. This could be stored in a database or vector store for efficient retrieval.
- **User-Specific Memory:** Allowing users to opt into a personalized memory system where the model retains preferences, interests, or frequently asked questions. This will enhance the relevance and personalization of responses in future sessions.
These memory enhancements will make the system more intelligent, capable of adapting over time to better suit the needs of individual users.
By combining multimodal capabilities with advanced memory features, the system will become more dynamic, intelligent, and user-friendly, offering richer interactions and more powerful document retrieval from a wide variety of data sources.