This repository contains a modularized document retrieval system with two interfaces:
- Command-Line Interface (CLI)
- Backend API
Both interfaces support fetching documents from local files, Confluence pages, and MantisBT issues, and use a vector database (Chroma, PostgreSQL, or Elasticsearch) for efficient querying.
- **Document Sources:**
  - Local files (`*.pdf`, `*.txt`, `*.html`)
  - Confluence pages
  - MantisBT issues
  - Chat history from previous sessions
- **Vector Databases:**
  - Chroma
  - PostgreSQL with `pgvector`
  - Elasticsearch
- **Embeddings:**
  - Hugging Face
  - OpenAI
  - Ollama
- **Query Options:**
  - Single-query retrieval
  - Multi-query retrieval
- **Session Management:**
  - Chat history stored as JSON files (for CLI)
  - Shared documents for multiple sessions
**Requirements:**

- Python 3.12+
- PostgreSQL (if using `pgvector`)
- Confluence API (optional)
- MantisBT API (optional)
- Elasticsearch (optional)
The system reads configuration values from environment variables, typically stored in a `.env` file. Below are the configuration settings you can customize:
| Variable | Description | Default |
|---|---|---|
| `DB_TYPE` | Type of the vector store (`chroma`, `postgres`, `elasticsearch`) | `chroma` |
| **PostgreSQL Configuration** | | |
| `POSTGRES_HOST` | Host for PostgreSQL | `localhost` |
| `POSTGRES_PORT` | Port for PostgreSQL | `5432` |
| `POSTGRES_DB` | PostgreSQL database name | `mydatabase` |
| `POSTGRES_USER` | PostgreSQL user name | `postgres` |
| `POSTGRES_PASSWORD` | PostgreSQL password | `password` |
| `POSTGRES_CONNECTION_STRING` | Dynamically generated connection string for PostgreSQL | `postgresql://postgres:password@localhost:5432/mydatabase` |
| **Chroma Configuration** | | |
| `CHROMA_COLLECTION_NAME` | Name for the Chroma collection | `my-collection` |
| **Elasticsearch Configuration** | | |
| `ELASTICSEARCH_URL` | URL for the Elasticsearch instance | `http://elasticsearch:9200` |
| `ELASTICSEARCH_INDEX` | Elasticsearch index name | `my_index` |
| `ELASTICSEARCH_USERNAME` | Username for Elasticsearch | `user` |
| `ELASTICSEARCH_PASSWORD` | Password for Elasticsearch | `password` |
| **Embedding Model Configuration** | | |
| `EMBEDDING_MODEL` | Embedding model to use (e.g., `huggingface`, `openai`) | `huggingface` |
| `EMBEDDING_MODEL_NAME` | Name of the embedding model (HuggingFace, OpenAI) | `all-MiniLM-L6-v2` |
| **LLM Model Configuration** | | |
| `LLM_MODEL` | The LLM model name (e.g., `llama3.2`) | `llama3.2` |
| `LLM_PROVIDER` | The LLM provider name (e.g., `ollama`) | `ollama` |
| **Data Storage Configuration** | | |
| `DATA_DIR` | Directory for storing documents and chat history | `./data/` |
| `SESSION_FILE` | Path for saving chat history (JSON format) | `./data/chat_history.json` |
| **Confluence Configuration** | | |
| `CONFLUENCE_API_URL` | Base URL for the Confluence API | None |
| `CONFLUENCE_API_KEY` | API key for Confluence | None |
| `CONFLUENCE_API_USER` | API user for Confluence | None |
| `CONFLUENCE_PAGE_IDS` | List of Confluence page IDs (comma-separated) | None |
| **MantisBT Configuration** | | |
| `MANTIS_API_URL` | Base URL for the MantisBT API | None |
| `MANTIS_API_KEY` | API key for MantisBT | None |
| **Document Retrieval Configuration** | | |
| `USE_HISTORY` | Enable chat history for continuity in sessions | `False` |
| `USE_MANTIS` | Retrieve data from MantisBT | `False` |
| `USE_CONFLUENCE` | Retrieve data from Confluence | `False` |
| `USE_MULTIQUERY` | Enable multi-query retrieval for better results | `True` |
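The table lists `POSTGRES_CONNECTION_STRING` as dynamically generated. Here is a minimal sketch of how it could be assembled from the individual `POSTGRES_*` settings (the actual construction in the codebase may differ):

```python
import os

# Fall back to assembling the string from its parts when it is not set
# explicitly; the defaults mirror the configuration table above.
connection_string = os.getenv("POSTGRES_CONNECTION_STRING") or (
    "postgresql://{user}:{password}@{host}:{port}/{db}".format(
        user=os.getenv("POSTGRES_USER", "postgres"),
        password=os.getenv("POSTGRES_PASSWORD", "password"),
        host=os.getenv("POSTGRES_HOST", "localhost"),
        port=os.getenv("POSTGRES_PORT", "5432"),
        db=os.getenv("POSTGRES_DB", "mydatabase"),
    )
)
```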
The following critical environment variables must be set for their corresponding features to function properly. Missing values will trigger warnings in the logs.
| Variable | Description |
|---|---|
| `POSTGRES_CONNECTION_STRING` | Connection string for PostgreSQL |
| `MANTIS_API_URL` | Base URL for the MantisBT API |
| `MANTIS_API_KEY` | API key for MantisBT |
| `CONFLUENCE_API_URL` | Base URL for the Confluence API |
| `CONFLUENCE_API_KEY` | API key for Confluence |
The system uses the `python-dotenv` package to load environment variables from a `.env` file. Additionally, helper functions manage these environment variables with proper logging, validation, and fallback defaults.
- **Helper Functions** (a minimal sketch follows this list):
  - `get_env_str()`: Fetches a string environment variable with an optional default.
  - `get_env_int()`: Fetches an integer environment variable, with error handling for invalid values.
  - `get_env_bool()`: Fetches a boolean environment variable (supports `true`, `1`, `yes`).
  - `get_env_list()`: Fetches a list from a comma-separated string.
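The exact implementations live in the codebase; the following is a minimal sketch of how such helpers might look, assuming standard `python-dotenv` usage (signatures and defaults beyond the descriptions above are assumptions):

```python
import logging
import os

from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file into the process environment
logger = logging.getLogger(__name__)

def get_env_str(name: str, default: str | None = None) -> str | None:
    """Fetch a string variable, warning when neither value nor default exists."""
    value = os.getenv(name, default)
    if value is None:
        logger.warning("Environment variable %s is not set", name)
    return value

def get_env_int(name: str, default: int = 0) -> int:
    """Fetch an integer variable, falling back to the default on bad input."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        logger.error("Invalid integer for %s: %r, using %d", name, raw, default)
        return default

def get_env_bool(name: str, default: bool = False) -> bool:
    """Treat 'true', '1', and 'yes' (case-insensitive) as True."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("true", "1", "yes")

def get_env_list(name: str, default: list[str] | None = None) -> list[str]:
    """Split a comma-separated variable into a list of trimmed items."""
    raw = os.getenv(name)
    if raw is None:
        return default or []
    return [item.strip() for item in raw.split(",") if item.strip()]
```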
Here is a sample `.env` file for configuring the system:
```env
DB_TYPE=chroma
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=mydatabase
POSTGRES_USER=postgres
POSTGRES_PASSWORD=password
CHROMA_COLLECTION_NAME=my-collection
ELASTICSEARCH_URL=http://elasticsearch:9200
ELASTICSEARCH_INDEX=my_index
ELASTICSEARCH_USERNAME=user
ELASTICSEARCH_PASSWORD=password
EMBEDDING_MODEL=huggingface
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
LLM_MODEL=llama3.2
LLM_PROVIDER=ollama
DATA_DIR=./data/
SESSION_FILE=./data/chat_history.json
CONFLUENCE_API_URL=
CONFLUENCE_API_KEY=
CONFLUENCE_API_USER=
CONFLUENCE_PAGE_IDS=
MANTIS_API_URL=
MANTIS_API_KEY=
USE_HISTORY=False
USE_MANTIS=False
USE_CONFLUENCE=False
USE_MULTIQUERY=True
```
This configuration setup ensures flexibility and ease of integration with various data sources, databases, and machine learning models.
You can configure which document sources and features you want to use via environment variables. These settings apply to both the CLI and API versions.
- **Start the CLI app:**

  ```bash
  python cli_app.py
  ```

- **Chat History:**
  - Chat history is saved to `./data/chat_history.json` (a sketch of one possible persistence scheme follows this list).
  - This file is included in subsequent sessions, allowing you to maintain continuity in conversations.
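The exact on-disk layout of the history file is not documented here; below is a minimal sketch of one plausible persistence scheme (the record shape and helper names are assumptions):

```python
import json
from pathlib import Path

# Matches the SESSION_FILE default from the configuration table.
SESSION_FILE = Path("./data/chat_history.json")

def load_history() -> list[dict]:
    """Return previous turns, or an empty list on the first run."""
    if SESSION_FILE.exists():
        return json.loads(SESSION_FILE.read_text(encoding="utf-8"))
    return []

def append_turn(question: str, answer: str) -> None:
    """Append one question/answer pair and write the file back to disk."""
    history = load_history()
    history.append({"question": question, "answer": answer})
    SESSION_FILE.parent.mkdir(parents=True, exist_ok=True)
    SESSION_FILE.write_text(json.dumps(history, indent=2), encoding="utf-8")
```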
- **Start the API server:**

  ```bash
  python api_app.py
  ```

- **API Endpoints:**

  - **Health Check:** `GET /health`

    Example request:

    ```bash
    curl http://localhost:5000/health
    ```

    Example response:

    ```json
    { "status": "ok" }
    ```

  - **Ask a Question:** `POST /ask`

    Example request:

    ```bash
    curl -X POST -H "Content-Type: application/json" -d '{"question": "What is the project about?"}' http://localhost:5000/ask
    ```

    Example response:

    ```json
    { "question": "What is the project about?", "answer": "This project is about..." }
    ```
**Use Cases:**

- **Interactive CLI:** Quickly load documents locally or via APIs and ask questions in a terminal.
- **Backend API:** Integrate document-based QA capabilities into a web or mobile app.
**Contributing:**

- Fork the repository.
- Create a new branch.
- Submit a pull request.
**Extending the System:**

- **Add New Data Sources:**
  - Implement a new loader function in `common/document_loader.py` (a hypothetical example follows this list).
  - Update the CLI and API to include the new source.
- **Support Additional Embedding Models:**
  - Add the integration in `common/vectorstore.py`.
- **Custom Retrieval Strategies:**
  - Modify or extend the logic in `common/prompt.py`.
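To illustrate the first extension point, here is a hypothetical loader for Markdown files. The returned record shape is an assumption; mirror whatever the existing loaders in `common/document_loader.py` produce:

```python
from pathlib import Path

def load_markdown_files(directory: str) -> list[dict]:
    """Hypothetical loader: collect *.md files as text-plus-metadata records."""
    documents = []
    for path in Path(directory).rglob("*.md"):
        documents.append({
            "text": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},  # assumed metadata shape
        })
    return documents
```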
The project uses Python's logging module for better traceability and debugging. Ensure that logging is appropriately configured to capture warnings, errors, and important events. Logs are written to the console by default, but you can configure logging to write to files as needed.
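For example, one possible setup that keeps console output and adds a file handler (the handler choices and format string are illustrative, not the project's actual configuration):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    handlers=[
        logging.StreamHandler(),         # console output (the default)
        logging.FileHandler("app.log"),  # optional: also write to a file
    ],
)
```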
The provided Docker Compose configuration includes services for PostgreSQL, Elasticsearch, and Ollama. However, you can choose which services to keep or modify according to your needs.
- Navigate to the `dockercompose` folder in the repository.
- Run the following command to start the services defined in the Docker Compose configuration:

  ```bash
  docker-compose up -d
  ```

- Initialize the `ollama` container with a model. Once Docker Compose is running, execute:

  ```bash
  docker exec -it ollama ollama run llama3.2
  ```

  This ensures that the `llama3.2` model is available in your system.
After these steps, the system will be fully operational, and you can use the `ollama` model for document retrieval and querying as part of the Dockerized environment.
In the future, this system will evolve into a multimodal retrieval system that can handle not only text-based documents but also images, audio, tables, figures, and other types of media. This will allow users to query across a diverse set of document types, improving the richness and depth of answers provided by the system.
Key features to be added include:
- **Image Retrieval:** Integration of models like CLIP that can generate embeddings for images and link them with text. This will allow image-based searches alongside traditional text-based searches. Additionally, OCR (Optical Character Recognition) will be incorporated to extract text from images.
- **Audio Retrieval:** Implementation of speech-to-text models (e.g., Google Speech-to-Text or DeepSpeech) to transcribe audio files into text, which can then be used in the same way as textual data for querying.
- **Table and Figure Extraction:** Enhancements to document loaders to support extracting and processing data from tables and figures. This will allow the system to retrieve structured data (e.g., tables from PDFs) and images (e.g., charts and graphs) based on textual queries.
- **Multimodal Embeddings:** Utilization of models like CLIP or similar to create unified embeddings for text, images, and audio. This will enable searching across all modalities using a single query, such as combining image and text-based information to retrieve the most relevant documents and media.
These additions will significantly expand the system's capabilities, enabling more dynamic and comprehensive query handling.
To make interactions with the system more personalized and context-aware, the memory capabilities will be enhanced to provide a richer user experience. Current plans for memory enhancements include:
- **Contextual Memory:** Implementing a memory buffer that stores past interactions, allowing the system to maintain the context of the conversation across multiple queries. This will help to provide more consistent and relevant answers based on previous conversations and queries.
- **Long-Term Memory:** Integrating a long-term memory system that can store important information across sessions, enabling the system to "remember" critical data and adapt responses based on past interactions. This could be stored in a database or vector store for efficient retrieval.
- **User-Specific Memory:** Allowing users to opt into a personalized memory system where the model retains preferences, interests, or frequently asked questions. This will enhance the relevance and personalization of responses in future sessions.
These memory enhancements will make the system more intelligent, capable of adapting over time to better suit the needs of individual users.
By combining multimodal capabilities with advanced memory features, the system will become more dynamic, intelligent, and user-friendly, offering richer interactions and more powerful document retrieval from a wide variety of data sources.