This project implements a Retrieval Augmented Generation (RAG) system that allows users to query a PostgreSQL database using natural language. It leverages Large Language Models (LLMs) from OpenAI, the LangChain framework, a ChromaDB vector store for schema embeddings, and a FastAPI backend, all containerized with Docker.
The system works in two main phases:
- **Schema Ingestion (on application startup):**
  - The application connects to the specified PostgreSQL database.
  - It extracts detailed schema information, including table names, column names, data types, primary/foreign keys, any available comments or descriptions for tables and columns, and view definitions.
  - This schema information is converted into vector embeddings using an OpenAI embedding model.
  - The embeddings are stored in a ChromaDB vector database, creating a searchable knowledge base of the database schema.
- **Query Processing (on user request):**
  - The user submits a natural language query (e.g., "Show me all customers from London" or "What is the total revenue for last month?").
  - The system embeds the user's query and searches the ChromaDB vector store for relevant schema information (table names, column names, descriptions) that may be needed to answer it.
  - The original natural language query, along with the retrieved schema context, is passed to a LangChain SQL Agent powered by an OpenAI LLM (e.g., GPT-3.5-turbo or GPT-4).
  - The SQL Agent uses this combined information to generate an appropriate SQL query.
  - The generated SQL query is executed against the PostgreSQL database.
  - The results of the SQL query are returned to the user, answering their natural language question.
This approach allows the LLM to have "awareness" of the database structure, leading to more accurate SQL generation, especially for complex databases or queries.
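To make the ingestion phase concrete, here is a minimal sketch of how schema text can be embedded into ChromaDB with LangChain. This is not the project's actual code: the `ingest_schema` helper, the collection name, and the host/port values are illustrative assumptions.

```python
# Sketch of the startup ingestion phase (assumed names; not the project's exact code).
import chromadb
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def ingest_schema(schema_docs: list[str]) -> Chroma:
    """Embed schema description strings and store them in ChromaDB."""
    embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
    client = chromadb.HttpClient(host="chroma", port=8000)  # service name from docker-compose
    # One document per table/view description; Chroma indexes them for similarity search.
    return Chroma.from_texts(
        texts=schema_docs,
        embedding=embeddings,
        client=client,
        collection_name="schema_embeddings",  # assumed collection name
    )
```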
The architecture and data flows are illustrated in the following Mermaid diagram:

```mermaid
graph LR
    subgraph User Interaction
        User["<fa:fa-user> User"] -- "Natural Language Query" --> FastAPI
    end

    subgraph "Application Service (app)"
        FastAPI["<fa:fa-server> FastAPI Backend"]
        FastAPI -- "Embed & Search" --> ChromaDB
        FastAPI -- "Query + Context" --> SQLAgent
        SQLAgent["<fa:fa-robot> LangChain SQL Agent (LLM)"] -- "Generate SQL" --> SQLAgent
        SQLAgent -- "Execute SQL" --> PostgreSQL
    end

    subgraph "Vector Store (chroma)"
        ChromaDB["<fa:fa-database> ChromaDB Vector Store"]
        FastAPI -- "Store Schema Embeddings (on startup)" --> ChromaDB
    end

    subgraph "Data Store (db)"
        PostgreSQL["<fa:fa-database> PostgreSQL Database"]
    end

    subgraph External Services
        OpenAI["<fa:fa-cloud> OpenAI API"]
        FastAPI -- "Schema for Embedding" --> OpenAI
        SQLAgent -- "LLM for SQL Gen" --> OpenAI
    end

    %% Data flows for querying
    User -- "1. Natural Language Query" --> FastAPI
    FastAPI -- "2. Embed Query & Retrieve Relevant Schema" --> ChromaDB
    ChromaDB -- "3. Relevant Schema Snippets" --> FastAPI
    FastAPI -- "4. Query + Schema Context" --> SQLAgent
    SQLAgent -- "5. Generate SQL (uses OpenAI)" --> SQLAgent
    SQLAgent -- "6. Execute SQL" --> PostgreSQL
    PostgreSQL -- "7. SQL Result" --> SQLAgent
    SQLAgent -- "8. Formatted Answer" --> FastAPI
    FastAPI -- "9. Final Response" --> User

    %% Data flow for schema ingestion (startup)
    FastAPI -- "A. Extract Schema" --> PostgreSQL
    PostgreSQL -- "B. Schema Details" --> FastAPI
    FastAPI -- "C. Embed Schema (uses OpenAI)" --> FastAPI
    FastAPI -- "D. Store Schema Embeddings" --> ChromaDB

    %% Service dependencies
    FastAPI -.-> OpenAI
    SQLAgent -.-> OpenAI
```
Breakdown of Components:
- User: Initiates a query in natural language.
- FastAPI Backend (`app` service):
  - Receives the user's query.
  - Orchestrates the RAG pipeline using LangChain.
  - Handles schema extraction and embedding storage during startup.
  - Communicates with ChromaDB, PostgreSQL, and OpenAI.
- ChromaDB (`chroma` service):
  - Stores vector embeddings of the PostgreSQL database schema.
  - Enables efficient similarity search to find the schema parts relevant to the user's query.
- PostgreSQL Database (`db` service):
  - The target database containing the actual data and schema that the user wants to query.
- LangChain SQL Agent (within the `app` service):
  - A specialized agent that uses an OpenAI LLM and the retrieved schema context to generate and execute SQL queries.
- OpenAI API (external):
  - Provides the embedding models (to convert text to vectors) and the LLMs (for understanding natural language and generating SQL).
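For orientation, a condensed sketch of the query phase might look like the following. The connection string, model choice, and `k` are illustrative assumptions; the actual orchestration lives in `main.py`.

```python
# Sketch of the query phase (assumed names and credentials; not the project's exact code).
from langchain_community.agent_toolkits import SQLDatabaseToolkit, create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

def answer_query(question: str, vector_store) -> str:
    # 1. Retrieve the schema snippets most similar to the question.
    docs = vector_store.similarity_search(question, k=4)
    context = "\n".join(d.page_content for d in docs)

    # 2. Hand the question plus schema context to a LangChain SQL agent.
    db = SQLDatabase.from_uri("postgresql://user:password@db:5432/ragdb")  # assumed credentials
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    agent = create_sql_agent(llm=llm, toolkit=SQLDatabaseToolkit(db=db, llm=llm))

    # 3. The agent generates SQL, executes it against PostgreSQL, and returns an answer.
    prompt = f"Relevant schema:\n{context}\n\nQuestion: {question}"
    return agent.invoke({"input": prompt})["output"]
```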
Key Features:
- Natural Language Querying: Interact with your SQL database using plain English.
- Retrieval Augmented Generation (RAG): Enhances LLM accuracy by providing relevant database schema context.
- OpenAI Integration: Utilizes OpenAI's state-of-the-art embedding and language models.
- PostgreSQL Backend: Designed to work with PostgreSQL databases.
- ChromaDB Vector Store: Efficiently stores and retrieves schema embeddings.
- Dockerized: Easy setup and deployment using Docker and Docker Compose.
- FastAPI: High-performance Python web framework for the backend API.
- LangChain: Comprehensive framework for developing LLM-powered applications.
- Automatic Schema Ingestion: Extracts and embeds database schema on startup.
- Health Check Endpoint: Provides status of internal components.
Prerequisites:
- Docker: Ensure Docker is installed and running on your system. (Install Docker)
- Docker Compose: Ensure Docker Compose (usually included with Docker Desktop) is installed. (Install Docker Compose)
- OpenAI API Key: You will need an active OpenAI API key with credits. You can get one from platform.openai.com.
- (Optional) `psql` or another PostgreSQL client: Useful for directly interacting with the PostgreSQL database for setup, inspection, or debugging.
Setup:
1. Clone the Repository (if you haven't already):
   ```bash
   git clone <repository-url>
   cd <repository-directory>
   ```
2. Configure Environment Variables:
   - Copy the example environment file:
     ```bash
     cp .env.example .env
     ```
   - Edit the `.env` file with your specific configurations:
     ```bash
     # .env
     # PostgreSQL connection details for docker-compose
     POSTGRES_USER=your_db_user          # Replace with your desired PostgreSQL username
     POSTGRES_PASSWORD=your_db_password  # Replace with your desired PostgreSQL password
     POSTGRES_DB=your_db_name            # Replace with your desired PostgreSQL database name

     # OpenAI API key
     OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Replace with your actual OpenAI API key

     # Optional: default is PYTHONUNBUFFERED=1 for better logging in Docker
     PYTHONUNBUFFERED=1
     ```
   Important: The `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB` variables in the `.env` file are used by `docker-compose.yml` to initialize the `db` service and by the `app` service to connect to it. The `app` service uses `DB_HOST=db` (the service name) internally, as defined in `docker-compose.yml`.
3. Build and Run with Docker Compose:
   - Open your terminal in the project's root directory (where `docker-compose.yml` is located).
   - Run the following command:
     ```bash
     docker-compose up --build -d
     ```
     - `--build`: Forces Docker to rebuild the images if there are changes (e.g., in the `Dockerfile` or source code).
     - `-d`: Runs the containers in detached mode (in the background).
4. Initial Schema Ingestion:
   - On first startup, the `app` service connects to the `db` service, extracts its schema (if any tables/views exist), embeds it, and stores the embeddings in the `chroma` service. This can take a few moments depending on the size of your database schema and the latency of the OpenAI API calls.
   - You can monitor the application logs to follow this process:
     ```bash
     docker-compose logs -f app
     ```
     Look for messages indicating initialization of the RAG components, schema loading, and connection to ChromaDB.
5. (Optional) Initialize Database with Schema/Data:
   - If your PostgreSQL database (`db` service) starts empty, the RAG system has no schema to learn about. You need to populate it with tables, views, and data.
   - Method 1: Using an `init.sql` script (recommended for initial setup; see the example script after this list):
     - Create an SQL script (e.g., `init.sql`) in the root of your project directory with `CREATE TABLE`, `INSERT INTO`, `CREATE VIEW` statements, etc.
     - Uncomment the volume mount for `init.sql` in `docker-compose.yml`:
       ```yaml
       services:
         db:
           # ... other db config ...
           volumes:
             - postgres_data:/var/lib/postgresql/data/
             - ./init.sql:/docker-entrypoint-initdb.d/init.sql  # <--- UNCOMMENT THIS
       ```
     - When you run `docker-compose up` (especially for the first time or after `docker-compose down -v`), the PostgreSQL container executes this script, setting up your database.
   - Method 2: Connecting Manually:
     - After the `db` service is running, connect to it with a PostgreSQL client such as `psql`:
       ```bash
       psql -h localhost -p 5432 -U your_db_user -d your_db_name
       ```
       (Use the `POSTGRES_USER` and `POSTGRES_DB` you set in your `.env` file; you will be prompted for the password.)
     - Then run your SQL commands to create the schema and insert data.
   - Important: After adding or significantly changing the database schema, restart the `app` service so it can re-extract and re-embed the schema:
     ```bash
     docker-compose restart app
     ```
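For reference, a minimal hypothetical `init.sql` might look like the following. The `users` table, comments, and sample rows are purely illustrative (chosen to match the example query later in this README), not part of the project.

```sql
-- Hypothetical init.sql: illustrative schema and seed data only.
CREATE TABLE users (
    user_id     SERIAL PRIMARY KEY,
    username    TEXT NOT NULL,
    signup_date DATE NOT NULL
);
COMMENT ON TABLE users IS 'Registered application users.';
COMMENT ON COLUMN users.signup_date IS 'Date when the user signed up.';

INSERT INTO users (username, signup_date) VALUES
    ('alice', '2024-01-15'),
    ('bob',   '2024-02-03');
```

Table and column comments like these are especially valuable here, since they are extracted and embedded alongside the structural schema and give the LLM more context.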
Once the services are up and running and the schema has been ingested, you can send natural language queries to the `/query` endpoint of the FastAPI application (mapped to port 8000 on your host by default).
Request Format:
- Method: `POST`
- URL: `http://localhost:8000/query`
- Body (JSON):
  ```json
  { "natural_language_query": "Your natural language question about the data" }
  ```
Example using curl:
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{
"natural_language_query": "How many users signed up last month?"
}'Example Response (Structure):
The response will be a JSON object containing:
```json
{
  "natural_language_query": "How many users signed up last month?",
  "sql_query": "SELECT COUNT(user_id) FROM users WHERE signup_date >= date_trunc('month', current_date - interval '1 month') AND signup_date < date_trunc('month', current_date);", // Example SQL
  "result": "[{\"count\": 120}]", // Example result (often a JSON string of the SQL output)
  "context_from_vector_db": "Table: users (Schema: public)\nColumns:\n - user_id (integer): Unique identifier for the user.\n - username (text)\n - signup_date (date): Date when the user signed up.\n...", // Snippets of schema used as context
  "error": null // Or an error message if something went wrong
}
```

- `natural_language_query`: Your original query.
- `sql_query`: The SQL query generated by the LLM agent (best-effort extraction).
- `result`: The result of executing the SQL query. The format can vary depending on the query and the agent's output.
- `context_from_vector_db`: The schema information retrieved from ChromaDB and provided to the LLM.
- `error`: Any error message if query processing failed.
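Equivalently, you can call the endpoint from Python. This is a small sketch using the `requests` library; the URL assumes the default port mapping.

```python
# Sketch: query the /query endpoint from Python (assumes default localhost:8000 mapping).
import requests

resp = requests.post(
    "http://localhost:8000/query",
    json={"natural_language_query": "How many users signed up last month?"},
    timeout=120,  # SQL generation + execution can take a while
)
resp.raise_for_status()
data = resp.json()
print(data["sql_query"])  # the generated SQL
print(data["result"])     # the query result
```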
You can check the status of the application and its components via the `/health` endpoint.

Example using curl:
```bash
curl http://localhost:8000/health
```

Example Response:
```json
{
  "status": "ok", // or "degraded", "error"
  "services": {
    "llm": "Initialized",
    "embeddings_model": "Initialized",
    "sql_database_engine": "Initialized",
    "vector_store": "Initialized",
    "qa_chain": "Initialized",
    "sql_agent_executor": "Initialized",
    "postgresql_connection": "OK",
    "chromadb_connection": "OK"
  }
}
```

Configuration:

The following environment variables are crucial for configuring the application and are primarily set in the `.env` file:
- `POSTGRES_USER`: Username for the PostgreSQL database. (Default: `user`)
- `POSTGRES_PASSWORD`: Password for the PostgreSQL database. (Default: `password`)
- `POSTGRES_DB`: Name of the PostgreSQL database. (Default: `ragdb`)
- `OPENAI_API_KEY`: Your API key for OpenAI services (embeddings and LLM). Required.
- `PYTHONUNBUFFERED`: Set to `1` so Python output (like logs) is sent directly to the terminal without buffering, which is useful in Docker. (Default: `1`)
These variables are used by `docker-compose.yml` to configure the services. The `app` service also uses them (or derived values like `DATABASE_URL`) to connect to other services. Internally, the `app` service connects to:
- PostgreSQL via `DB_HOST=db` (the service name) and `DB_PORT=5432`.
- ChromaDB via `CHROMA_HOST=chroma` (the service name) and `CHROMA_PORT=8000` (Chroma's internal port).
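As an illustration of how such a derived connection URL might be assembled (the variable names follow the list above, but this exact helper is an assumption, not the project's code):

```python
# Sketch: build a SQLAlchemy-style connection URL from the environment (assumed helper).
import os

def database_url() -> str:
    user = os.environ["POSTGRES_USER"]
    password = os.environ["POSTGRES_PASSWORD"]
    db = os.environ["POSTGRES_DB"]
    host = os.environ.get("DB_HOST", "db")    # service name inside the Docker network
    port = os.environ.get("DB_PORT", "5432")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}"
```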
Project Structure:

```text
.
├── .env.example          # Example environment variables
├── .env                  # Actual environment variables (ignored by git)
├── Dockerfile            # Docker build definition for the FastAPI application
├── README.md             # This file
├── docker-compose.yml    # Docker Compose configuration for services (app, db, chroma)
├── main.py               # FastAPI application: API endpoints, LangChain RAG logic
├── metadata_extractor.py # Logic for extracting schema from PostgreSQL
├── requirements.txt      # Python dependencies
└── init.sql              # Optional: example SQL script to initialize the database
```
- `main.py`: Core FastAPI application containing the API endpoints (`/query`, `/health`) and the main RAG orchestration logic using LangChain. Initializes and uses components such as the LLM, embedding model, vector store, and SQL agent.
- `metadata_extractor.py`: Functions to connect to the PostgreSQL database and extract its schema (tables, columns, types, comments, views). This information is then used to populate the vector store.
- `Dockerfile`: Instructions for building the Docker image for the `app` service. It installs Python, copies the application code, and sets the command to run Uvicorn.
- `docker-compose.yml`: Defines and configures the multi-container Docker application:
  - `app`: The FastAPI/LangChain application.
  - `db`: The PostgreSQL database service.
  - `chroma`: The ChromaDB vector store service.
  - Includes network configuration and volume mounts for data persistence.
- `requirements.txt`: Lists all Python dependencies required for the project.
- `.env.example` / `.env`: Manage environment-specific configuration such as API keys and database credentials.
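To give a flavor of what `metadata_extractor.py` does, here is a condensed sketch of schema extraction via `information_schema`. The function name and output format are assumptions; see the module itself for the real implementation (which also covers keys, comments, and views).

```python
# Sketch: extract table/column metadata from PostgreSQL via information_schema.
# Assumed helper, illustrative only.
import psycopg2

def extract_schema_docs(conn_str: str) -> list[str]:
    """Return one human-readable description string per table."""
    query = """
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position;
    """
    tables: dict[str, list[str]] = {}
    with psycopg2.connect(conn_str) as conn, conn.cursor() as cur:
        cur.execute(query)
        for table, column, dtype in cur.fetchall():
            tables.setdefault(table, []).append(f" - {column} ({dtype})")
    return [f"Table: {name}\nColumns:\n" + "\n".join(cols) for name, cols in tables.items()]
```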
Local Development (without Docker):

While Docker is recommended for ease of deployment and consistency, you can run the FastAPI application (`main.py`) locally for development if you have Python, PostgreSQL, and ChromaDB installed and running on your machine.
1. Install Dependencies:
   ```bash
   pip install -r requirements.txt
   ```
2. Set Environment Variables:
   - Ensure `OPENAI_API_KEY` is set in your environment.
   - Configure the database connection variables (e.g., `DB_HOST=localhost`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`) and the ChromaDB connection variables (`CHROMA_HOST=localhost`, and `CHROMA_PORT` set to the host-exposed Chroma port, e.g., 8001 if you use the provided docker-compose for Chroma only).
   - You can create a `.env` file in the project root, which `main.py` can load (if modified to do so, or by using `python-dotenv` directly in its `if __name__ == "__main__":` block).
3. Run PostgreSQL and ChromaDB:
   - Start your local PostgreSQL instance.
   - Start your local ChromaDB instance (or run just the `db` and `chroma` services from `docker-compose.yml`).
4. Run the FastAPI App:
   ```bash
   # Ensure your .env is configured for local hosts for DB and Chroma, for example:
   # DB_HOST=localhost
   # CHROMA_HOST=localhost
   # CHROMA_PORT=8001   # assuming Chroma is mapped to 8001 on the host
   # POSTGRES_PORT=5432
   uvicorn main:app --reload --host 0.0.0.0 --port 8000
   ```
   The `if __name__ == "__main__":` blocks in `main.py` and `metadata_extractor.py` provide examples/stubs for how this might be initiated, but the primary execution path is via Uvicorn, as specified in the `Dockerfile`'s `CMD`.
Troubleshooting:
- OpenAI API Key Issues:
  - Ensure `OPENAI_API_KEY` is correctly set in your `.env` file and that the key is valid and has credits.
  - OpenAI errors such as `AuthenticationError` or `RateLimitError` will appear in the `app` logs.
- Service Connection Issues (`app` can't reach `db` or `chroma`):
  - Check `docker-compose logs app`.
  - Ensure the services are running: `docker-compose ps`.
  - Verify Docker networking. Docker Compose usually handles this, but complex local network configurations can interfere.
  - The `app` service connects to `db` on host `db` and to `chroma` on host `chroma` (their service names within the Docker network).
- Database Schema Not Loaded / Empty Context:
  - Ensure your PostgreSQL database (`db` service) actually has tables and views defined. If it is empty, the RAG system has no schema to learn. See "Initialize Database with Schema/Data" above.
  - Check `docker-compose logs app` during startup for messages from `metadata_extractor.py` about schema loading.
  - Restart the `app` service after making schema changes: `docker-compose restart app`.
- ChromaDB Errors:
  - Check `docker-compose logs chroma`.
  - Ensure the `chroma_data` volume has correct permissions if you have modified its setup.
- Viewing Logs:
  - Application: `docker-compose logs -f app`
  - Database: `docker-compose logs -f db`
  - ChromaDB: `docker-compose logs -f chroma`
- Port Conflicts:
  - If ports `8000` (app), `8001` (Chroma host mapping), or `5432` (Postgres host mapping) are already in use on your host machine, `docker-compose up` will fail. Change the host-side port mapping in `docker-compose.yml` (e.g., `"8080:8000"`).
Contributing: (If this were an open project, contribution guidelines would go here.)

License: (Specify license information here, e.g., MIT License.)