# pdfLLM

pdfLLM is a Retrieval-Augmented Generation (RAG) microservice that lets users upload, process, and query documents (e.g., PDFs, text files, Word documents, spreadsheets, images) through a FastAPI backend, with a Streamlit frontend for debugging. It converts documents to markdown, stores text chunks in a Qdrant vector database, and uses OpenAI embeddings and chat models to answer queries based on document content. The application supports multiple file formats and provides a RESTful API for programmatic access plus a web interface for interactive testing.
## Features

- Document Processing: Upload and convert files (`.pdf`, `.txt`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.csv`, `.jpg`, `.jpeg`, `.png`, `.heic`) to markdown.
- Vector Storage: Store document chunks in Qdrant with OpenAI embeddings for efficient retrieval.
- Querying: Search documents or generate chat responses using OpenAI's `gpt-4o-mini` model with context from relevant chunks.
- FastAPI Backend: Exposes endpoints for file processing, searching, chatting, listing, and deleting documents.
- Streamlit Frontend: Provides a UI for uploading files, managing documents, chatting, and debugging Qdrant chunks.
- State Persistence: Shares `file_metadata` and `chat_sessions` between services via `state.json`.
## Prerequisites

- Docker: Install Docker and Docker Compose.
- OpenAI API Key: Obtain an API key from OpenAI.
- System Requirements: At least 4GB RAM and 10GB disk space for containers and data.
- Supported OS: Linux, macOS, or Windows with WSL2.
## Project Structure

```
pdfLLM/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── app/
│   ├── converters/
│   │   ├── __init__.py
│   │   ├── doc_converter.py
│   │   ├── excel_converter.py
│   │   ├── image_converter.py
│   │   ├── pdf_converter.py
│   │   └── txt_converter.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── qdrant_handler.py
│   │   └── text_processor.py
│   ├── data/
│   │   └── state.json        # generated at runtime
│   ├── temp_uploads/         # generated at runtime
│   ├── main.py               # FastAPI backend
│   └── streamlit_app.py      # Streamlit frontend
```
## Installation

1. Clone the Repository:

   ```bash
   git clone <repository-url>
   cd pdfLLM
   ```

2. Create `.env` File: Create a `.env` file in the project root with your OpenAI API key:

   ```bash
   echo "OPENAI_API_KEY=your-openai-api-key" > .env
   ```

3. Create Data Directory: Ensure the `app/data` directory exists for `state.json`:

   ```bash
   mkdir -p app/data
   chmod -R 777 app/data
   ```
## Deployment

Deploy the application using Docker Compose, which runs three services:

- rag-service: FastAPI backend on `http://localhost:8000`.
- streamlit-service: Streamlit frontend on `http://localhost:8501`.
- qdrant: Qdrant vector database on `http://localhost:6333`.
1. Build and Start Containers:

   ```bash
   docker-compose up --build
   ```

   This builds the Docker image, starts the services, and maps ports `8000` (FastAPI), `8501` (Streamlit), and `6333` (Qdrant).

2. Verify Services:

   ```bash
   docker ps
   ```

   Ensure `pdfllm-rag-service-1`, `pdfllm-streamlit-service-1`, and `pdfllm-qdrant-1` are `Up`.

3. Access the Application:

   - FastAPI: Test endpoints at `http://localhost:8000` (see FastAPI Endpoints below).
   - Streamlit: Open `http://localhost:8501` in a browser for the web interface.
   - Qdrant: Access the REST API at `http://localhost:6333` (optional, for debugging).

4. Stop Containers:

   ```bash
   docker-compose down
   ```
## FastAPI Endpoints

The FastAPI backend (`app/main.py`) provides the following endpoints for programmatic access. All endpoints require a `user_id` to scope data to specific users.
### POST `/process_file`

Description: Upload and process a file, converting it to markdown, chunking the content, generating embeddings, and storing chunks in Qdrant. The file metadata is saved in `state.json`.

Request:

- Content-Type: `multipart/form-data`
- Parameters:
  - `file` (UploadFile): The file to process (e.g., `.pdf`, `.txt`, `.docx`, `.xlsx`, `.png`). Max size: 200MB.
  - `user_id` (str): Unique identifier for the user.
- Example:

  ```bash
  curl -X POST http://localhost:8000/process_file \
    -F "file=@notes.txt" \
    -F "user_id=test_user"
  ```

Response:

- Status: 200 OK
- Body:

  ```json
  {
    "status": "success",
    "file_id": "uuid-string",
    "filename": "notes.txt"
  }
  ```

- Errors:
  - 400: File size exceeds 200MB or unsupported format.
  - 500: Processing or Qdrant save failure.
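The same upload can be scripted from Python. A minimal sketch using the `requests` library, mirroring the curl example above (the helper name and timeout are illustrative, not part of the service):

```python
import requests

API_BASE = "http://localhost:8000"

def process_file(path, user_id):
    """Upload a file to /process_file and return the parsed JSON response."""
    with open(path, "rb") as fh:
        resp = requests.post(
            f"{API_BASE}/process_file",
            files={"file": fh},          # multipart field named "file"
            data={"user_id": user_id},   # plain form field
            timeout=300,                 # embedding large files can be slow
        )
    resp.raise_for_status()
    return resp.json()

# Usage (requires the rag-service to be running):
#   info = process_file("notes.txt", "test_user")
#   print(info["file_id"])
```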
### POST `/search`

Description: Search for relevant document chunks in Qdrant based on a query, using OpenAI embeddings. Optionally filter by a specific file.

Request:

- Content-Type: `multipart/form-data`
- Parameters:
  - `query` (str): The search query.
  - `user_id` (str): Unique identifier for the user.
  - `file_id` (str, optional): Filter results to a specific document.
  - `limit` (int, default=5): Maximum number of chunks to return.
- Example:

  ```bash
  curl -X POST http://localhost:8000/search \
    -F "query=What is this document about?" \
    -F "user_id=test_user" \
    -F "file_id=uuid-string"
  ```

Response:

- Status: 200 OK
- Body:

  ```json
  {
    "status": "success",
    "results": [
      {
        "chunk_id": "uuid-string",
        "document_id": "uuid-string",
        "filename": "notes.txt",
        "parent_section": "Section Title",
        "chunk_index": 1,
        "content": "Chunk content...",
        "score": 0.95
      }
    ]
  }
  ```

- Errors: 500 (Qdrant or embedding failure).
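A Python sketch of the same call; the form-building helper is split out so the optional `file_id` handling is explicit (helper names are illustrative):

```python
import requests

API_BASE = "http://localhost:8000"

def build_search_form(query, user_id, file_id=None, limit=5):
    """Assemble the form fields for /search; omit file_id unless given."""
    form = {"query": query, "user_id": user_id, "limit": str(limit)}
    if file_id is not None:
        form["file_id"] = file_id  # restrict the search to one document
    return form

def search_chunks(query, user_id, file_id=None, limit=5):
    """Call /search and return the list of matching chunks."""
    resp = requests.post(f"{API_BASE}/search",
                         data=build_search_form(query, user_id, file_id, limit),
                         timeout=60)
    resp.raise_for_status()
    return resp.json()["results"]

# Usage (requires the rag-service to be running):
#   hits = search_chunks("What is this document about?", "test_user")
```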
### POST `/chat`

Description: Generate a chat response based on relevant document chunks retrieved from Qdrant, using OpenAI's `gpt-4o-mini` model. Optionally filter by specific files.

Request:

- Content-Type: `multipart/form-data`
- Parameters:
  - `query` (str): The user's question.
  - `user_id` (str): Unique identifier for the user.
  - `file_ids` (List[str], optional): List of document IDs to filter context.
- Example:

  ```bash
  curl -X POST http://localhost:8000/chat \
    -F "query=What was discussed in the meeting?" \
    -F "user_id=test_user" \
    -F "file_ids=uuid-string1" \
    -F "file_ids=uuid-string2"
  ```

Response:

- Status: 200 OK
- Body:

  ```json
  {
    "query": "What was discussed in the meeting?",
    "response": "The meeting discussed... (sourced from notes.txt, Section 1)",
    "chat_id": "uuid-string",
    "sources": [
      {
        "filename": "notes.txt",
        "chunk_index": 1,
        "score": 0.95
      }
    ]
  }
  ```

- Errors: 500 (Qdrant or OpenAI failure).
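Because `file_ids` is a repeated form field (as in the curl example), a Python client should pass the form as a list of `(name, value)` pairs rather than a dict. A sketch (helper names are illustrative):

```python
import requests

API_BASE = "http://localhost:8000"

def build_chat_form(query, user_id, file_ids=None):
    """Build the form as (name, value) pairs so `file_ids` can repeat."""
    pairs = [("query", query), ("user_id", user_id)]
    for fid in file_ids or []:
        pairs.append(("file_ids", fid))  # one field per document ID
    return pairs

def chat(query, user_id, file_ids=None):
    """Call /chat and return the response JSON (answer plus sources)."""
    resp = requests.post(f"{API_BASE}/chat",
                         data=build_chat_form(query, user_id, file_ids),
                         timeout=120)
    resp.raise_for_status()
    return resp.json()

# Usage (requires the rag-service to be running):
#   answer = chat("What was discussed in the meeting?", "test_user",
#                 ["uuid-string1", "uuid-string2"])
```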
### GET `/documents`

Description: List all documents uploaded by a user, retrieved from `state.json`.

Request:

- Parameters (query):
  - `user_id` (str): Unique identifier for the user.
- Example:

  ```bash
  curl -X GET "http://localhost:8000/documents?user_id=test_user"
  ```

Response:

- Status: 200 OK
- Body:

  ```json
  {
    "status": "success",
    "documents": [
      {
        "file_id": "uuid-string",
        "filename": "notes.txt",
        "file_type": ".txt",
        "upload_date": "2025-06-27 20:30:00"
      }
    ]
  }
  ```

- Errors: 500 (state file access failure).
### DELETE `/documents/{file_id}`

Description: Delete a document and its associated Qdrant chunks, updating `state.json`.

Request:

- Parameters (path/query):
  - `file_id` (str): The document ID to delete.
  - `user_id` (str): Unique identifier for the user.
- Example:

  ```bash
  curl -X DELETE "http://localhost:8000/documents/uuid-string?user_id=test_user"
  ```

Response:

- Status: 200 OK
- Body:

  ```json
  {
    "status": "success",
    "file_id": "uuid-string"
  }
  ```

- Errors: 500 (Qdrant or state file failure).
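The list and delete endpoints compose naturally, e.g. to wipe a test user's documents between runs. A hedged sketch assuming the response shapes documented above (the helper name is illustrative):

```python
import requests

API_BASE = "http://localhost:8000"

def delete_all_documents(user_id):
    """Delete every document owned by `user_id`; return the deleted file IDs."""
    resp = requests.get(f"{API_BASE}/documents",
                        params={"user_id": user_id}, timeout=30)
    resp.raise_for_status()
    deleted = []
    for doc in resp.json()["documents"]:
        fid = doc["file_id"]
        # DELETE /documents/{file_id}?user_id=... as shown in the curl example
        requests.delete(f"{API_BASE}/documents/{fid}",
                        params={"user_id": user_id},
                        timeout=30).raise_for_status()
        deleted.append(fid)
    return deleted

# Usage (requires the rag-service to be running):
#   delete_all_documents("test_user")
```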
## Streamlit Interface

The Streamlit interface (`http://localhost:8501`) provides:

- Document Management: Upload files, view the document list, select documents for context, preview files, and delete documents.
- Chat Interface: Create chat sessions, send queries, and view responses with source citations.
- Debug Interface: Inspect file metadata and Qdrant chunks for a specific document (`?page=debug&file_id=uuid-string`).
- State Persistence: Syncs with FastAPI via `app/data/state.json` for `file_metadata` and `chat_sessions`.

Usage:

1. Open `http://localhost:8501`.
2. Enter a `user_id` (e.g., `test_user`).
3. Upload a file (e.g., `notes.txt`, `document.pdf`).
4. Select documents via checkboxes for chat context.
5. Create a chat session and send queries.
6. Use the debug page to inspect Qdrant chunks and metadata.
7. Test cross-service consistency (e.g., upload via FastAPI, view in Streamlit).
## Troubleshooting

1. Containers Not Running:

   ```bash
   docker ps
   docker logs pdfllm-rag-service-1
   docker logs pdfllm-streamlit-service-1
   docker logs pdfllm-qdrant-1
   ```

   - Ensure `app/main.py`, `app/streamlit_app.py`, and `requirements.txt` exist.
   - Verify `OPENAI_API_KEY` in `.env`.

2. State Sharing Issues:

   - Check `app/data/state.json`:

     ```bash
     cat app/data/state.json
     chmod -R 777 app/data
     ```

   - Ensure both services write to `/app/data/state.json`.

3. FastAPI Errors:

   - If endpoints fail, try `gpt-3.5-turbo` in `app/main.py`:

     ```python
     model="gpt-3.5-turbo"
     ```

   - Upgrade to `openai==1.40.0` for async support:

     ```bash
     echo "openai==1.40.0" >> requirements.txt
     docker-compose up --build
     ```

4. Streamlit Issues:

   - If documents don't appear, check:

     ```python
     print(st.session_state.file_metadata)
     ```

   - If checkboxes fail, verify:

     ```python
     print(st.session_state.selected_docs)
     ```

5. PDF Headings:

   - If Qdrant chunks show a generic `# PDF Content` in `parent_section`, check `app/converters/pdf_converter.py` and update it to extract meaningful headings (e.g., from PDF metadata or content structure).
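When diagnosing state-sharing problems, a quick sanity check is whether `state.json` parses and contains the two top-level keys the services share (`file_metadata` and `chat_sessions`, per the State Persistence feature). A small sketch, assuming that file layout:

```python
import json
from pathlib import Path

STATE_PATH = Path("app/data/state.json")

def check_state(path=STATE_PATH):
    """Report which of the expected top-level keys exist in state.json."""
    state = json.loads(path.read_text()) if path.exists() else {}
    return {key: key in state for key in ("file_metadata", "chat_sessions")}

# A missing or empty file reports both keys as absent:
#   check_state(Path("missing.json"))
#   → {"file_metadata": False, "chat_sessions": False}
```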
## Known Issues

- Generic PDF Headings: Chunks may have `# PDF Content` in `parent_section`. To fix, modify `app/converters/pdf_converter.py` to extract specific headings.
- OpenAI Sync Calls: `openai==0.27.8` uses synchronous calls, which may slow performance. Upgrade to `openai==1.40.0` for async support if needed.
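One common way to replace the generic `# PDF Content` heading is a font-size heuristic: lines set noticeably larger than the page's median text size are likely headings. A minimal sketch of that heuristic over pre-extracted `(text, font_size)` pairs; the extraction step itself (e.g., via a library such as PyMuPDF) is out of scope here, and the function name and threshold are illustrative:

```python
from statistics import median

def detect_headings(lines, ratio=1.2):
    """Return texts whose font size exceeds `ratio` x the median body size.

    `lines` is a list of (text, font_size) pairs, e.g. produced by a PDF
    text-extraction library.
    """
    if not lines:
        return []
    body_size = median(size for _, size in lines)
    return [text for text, size in lines if size > ratio * body_size]

# Example: the 18pt line stands out against 11pt body text.
#   detect_headings([("Title", 18), ("para", 11), ("para", 11)])
#   → ["Title"]
```

The detected headings could then be assigned to `parent_section` when chunking, instead of the generic placeholder.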
## Contributing

Report issues or suggest features via the repository's issue tracker.

## License

MIT License