DocuSenseAI is an AI-powered tool designed to query and retrieve relevant documents across various file formats, including PDFs, text files, CSVs, Excel spreadsheets, and images.
- PDF (
.pdf) - Text (
.txt) - CSV (
.csv) - Excel (
.xlsx) - Image (
.png,.jpg,.jpeg,.gif)
When working with textual and image data, I discovered that cosine similarity does not perform well for images. Even when using embedding models like CLIP for both images and text, the latent spaces differ significantly, leading to inaccurate similarity measures.
For images, I use PyTesseract to extract text, followed by the OpenAI API to generate a description of the image. The embeddings of these descriptions are then stored in a vector database. A similar approach is applied to text documents, where a description is generated using the OpenAI API, and its embeddings are stored in the vector database.
Each record in the vector database contains the following metadata:
- Type
- Description
- Content
- Path
The top K documents' metadata is incorporated into the chat history along with the system prompt for the OpenAI API. A retrieval prompt is then added, and the response includes the answer to the query as well as the path to the relevant document.
- Python 3.9+
- Tesseract OCR (system dependency for image text extraction)
- OpenAI API key
macOS:
brew install tesseractUbuntu/Debian:
sudo apt update
sudo apt install tesseract-ocr libtesseract-devWindows:
Download the installer from UB Mannheim and add it to your PATH.
Create a .env file in the project root:
OPENAI_API_KEY=your-api-key-here
Or export it directly:
export OPENAI_API_KEY=your-api-key-herepip install -r requirements.txtfrom docusenseai import DocuSenseAI
dsa = DocuSenseAI()
# Add a document to the collection
dsa.add_document("my_docs", "path/to/report.pdf")
# Query the collection
response = dsa.query("my_docs", "What is the revenue?")
print(response)
# Delete the collection
dsa.delete_collection("my_docs")