DocuSenseAI

Description

DocuSenseAI is an AI-powered tool designed to query and retrieve relevant documents across various file formats, including PDFs, text files, CSVs, Excel spreadsheets, and images.

Supported Document Formats

PDF (.pdf)
Text (.txt)
CSV (.csv)
Excel (.xlsx)
Image (.png, .jpg, .jpeg, .gif)

Motivation

When working with textual and image data, I discovered that cosine similarity does not perform well for images. Even when using embedding models like CLIP for both images and text, the latent spaces differ significantly, leading to inaccurate similarity measures.

Approach

For images, I use PyTesseract to extract text, followed by the OpenAI API to generate a description of the image. The embeddings of these descriptions are then stored in a vector database. A similar approach is applied to text documents, where a description is generated using the OpenAI API, and its embeddings are stored in the vector database.

Metadata

Each record in the vector database contains the following metadata:

Type
Description
Content
Path

Retrieval Process

The top K documents' metadata is incorporated into the chat history along with the system prompt for the OpenAI API. A retrieval prompt is then added, and the response includes the answer to the query as well as the path to the relevant document.

Prerequisites

Python 3.9+
Tesseract OCR (system dependency for image text extraction)
OpenAI API key

Install Tesseract

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt update
sudo apt install tesseract-ocr libtesseract-dev

Windows:

Download the installer from UB Mannheim and add it to your PATH.

Set up your OpenAI API key

Create a .env file in the project root:

OPENAI_API_KEY=your-api-key-here

Or export it directly:

export OPENAI_API_KEY=your-api-key-here

Install Python dependencies

pip install -r requirements.txt

Usage

from docusenseai import DocuSenseAI

dsa = DocuSenseAI()

# Add a document to the collection
dsa.add_document("my_docs", "path/to/report.pdf")

# Query the collection
response = dsa.query("my_docs", "What is the revenue?")
print(response)

# Delete the collection
dsa.delete_collection("my_docs")

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
utils		utils
.gitignore		.gitignore
README.md		README.md
demo_notebook.ipynb		demo_notebook.ipynb
docusenseai.py		docusenseai.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DocuSenseAI

Description

Supported Document Formats

Motivation

Approach

Metadata

Retrieval Process

Prerequisites

Install Tesseract

Set up your OpenAI API key

Install Python dependencies

Usage

About

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DocuSenseAI

Description

Supported Document Formats

Motivation

Approach

Metadata

Retrieval Process

Prerequisites

Install Tesseract

Set up your OpenAI API key

Install Python dependencies

Usage

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages