Note: This repository is preserved as a technical showcase. Built in early 2025, it provides a simple GUI for uploading documents to Supabase pgvector, functionality now better served by LangChain, LlamaIndex, or native platform features.
File-to-Vector is a lightweight, user-friendly application designed to streamline the process of converting documents into vector embeddings for storage in a Supabase database. It enables researchers and developers to build semantic search and retrieval-augmented generation (RAG) pipelines without writing complex data processing code.
The tool supports a variety of file types including PDFs, DOCX, and spreadsheets, leveraging state-of-the-art embedding models from Cohere and OpenAI to generate high-quality vector representations.
- Multi-Format Support: Seamlessly extracts text from PDFs (including tables via pdfplumber), Word documents, and CSV files.
- Dual Embedding Providers: Choose between Cohere's `embed-english-v3.0` or OpenAI's `text-embedding-3-small` models based on your use case.
- Intelligent Chunking: Automatically splits documents into semantically meaningful chunks using NLTK sentence tokenization.
- Real-Time Monitoring: Track upload progress with live progress bars and success notifications.
- Direct Supabase Integration: Vectors are stored directly in your `pgvector`-enabled Supabase database for immediate use in RAG applications.
The system is built on a modular Python stack optimized for simplicity and extensibility:
- Streamlit: Provides an interactive, multi-page frontend for file uploads, database monitoring, and configuration.
- PyMuPDF & pdfplumber: Handle PDF text and table extraction with high fidelity.
- python-docx: Parses Word documents to extract paragraph text.
- Cohere / OpenAI SDKs: Generate semantic embeddings via API calls.
- Supabase Python Client: Manages authentication and vector storage operations against a PostgreSQL database with the `pgvector` extension.
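As an illustration of how these parsers fit together, the sketch below routes an uploaded file to the right extractor by extension. Function and variable names here are illustrative, not the app's actual API; the PDF and DOCX branches assume PyMuPDF and python-docx are installed.

```python
def extract_text(filename: str, data: bytes) -> str:
    """Return plain text from an uploaded file's raw bytes.

    A sketch of type-based routing; the real app also pulls PDF tables
    via pdfplumber, which is omitted here for brevity.
    """
    name = filename.lower()
    if name.endswith(".pdf"):
        import fitz  # PyMuPDF
        with fitz.open(stream=data, filetype="pdf") as pdf:
            return "\n".join(page.get_text() for page in pdf)
    if name.endswith(".docx"):
        import io
        from docx import Document
        doc = Document(io.BytesIO(data))
        return "\n".join(p.text for p in doc.paragraphs)
    if name.endswith(".csv"):
        # Spreadsheet text is ingested as-is.
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"Unsupported file type: {filename}")
```

Keeping the parser imports inside each branch means a missing optional dependency only fails for the file types that need it.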
File-to-Vector follows a straightforward pipeline from document ingestion to vector storage:
Users upload files through the Streamlit interface. The application detects the file type and routes it to the appropriate text extraction module.
Text is extracted from the uploaded document using specialized parsers. The content is then split into chunks (default: 300 characters) using sentence-aware tokenization to preserve semantic coherence.
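In practice the sentence list would come from `nltk.tokenize.sent_tokenize`; the packing step might then look roughly like this sketch (function name and exact policy are assumptions, not the app's actual code):

```python
def chunk_sentences(sentences, max_chars=300):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Sentences are never split, so a single sentence longer than
    max_chars becomes its own (oversized) chunk.
    """
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Packing by sentence rather than by a hard character cut is what preserves semantic coherence within each chunk.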
Each text chunk is sent to the selected embedding provider (Cohere or OpenAI). The resulting vectors are padded or truncated to match the expected database dimension.
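The pad-or-truncate step can be expressed as a small helper like the one below (the name `fit_dimension` is illustrative; the embedding API call itself is omitted since it depends on the selected provider):

```python
def fit_dimension(vector, target_dim):
    """Pad with zeros or truncate so the vector matches the
    pgvector column dimension configured in the Supabase table."""
    if len(vector) >= target_dim:
        return vector[:target_dim]
    return vector + [0.0] * (target_dim - len(vector))
```

Zero-padding keeps shorter vectors insertable, at the cost of slightly distorting similarity scores, so matching the table dimension to the model's native output is still preferable.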
The embeddings, along with the original text and metadata, are inserted into the configured Supabase table. Progress is displayed in real-time, and users receive confirmation upon completion.
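A minimal sketch of the storage step, assuming the supabase-py client and illustrative column names (`content`, `embedding`, `metadata`):

```python
def build_rows(chunks, embeddings, source_file):
    """Pair each text chunk with its vector and provenance metadata.
    Column names are assumptions, not the app's actual schema."""
    return [
        {"content": text, "embedding": vec, "metadata": {"source": source_file}}
        for text, vec in zip(chunks, embeddings)
    ]

def store_chunks(url, service_key, table, rows):
    """Insert rows into a pgvector-backed Supabase table."""
    from supabase import create_client  # supabase-py
    client = create_client(url, service_key)
    client.table(table).insert(rows).execute()
```

Batching all rows into a single insert keeps the round-trip count low; very large documents may still warrant splitting the insert into smaller batches.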
# Clone the repository
git clone https://github.com/jackvandervall/file-to-vector.git
# Install dependencies (a virtual environment is recommended)
pip install -r requirements.txt
# Navigate to the application directory
cd app
# Configure Environment Variables
# Create a .env file or use the Upload tab to enter your Supabase URL, service_role key, and embedding API keys
# Launch the Application
streamlit run main.py

Developed by Jack van der Vall during an internship at the Erasmus Centre for Data Analytics.
