
This locally run app lets users upload various file types and store them in their own Supabase vector database.


jackvandervall/file-to-vector

File-to-Vector

Note: This repository is preserved as a technical showcase. Built in early 2025, it provides a simple GUI for uploading documents to Supabase pgvector, functionality now better served by LangChain, LlamaIndex, or native platform features.

Description

File-to-Vector is a lightweight application that converts documents into vector embeddings and stores them in a Supabase database. It lets researchers and developers build semantic search and retrieval-augmented generation (RAG) pipelines without writing their own data-processing code.

The tool supports PDF, DOCX, and spreadsheet (CSV) files, and uses embedding models from Cohere and OpenAI to generate the vector representations.

Demo

Key Features

  • Multi-Format Support: Seamlessly extracts text from PDFs (including tables via pdfplumber), Word documents, and CSV files.
  • Dual Embedding Providers: Choose between Cohere's embed-english-v3.0 or OpenAI's text-embedding-3-small models based on your use case.
  • Intelligent Chunking: Automatically splits documents into semantically meaningful chunks using NLTK sentence tokenization.
  • Real-Time Monitoring: Track upload progress with live progress bars and success notifications.
  • Direct Supabase Integration: Vectors are stored directly into your Supabase pgvector-enabled database for immediate use in RAG applications.

Technical Architecture

The system is built on a modular Python stack optimized for simplicity and extensibility:

  • Streamlit: Provides an interactive, multi-page frontend for file uploads, database monitoring, and configuration.
  • PyMuPDF & pdfplumber: Handle PDF text and table extraction with high fidelity.
  • python-docx: Parses Word documents to extract paragraph text.
  • Cohere / OpenAI SDKs: Generate semantic embeddings via API calls.
  • Supabase Python Client: Manages authentication and vector storage operations against a PostgreSQL database with the pgvector extension.

Application Workflow

File-to-Vector follows a straightforward pipeline from document ingestion to vector storage:

1. Phase: Document Upload

Users upload files through the Streamlit interface. The application detects the file type and routes it to the appropriate text extraction module.
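The routing step can be illustrated with a small dispatch table keyed on file extension. This is a minimal sketch, not the repository's actual code: the names `EXTRACTORS`, `route_file`, `extract_csv`, and `extract_placeholder` are hypothetical, and the PDF and DOCX entries are placeholders where a real parser (PyMuPDF, pdfplumber, python-docx) would be plugged in.

```python
from pathlib import Path
from typing import Callable, Dict

# An extractor takes the raw uploaded bytes and returns plain text.
Extractor = Callable[[bytes], str]


def extract_csv(data: bytes) -> str:
    """Decode CSV bytes as UTF-8 text (stand-in for a real CSV parser)."""
    return data.decode("utf-8", errors="replace")


def extract_placeholder(kind: str) -> Extractor:
    """Build a stand-in extractor for formats that need a third-party parser."""
    def _extract(data: bytes) -> str:
        raise NotImplementedError(f"{kind} extraction needs a real parser")
    return _extract


# Hypothetical mapping from file extension to extraction function.
EXTRACTORS: Dict[str, Extractor] = {
    ".pdf": extract_placeholder("PDF"),
    ".docx": extract_placeholder("DOCX"),
    ".csv": extract_csv,
}


def route_file(filename: str) -> Extractor:
    """Pick the extractor for a file based on its (case-insensitive) extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Keeping the mapping in one dictionary means a new format only needs one new entry, and unsupported uploads fail early with a clear message.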

2. Phase: Text Extraction & Chunking

Text is extracted from the uploaded document using specialized parsers. The content is then split into chunks (default: 300 characters) using sentence-aware tokenization to preserve semantic coherence.
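Sentence-aware chunking can be sketched as a greedy packer: sentences are appended to the current chunk until the next one would exceed the character limit. The sketch below is an illustration under stated assumptions, not the app's implementation. It uses a naive regex splitter as a stand-in for NLTK's `sent_tokenize` so that it runs without downloading tokenizer data, and the function names and the hard-split fallback for over-long sentences are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive regex sentence splitter (stand-in for NLTK's sent_tokenize)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text: str, max_chars: int = 300) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The sentence does not fit: close the current chunk first.
        if current:
            chunks.append(current)
        # Assumption: a single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than at fixed offsets keeps each chunk readable on its own, which tends to matter when chunks are later shown as retrieval results.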

3. Phase: Embedding Generation

Each text chunk is sent to the selected embedding provider (Cohere or OpenAI). The resulting vectors are padded or truncated to match the expected database dimension.
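The dimension-matching step matters because the two providers return vectors of different lengths: Cohere's embed-english-v3.0 produces 1024 dimensions and OpenAI's text-embedding-3-small produces 1536 by default, while a pgvector column has one fixed size. A minimal sketch of the pad-or-truncate logic follows. The function name `fit_dimension` is hypothetical, and zero-padding is an assumption, since the padding value is not stated here.

```python
from typing import List, Sequence


def fit_dimension(vector: Sequence[float], target_dim: int) -> List[float]:
    """Return a copy of vector with exactly target_dim entries.

    Longer vectors are truncated; shorter ones are padded with zeros
    (zero-padding is an assumption made for this sketch).
    """
    if target_dim <= 0:
        raise ValueError("target_dim must be positive")
    values = list(vector[:target_dim])
    values.extend([0.0] * (target_dim - len(values)))
    return values
```

For example, `fit_dimension(embedding, 1536)` would leave an OpenAI vector unchanged and append 512 zeros to a Cohere vector. Truncation discards information, so where possible it is safer to size the column to the model in use.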

4. Phase: Vector Storage

The embeddings, along with the original text and metadata, are inserted into the configured Supabase table. Progress is displayed in real-time, and users receive confirmation upon completion.
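The storage step can be sketched as building one row per chunk and inserting the rows in batches while reporting progress. This is a hedged illustration: the column names (`content`, `embedding`, `metadata`), the batch size, and the helper names are assumptions, and the database call is passed in as a callable so the sketch runs without a Supabase project.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Row = Dict[str, object]


def build_rows(
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
    source: str,
) -> List[Row]:
    """Pair each chunk with its embedding and some basic metadata."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    return [
        {
            "content": chunk,              # hypothetical column name
            "embedding": list(vector),     # hypothetical column name
            "metadata": {"source": source, "chunk_index": index},
        }
        for index, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]


def insert_in_batches(
    rows: Sequence[Row],
    insert_batch: Callable[[List[Row]], None],
    batch_size: int = 50,
) -> Iterator[float]:
    """Insert rows batch by batch, yielding the completed fraction (0 to 1)."""
    total = len(rows)
    for start in range(0, total, batch_size):
        insert_batch(list(rows[start:start + batch_size]))
        yield min(start + batch_size, total) / total
```

With the Supabase Python client, `insert_batch` could be something like `lambda batch: client.table("documents").insert(batch).execute()`, where `"documents"` is a placeholder table name. Each yielded fraction could then be passed to Streamlit's `st.progress`.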

Installation & Setup

# Clone the repository and enter it
git clone https://github.com/jackvandervall/file-to-vector.git
cd file-to-vector

# Install dependencies (a virtual environment is recommended)
pip install -r requirements.txt

# Navigate to the application directory
cd app

# Configure Environment Variables
# Create a .env file or use the Upload tab to enter your Supabase URL, service_role key, and embedding API keys

# Launch the Application
streamlit run main.py
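The configuration step can be checked before launch with a few lines of Python. The sketch below validates that the needed credentials are present. The variable names (`SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, `COHERE_API_KEY`, `OPENAI_API_KEY`) are assumptions made for illustration and may differ from the names the app actually reads.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical variable names; check the app's own code for the real ones.
REQUIRED = ("SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")
PROVIDER_KEYS = ("COHERE_API_KEY", "OPENAI_API_KEY")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect credentials, requiring Supabase access and one embedding key."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    provider_keys = {name: env[name] for name in PROVIDER_KEYS if env.get(name)}
    if not provider_keys:
        raise ValueError(f"Set at least one of: {', '.join(PROVIDER_KEYS)}")
    settings = {name: env[name] for name in REQUIRED}
    settings.update(provider_keys)
    return settings
```

Because the service_role key bypasses row-level security in Supabase, the `.env` file should stay out of version control.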

Credits

Developed by Jack van der Vall during an internship at the Erasmus Centre for Data Analytics.
