Note: This repository is preserved as a technical showcase. Built in early 2025, it provides a simple GUI for uploading documents to Supabase pgvector, functionality now better served by LangChain, LlamaIndex, or native platform features.
File-to-Vector is a lightweight, user-friendly application designed to streamline the process of converting documents into vector embeddings for storage in a Supabase database. It enables researchers and developers to build semantic search and retrieval-augmented generation (RAG) pipelines without writing complex data processing code.
The tool supports a variety of file types including PDFs, DOCX, and spreadsheets, leveraging state-of-the-art embedding models from Cohere and OpenAI to generate high-quality vector representations.
- Multi-Format Support: Seamlessly extracts text from PDFs (including tables via pdfplumber), Word documents, and CSV files.
- Dual Embedding Providers: Choose between Cohere's `embed-english-v3.0` or OpenAI's `text-embedding-3-small` models based on your use case.
- Intelligent Chunking: Automatically splits documents into semantically meaningful chunks using NLTK sentence tokenization.
- Real-Time Monitoring: Track upload progress with live progress bars and success notifications.
- Direct Supabase Integration: Vectors are stored directly in your `pgvector`-enabled Supabase database for immediate use in RAG applications.
The system is built on a modular Python stack optimized for simplicity and extensibility:
- Streamlit: Provides an interactive, multi-page frontend for file uploads, database monitoring, and configuration.
- PyMuPDF & pdfplumber: Handle PDF text and table extraction with high fidelity.
- python-docx: Parses Word documents to extract paragraph text.
- Cohere / OpenAI SDKs: Generate semantic embeddings via API calls.
- Supabase Python Client: Manages authentication and vector storage operations against a PostgreSQL database with the `pgvector` extension.
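As an illustration of how these parsers fit together, the sketch below routes an uploaded file to the right extractor by extension. Function and variable names here are illustrative, not the app's actual API; the PDF and DOCX branches assume PyMuPDF and python-docx are installed.

```python
def extract_text(filename: str, data: bytes) -> str:
    """Return plain text from an uploaded file's raw bytes.

    A sketch of type-based routing; the real app also pulls PDF tables
    via pdfplumber, which is omitted here for brevity.
    """
    name = filename.lower()
    if name.endswith(".pdf"):
        import fitz  # PyMuPDF
        with fitz.open(stream=data, filetype="pdf") as pdf:
            return "\n".join(page.get_text() for page in pdf)
    if name.endswith(".docx"):
        import io
        from docx import Document
        doc = Document(io.BytesIO(data))
        return "\n".join(p.text for p in doc.paragraphs)
    if name.endswith(".csv"):
        # Spreadsheet text is ingested as-is.
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"Unsupported file type: {filename}")
```

Keeping the parser imports inside each branch means a missing optional dependency only fails for the file types that need it.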
File-to-Vector follows a straightforward pipeline from document ingestion to vector storage:
Users upload files through the Streamlit interface. The application detects the file type and routes it to the appropriate text extraction module.
Text is extracted from the uploaded document using specialized parsers. The content is then split into chunks (default: 300 characters) using sentence-aware tokenization to preserve semantic coherence.
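In practice the sentence list would come from `nltk.tokenize.sent_tokenize`; the packing step might then look roughly like this sketch (function name and exact policy are assumptions, not the app's actual code):

```python
def chunk_sentences(sentences, max_chars=300):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Sentences are never split, so a single sentence longer than
    max_chars becomes its own (oversized) chunk.
    """
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Packing by sentence rather than by a hard character cut is what preserves semantic coherence within each chunk.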
Each text chunk is sent to the selected embedding provider (Cohere or OpenAI). The resulting vectors are padded or truncated to match the expected database dimension.
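The pad-or-truncate step can be expressed as a small helper like the one below (the name `fit_dimension` is illustrative; the embedding API call itself is omitted since it depends on the selected provider):

```python
def fit_dimension(vector, target_dim):
    """Pad with zeros or truncate so the vector matches the
    pgvector column dimension configured in the Supabase table."""
    if len(vector) >= target_dim:
        return vector[:target_dim]
    return vector + [0.0] * (target_dim - len(vector))
```

Zero-padding keeps shorter vectors insertable, at the cost of slightly distorting similarity scores, so matching the table dimension to the model's native output is still preferable.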
The embeddings, along with the original text and metadata, are inserted into the configured Supabase table. Progress is displayed in real-time, and users receive confirmation upon completion.
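A minimal sketch of the storage step, assuming the supabase-py client and illustrative column names (`content`, `embedding`, `metadata`):

```python
def build_rows(chunks, embeddings, source_file):
    """Pair each text chunk with its vector and provenance metadata.
    Column names are assumptions, not the app's actual schema."""
    return [
        {"content": text, "embedding": vec, "metadata": {"source": source_file}}
        for text, vec in zip(chunks, embeddings)
    ]

def store_chunks(url, service_key, table, rows):
    """Insert rows into a pgvector-backed Supabase table."""
    from supabase import create_client  # supabase-py
    client = create_client(url, service_key)
    client.table(table).insert(rows).execute()
```

Batching all rows into a single insert keeps the round-trip count low; very large documents may still warrant splitting the insert into smaller batches.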
# Clone the repository
git clone https://github.com/jackvandervall/file-to-vector.git
# Install dependencies (a virtual environment is recommended)
pip install -r requirements.txt
# Navigate to the application directory
cd app
# Configure Environment Variables
# Create a .env file or use the Upload tab to enter your Supabase URL, service_role key, and embedding API keys
# Launch the Application
streamlit run main.py

Developed by Jack van der Vall during an internship at the Erasmus Centre for Data Analytics.
