A Retrieval-Augmented Generation (RAG) stack that gives users a conversational interface over structured datasets such as Excel files, CSVs, or JSON files, without summarizing or losing the granularity of the original data.
Features
- Ingests multiple types of data
- Supports queries that require:
  - Row-specific retrieval
  - Column-based filtering
  - Numeric lookups
  - Range-based queries
- Applies pre- and post-optimization steps to ensure that retrieval preserves exact data values and context
- Handles numeric precision and unit consistency
- Avoids model hallucination
- Uses FAISS for indexing and semantic search retrieval
- Evaluates several indexing strategies and picks the best one based on metrics such as Precision@1 and MRR@k
- Reranks results with a hybrid of cosine similarity and keyword overlap
- Decides whether a user query requires semantic or symbolic filters (symbolic filtering uses pandas) for improved performance
- Provides provenance for each query, so you can be sure that the response comes from your own data
- Uses interpolation to answer range-based queries
- Converts the query into a deterministic plan so that the LLM cannot fabricate any values
- LLM-based conversational interface
- Maintains the conversation history
- Streamlit UI for easy uploading and overall UX
- Normalizes column names
- Extracts units from headers and keeps a unit map for later indexing
- Enforces dtypes
- Adds a stable `__row_id` to enable provenance
- Persists a Parquet copy for speed and a schema.json with column metadata and units
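
The header handling above can be sketched in plain Python. This is only an illustration, not the project's actual code: the function names (`normalize_header`, `build_schema`, `add_row_ids`) are hypothetical, and it assumes units appear in trailing parentheses or brackets such as `Flow Rate (L/min)`. Dtype enforcement and the Parquet copy are left out because they depend on pandas.

```python
import re
from typing import Optional

# Assumption: units sit in trailing parentheses or brackets,
# e.g. "Flow Rate (L/min)" or "Temp [degC]".
_UNIT_RE = re.compile(r"\s*[\(\[]\s*([^\)\]]+?)\s*[\)\]]\s*$")


def normalize_header(header: str) -> tuple:
    """Return (snake_case_name, unit or None) for one raw column header."""
    unit: Optional[str] = None
    match = _UNIT_RE.search(header)
    if match:
        unit = match.group(1)
        header = header[: match.start()]
    # Lowercase, then collapse every run of non-alphanumerics into "_".
    name = re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")
    return name, unit


def build_schema(headers: list) -> dict:
    """Map normalized column names to their original header and unit."""
    schema = {}
    for raw in headers:
        name, unit = normalize_header(raw)
        schema[name] = {"original": raw, "unit": unit}
    return schema


def add_row_ids(rows: list) -> list:
    """Attach a stable __row_id based on the original row order."""
    return [{"__row_id": i, **row} for i, row in enumerate(rows)]
```

With these definitions, `normalize_header("Flow Rate (L/min)")` returns `("flow_rate", "L/min")`, and the dictionary from `build_schema` is the kind of structure that could be written to schema.json.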
- Converts rows to compact text representations
- Embeds rows with SentenceTransformers
- Builds a FAISS index and saves it with a stable mapping to DataFrame row ids
- Reloads the index and runs top-k semantic search
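
As a rough sketch of the first step, a row can be serialized into a short `column: value unit` string before embedding. The exact text format is an assumption, and `row_to_text` and `build_corpus` are hypothetical names. Embedding and FAISS indexing are omitted here; the point is that the list of row ids is built in the same order as the texts, so position `i` in the index can be mapped back to `row_ids[i]`.

```python
def row_to_text(row: dict, units: dict, id_key: str = "__row_id") -> str:
    """Serialize one row as 'col: value unit' pairs joined by ' | '."""
    parts = []
    for column, value in row.items():
        if column == id_key or value is None:
            continue  # the id lives in the index mapping, not in the text
        unit = units.get(column)
        parts.append(f"{column}: {value} {unit}" if unit else f"{column}: {value}")
    return " | ".join(parts)


def build_corpus(rows: list, units: dict) -> tuple:
    """Return parallel lists: texts to embed and the row ids they map to."""
    texts = [row_to_text(row, units) for row in rows]
    row_ids = [row["__row_id"] for row in rows]
    return texts, row_ids
```

For a row `{"__row_id": 7, "pressure": 2.5, "site": "A"}` with units `{"pressure": "bar"}`, the text is `pressure: 2.5 bar | site: A`.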
- Parses simple numeric filters directly using regex
- Uses the OpenAI API to produce a JSON plan that describes what the retriever should do:
  - filter
  - range
  - aggregate
  - plot
  - semantic search
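
To make the planning step concrete, here is a minimal sketch of the regex path and of a plan check. The filter grammar (`<column> <operator> <number>`), the plan keys (`action`, `column`, `op`, `value`), and the action name `semantic_search` are all assumptions made for illustration; the real plan schema may differ. The same validation could be applied to a plan returned by the LLM before it reaches the retriever.

```python
import re

# Assumed grammar: "<column> <operator> <number>", e.g. "pressure >= 2.5".
_FILTER_RE = re.compile(
    r"(?P<column>[A-Za-z_][A-Za-z0-9_]*)\s*"
    r"(?P<op>>=|<=|!=|==|=|>|<)\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)"
)

# Hypothetical action names mirroring the plan types listed above.
ALLOWED_ACTIONS = {"filter", "range", "aggregate", "plot", "semantic_search"}


def parse_numeric_filter(query: str, columns: set):
    """Return a filter plan if the query names a known column, else None."""
    for match in _FILTER_RE.finditer(query):
        column = match.group("column").lower()
        if column in columns:
            op = match.group("op")
            return {
                "action": "filter",
                "column": column,
                "op": "==" if op == "=" else op,
                "value": float(match.group("value")),
            }
    return None


def validate_plan(plan: dict, columns: set) -> dict:
    """Reject plans with unknown actions or columns before execution."""
    if plan.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {plan.get('action')!r}")
    column = plan.get("column")
    if column is not None and column not in columns:
        raise ValueError(f"unknown column: {column!r}")
    return plan
```

A query that does not match the pattern returns `None`, which is the signal to fall back to the LLM-generated plan.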
- Runs numeric filters precisely with pandas
- Handles range queries with linear interpolation
- Runs semantic search via the FAISS index and returns exact rows with provenance
- Provides an optional plot spec that can be rendered later
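
The interpolation step can be illustrated as follows. This sketch assumes a range query asks for a value at a point that falls between two stored rows, and that each data point is passed as an `(x, y, row_id)` tuple; both are assumptions, as is the choice to return `None` instead of extrapolating. The row ids of the supporting rows are returned alongside the estimate so the answer can still cite its sources.

```python
from bisect import bisect_left


def interpolate(points: list, x: float):
    """Linearly interpolate y at x from (x, y, row_id) tuples.

    Returns (y, [supporting row ids]), or None when x lies outside the data.
    """
    pts = sorted(points, key=lambda p: p[0])
    if not pts or x < pts[0][0] or x > pts[-1][0]:
        return None  # no extrapolation beyond the observed rows
    xs = [p[0] for p in pts]
    i = bisect_left(xs, x)
    x1, y1, id1 = pts[i]
    if x1 == x:
        return y1, [id1]  # an exact row exists, nothing to estimate
    x0, y0, id0 = pts[i - 1]
    y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return y, [id0, id1]
```

For example, `interpolate([(10, 100.0, 3), (20, 200.0, 4)], 15)` returns `(150.0, [3, 4])`.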
- Retrieves the result from the earlier module
- Formats an answer that explains the retrieved result without using an LLM
- Optionally answers through an LLM rephraser that is hard-guardrailed to use only the provided rows/values
- Refuses to answer when there is no supporting data
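
One possible shape for the answer step is sketched below. `format_answer` builds the reply directly from retrieved rows and refuses when there are none. `rephrase_is_grounded` is a hypothetical guardrail check, not necessarily the one used here: it accepts an LLM rephrasing only if every number in it matches a retrieved value or row id. The row layout (dicts with a `__row_id` key) and the refusal wording are assumptions.

```python
import re


def format_answer(rows: list, value_column: str, unit: str = None) -> str:
    """Build an answer from retrieved rows only; refuse if there are none."""
    if not rows:
        return "No supporting data was found for this query."
    lines = []
    for row in rows:
        value = row[value_column]
        text = f"{value_column} = {value}"
        if unit:
            text += f" {unit}"
        lines.append(f"{text} (row {row['__row_id']})")
    return "\n".join(lines)


def rephrase_is_grounded(rephrased: str, rows: list, value_column: str) -> bool:
    """Check that every number in the rephrased text comes from the rows."""
    allowed = {float(row[value_column]) for row in rows}
    allowed |= {float(row["__row_id"]) for row in rows}
    numbers = re.findall(r"-?\d+(?:\.\d+)?", rephrased)
    return all(float(n) in allowed for n in numbers)
```

If the check fails, the deterministic text from `format_answer` can be shown instead of the rephrased version.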
- Builds multiple semantic indexes using the following strategies:
  - Only rows
  - Rows + headers
  - Cell facts (dense)
- Hybrid reranker: cosine score + keyword overlap
- Evaluates on a small eval.jsonl that contains sample queries with expected ground truths
- Reports Precision@1, MRR@k, and median/mean latency per strategy
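
The reranker and the two ranking metrics can be sketched in a few lines. The weight `alpha`, the whitespace-and-punctuation tokenizer, and the candidate layout `(row_id, text, cosine)` are assumptions for illustration. The layout of eval.jsonl is not shown above, so the metric functions simply take one ranked list of row ids per query together with the expected row id for that query.

```python
import re


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query tokens that also occur in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def hybrid_rerank(query: str, candidates: list, alpha: float = 0.7) -> list:
    """Order (row_id, text, cosine) candidates by a weighted hybrid score.

    alpha is an assumed weight on cosine; 1 - alpha goes to keyword overlap.
    """
    scored = [
        (alpha * cosine + (1 - alpha) * keyword_overlap(query, text), row_id)
        for row_id, text, cosine in candidates
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [row_id for _, row_id in scored]


def precision_at_1(ranked: list, expected: list) -> float:
    """Share of queries whose top-ranked row id is the expected one."""
    hits = sum(1 for r, e in zip(ranked, expected) if r and r[0] == e)
    return hits / len(expected)


def mrr_at_k(ranked: list, expected: list, k: int = 5) -> float:
    """Mean reciprocal rank of the expected row id within the top k."""
    total = 0.0
    for r, e in zip(ranked, expected):
        top = r[:k]
        if e in top:
            total += 1.0 / (top.index(e) + 1)
    return total / len(expected)
```

Running each indexing strategy through the same functions gives directly comparable Precision@1 and MRR@k numbers; latency would be measured separately around the search call.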

