srupat/RAG-Structured-Data

RAG-Structured-Data

A Retrieval-Augmented Generation (RAG) stack that gives users a conversational interface over structured datasets such as Excel files, CSVs, or JSON files, without summarizing the original data or losing its granularity.

Features

  • Ingests several structured data formats (Excel, CSV, JSON)

  • Supports queries that require:

    • Row-specific retrieval
    • Column-based filtering
    • Numeric lookups
    • Range-based queries
  • Includes pre- and post-optimization steps to ensure that retrieval preserves exact data values and context

  • Handles numeric precision and unit consistency

  • Avoids model hallucination

  • Uses FAISS for indexing and semantic search retrieval

  • Compares several indexing strategies and selects the best one using metrics such as Precision@1 and MRR@k

  • Hybrid reranker using Cosine similarity + Keyword overlap

  • Decides whether a user query needs semantic search or symbolic filtering (symbolic filters run in pandas) for improved performance

  • Provides provenance for each answer so you can verify that the response comes from your own data

  • Uses interpolation to answer range based queries

  • Converts the query into a deterministic plan so that the LLM cannot fabricate any values

  • LLM based conversational interface

  • Maintains the conversation history

  • Streamlit UI for easy uploading and overall UX


Architecture

Rough Architecture


Demo Video

RAG Structured Data demo

Pipeline Architecture

Ingestion and Canonicalization

  • Normalizes column names

  • Extracts units from headers and keeps a column-to-unit map for later indexing

  • Enforces dtypes

  • Adds a stable __row_id to enable provenance

  • Persists a parquet copy for speed and a schema.json with column metadata and units
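
The header handling above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the header pattern (a unit in trailing parentheses or brackets) and the function names `canonicalize_header` and `canonicalize_rows` are assumptions, and the dtype enforcement and parquet/`schema.json` persistence steps are left out.

```python
import re
from typing import Optional, Tuple

# Assumed header pattern: a name followed by a unit in parentheses or brackets,
# e.g. "Temperature (°C)" or "Flow Rate [L/min]".
_UNIT_RE = re.compile(r"^(?P<name>.*?)\s*[\(\[](?P<unit>[^\)\]]+)[\)\]]\s*$")


def canonicalize_header(header: str) -> Tuple[str, Optional[str]]:
    """Return (normalized_name, unit) for one raw column header."""
    match = _UNIT_RE.match(header.strip())
    if match:
        name, unit = match["name"], match["unit"].strip()
    else:
        name, unit = header, None
    # Lowercase, then collapse every run of other characters into one underscore.
    normalized = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    return normalized, unit


def canonicalize_rows(headers, rows):
    """Rename columns, build a unit map, and attach a stable __row_id."""
    parsed = [canonicalize_header(h) for h in headers]
    names = [name for name, _ in parsed]
    unit_map = {name: unit for name, unit in parsed if unit is not None}
    records = [
        {"__row_id": i, **dict(zip(names, row))}
        for i, row in enumerate(rows)
    ]
    return records, unit_map


records, unit_map = canonicalize_rows(
    ["Sample ID", "Temperature (°C)", "Flow Rate [L/min]"],
    [["A1", 21.5, 3.2], ["A2", 22.0, 3.4]],
)
```

Keeping the units in a separate map means the cell values stay purely numeric for filtering, while the unit can still be attached again when rows are indexed or displayed.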

Indexing

  • Convert rows to compact text representations

  • Embed rows with SentenceTransformers

  • Build a FAISS index and save it with a stable mapping to DataFrame row ids

  • Reload the index and run top-k semantic search
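
The sketch below illustrates the row-to-text step and the mapping from index position to `__row_id`. To stay runnable without extra dependencies, a toy bag-of-words vector and a brute-force cosine search stand in for the SentenceTransformers embeddings and the FAISS index. The serialization format and all function names are assumptions.

```python
import math
from collections import Counter


def row_to_text(row: dict) -> str:
    """Serialize a row as 'column: value' pairs, skipping the internal id."""
    return " | ".join(f"{k}: {v}" for k, v in row.items() if k != "__row_id")


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a sentence encoder)."""
    return Counter(text.lower().replace("|", " ").replace(":", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(rows):
    """Return (vectors, row_ids); position i in both lists refers to the same row."""
    vectors = [embed(row_to_text(r)) for r in rows]
    row_ids = [r["__row_id"] for r in rows]
    return vectors, row_ids


def search(query, vectors, row_ids, k=2):
    """Return the top-k (row_id, score) pairs by cosine similarity."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, v), rid) for v, rid in zip(vectors, row_ids)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(rid, score) for score, rid in scored[:k]]


rows = [
    {"__row_id": 10, "material": "steel", "grade": "304", "density": 8.0},
    {"__row_id": 11, "material": "aluminium", "grade": "6061", "density": 2.7},
    {"__row_id": 12, "material": "copper", "grade": "C110", "density": 8.96},
]
vectors, row_ids = build_index(rows)
hits = search("density of copper", vectors, row_ids, k=2)
```

The search returns row ids, not positions in the vector list, so a hit can always be traced back to the exact DataFrame row even after the index is saved and reloaded.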

Query Planner

  • Parse simple numeric filters directly using regex

  • Use the OpenAI API to produce a JSON plan that describes what the retriever should do:

    • filter
    • range
    • aggregate
    • plot
    • semantic search
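
The regex path for simple numeric filters might look like the sketch below. The plan keys (`action`, `column`, `op`, `value`), the operator vocabulary, and the function name `plan_query` are assumptions. Where this sketch falls back to a semantic search plan, the real planner would hand the query to the LLM instead.

```python
import re
from typing import List

# Comparison phrases and symbols mapped to operators (illustrative vocabulary).
_OPS = {
    ">=": ">=", "<=": "<=", ">": ">", "<": "<", "=": "==",
    "at least": ">=", "at most": "<=",
    "greater than": ">", "more than": ">", "above": ">",
    "less than": "<", "below": "<",
    "equal to": "==", "is": "==",
}
# Longest alternatives first, so ">=" is tried before ">".
_OP_PATTERN = "|".join(re.escape(op) for op in sorted(_OPS, key=len, reverse=True))
_FILTER_RE = re.compile(
    rf"\b(?P<column>[a-z_][a-z0-9_]*)\s*(?P<op>{_OP_PATTERN})\s*(?P<value>-?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def plan_query(query: str, columns: List[str]) -> dict:
    """Return a filter plan if the query names a known column, an operator, and a number."""
    for match in _FILTER_RE.finditer(query):
        column = match["column"].lower()
        if column in columns:
            return {
                "action": "filter",
                "column": column,
                "op": _OPS[match["op"].lower()],
                "value": float(match["value"]),
            }
    # No deterministic filter found: defer to semantic search (or an LLM planner).
    return {"action": "semantic_search", "query": query}


columns = ["temperature", "pressure"]
plan_a = plan_query("rows where temperature above 30", columns)
plan_b = plan_query("pressure >= 2.5", columns)
plan_c = plan_query("tell me about steel", columns)
```

Because the plan is plain data, the executor can validate the column and operator against the schema before running anything.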

Retriever and Executor

  • Runs numeric filters precisely with pandas

  • Handles range queries with linear interpolation

  • Runs semantic search via FAISS index and returns exact rows with provenance

  • Provides an optional plot spec that can later be rendered
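
As a rough illustration of the interpolation step, the sketch below estimates a value between two bracketing rows and reports which rows it used. The return shape and the name `interpolate` are assumptions, and refusing to extrapolate outside the data range is a design choice of this sketch, not something the repository is confirmed to do.

```python
def interpolate(rows, x_col, y_col, x):
    """Estimate y at x by linear interpolation; return None if x is out of range."""
    ordered = sorted(rows, key=lambda r: r[x_col])

    # An exact hit is returned as stored, with no arithmetic applied.
    for row in ordered:
        if row[x_col] == x:
            return {
                "value": row[y_col],
                "row_ids": [row["__row_id"]],
                "interpolated": False,
            }

    # Otherwise find the two neighbouring rows that bracket x.
    for lo, hi in zip(ordered, ordered[1:]):
        x0, x1 = lo[x_col], hi[x_col]
        if x0 < x < x1:
            t = (x - x0) / (x1 - x0)
            y = lo[y_col] + t * (hi[y_col] - lo[y_col])
            return {
                "value": y,
                "row_ids": [lo["__row_id"], hi["__row_id"]],
                "interpolated": True,
            }

    return None  # x lies outside the data, so no value is fabricated


rows = [
    {"__row_id": 0, "temperature": 20.0, "pressure": 100.0},
    {"__row_id": 1, "temperature": 40.0, "pressure": 140.0},
    {"__row_id": 2, "temperature": 60.0, "pressure": 200.0},
]
estimate = interpolate(rows, "temperature", "pressure", 30.0)
exact = interpolate(rows, "temperature", "pressure", 40.0)
outside = interpolate(rows, "temperature", "pressure", 75.0)
```

The `interpolated` flag and the `row_ids` list let the answer state clearly whether a number was read directly from the data or derived from two neighbouring rows.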

Answerer and Guardrails

  • Takes the result produced by the Retriever and Executor module

  • Formats an answer that explains the retrieved result without using an LLM

  • Optionally rephrases the answer with an LLM that is hard-guardrailed to use only the provided rows and values

  • Refuses when there is no supporting data
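
The LLM-free formatting path and the refusal behaviour can be sketched as a plain template. The result shape (a dict with a `rows` list), the wording of the messages, and the name `format_answer` are assumptions made for this example.

```python
def format_answer(result, unit_map=None):
    """Render retrieved rows as text with provenance, or refuse if there are none."""
    unit_map = unit_map or {}
    rows = result.get("rows", [])
    if not rows:
        # Guardrail: with no supporting rows, nothing is stated about the data.
        return "I could not find any rows in the data that support an answer to this question."

    lines = []
    for row in rows:
        cells = []
        for column, value in row.items():
            if column == "__row_id":
                continue
            unit = unit_map.get(column)
            cells.append(f"{column} = {value} {unit}" if unit else f"{column} = {value}")
        lines.append(f"- {', '.join(cells)} (source: row {row['__row_id']})")
    return f"Found {len(rows)} matching row(s):\n" + "\n".join(lines)


answer = format_answer(
    {"rows": [{"__row_id": 0, "sample_id": "A1", "temperature": 21.5}]},
    unit_map={"temperature": "°C"},
)
refusal = format_answer({"rows": []})
```

Every value in the output is copied from a retrieved row, so an optional LLM rephraser only needs to reword this text and never has to supply numbers of its own.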

Retrieval optimizations & comparisons

  • Build multiple semantic indexes using the following strategies

    • Only rows
    • Rows + Headers
    • Cell Facts (Dense)
  • Hybrid reranker: cosine score + keyword overlap

  • Evaluates on a small eval.jsonl that contains sample queries with expected ground truths

  • Reports Precision@1, MRR@k, and median/mean latency per strategy
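
The hybrid reranker and the two ranking metrics can be sketched as follows. The weight `alpha`, the definition of keyword overlap (the fraction of query tokens found in the row text), and all function names are assumptions. The repository may combine the scores differently.

```python
def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query tokens that also appear in the row text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hybrid_rerank(query, candidates, alpha=0.7):
    """Rerank (row_id, cosine_score, text) candidates by a weighted score sum."""
    scored = [
        (rid, alpha * cos + (1 - alpha) * keyword_overlap(query, text))
        for rid, cos, text in candidates
    ]
    return [rid for rid, _ in sorted(scored, key=lambda p: p[1], reverse=True)]


def precision_at_1(ranked_lists, expected):
    """Share of queries whose top-ranked row is the expected one."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, expected)
        if ranked and ranked[0] == gold
    )
    return hits / len(expected)


def mrr_at_k(ranked_lists, expected, k):
    """Mean reciprocal rank of the expected row within the top k (0 if absent)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, expected):
        top = ranked[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(expected)


# Row 1 has the higher cosine score, but row 2 matches both query keywords.
reranked = hybrid_rerank(
    "copper grade",
    [(1, 0.80, "material steel grade 304"), (2, 0.78, "material copper grade c110")],
)

ranked_lists = [[2, 1], [5, 3, 4], [9, 8]]
expected = [2, 3, 7]
p1 = precision_at_1(ranked_lists, expected)
mrr = mrr_at_k(ranked_lists, expected, k=3)
```

In this toy evaluation the expected row is ranked first, second, and not at all across the three queries, so Precision@1 is 1/3 and MRR@3 is (1 + 1/2 + 0) / 3 = 0.5.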

About

End-to-end data processing + RAG pipeline specifically designed for structured data with a Streamlit UI
