This repository contains the official implementation of our EMNLP 2025 paper: "Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG".
RAGvis is a novel, two-stage Retrieval-Augmented Generation (RAG) framework designed to automate Exploratory Data Analysis (EDA), i.e., data visualization.
RAGvis operates in two primary stages:
- Offline Knowledge Graph Semantic Enrichment: A knowledge graph is built from a large collection of EDA notebooks and enriched with structured EDA semantics. This process is guided by an LLM using an empirically developed taxonomy of EDA operations.
- Online EDA Notebook Generation: For a new, unseen dataset, RAGvis performs the following steps:
  - Retrieves relevant EDA operations from the knowledge graph.
  - Aligns these retrieved operations with the structure of the new dataset.
  - Refines the aligned operations through LLM reasoning.
  - Generates and verifies executable Python code using a self-correcting LLM coding agent.
RAGvis provides a simple API for generating code and data visualizations:

```python
from ragvis import RAGvis

ragvis = RAGvis(llm_model='gemini-2.5-pro')

eda_ops = ragvis.retrieve_and_refine_eda_ops(dataset_path='dataset.csv',
                                             num_eda_ops=10,
                                             perform_refinement=True)

code_snippets = ragvis.generate_eda_code(eda_ops)

executed_code_snippets, charts = ragvis.execute_and_fix_eda_code(code_snippets,
                                                                 charts_export_path='ragvis_charts')
```

Note that RAGvis executes LLM-generated source code on the local machine. Please make sure it is run in a safe and secure environment.
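Once the pipeline has run, the exported charts can be inspected on disk. A minimal sketch, assuming the charts are written as individual files under the `charts_export_path` directory used above:

```python
from pathlib import Path

# Directory passed as charts_export_path above (assumed layout:
# one exported chart file per generated visualization).
charts_dir = Path('ragvis_charts')

if charts_dir.is_dir():
    for chart in sorted(charts_dir.iterdir()):
        print(chart.name)
else:
    print(f'{charts_dir} not found -- run the pipeline first.')
```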
You can change system configuration options, such as the graph and embedding store endpoints, in `ragvis_config.py`.
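For illustration, such overrides might look like the fragment below; the option names here are hypothetical, so check `ragvis_config.py` for the actual ones:

```python
# Hypothetical configuration overrides -- the real option names
# and defaults live in ragvis_config.py.
GRAPH_STORE_ENDPOINT = "bolt://localhost:7687"      # hypothetical graph store endpoint
EMBEDDING_STORE_ENDPOINT = "http://localhost:6333"  # hypothetical embedding store endpoint
```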
We use uv to manage the project dependencies.

- Follow the steps HERE to install `uv` and Python.
- In the main project directory, run `uv sync` to install the required dependencies.
- Make sure your LLM API key is set as an environment variable:
  - For Gemini: `GEMINI_API_KEY`
  - For ChatGPT: `OPENAI_API_KEY`
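For example, in a bash-like shell the key can be exported for the current session (replace the placeholder with your actual key):

```shell
# Set the Gemini API key for the current shell session.
export GEMINI_API_KEY="your-api-key-here"

# Or, when using ChatGPT models:
# export OPENAI_API_KEY="your-api-key-here"
```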
We use Docker containers to set up and populate both stores. Install Docker and simply run:

```shell
cd docker && sudo docker compose up -d --build
```

This will create and run two containers for the embedding and graph stores. If you prefer to manually set up the stores, refer to the instructions HERE.
Run the inference pipeline on the Titanic dataset via `example.py`:

```shell
source .venv/bin/activate
python example.py
```

Please use the following to cite our work:
```bibtex
@inproceedings{ragvis,
    title = "Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG",
    author = "Helali, Mossad and
      Luo, Yutai and
      Ham, Tae Jun and
      Plotts, Jim and
      Chaugule, Ashwin and
      Chang, Jichuan and
      Ranganathan, Parthasarathy and
      Mansour, Essam",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    url = "https://aclanthology.org/2025.emnlp-main.836/",
    doi = "10.18653/v1/2025.emnlp-main.836",
    pages = "16547--16564",
    ISBN = "979-8-89176-332-6"
}
```
We encourage contributions and bug fixes; please don't hesitate to open a PR or create an issue if you encounter any bugs. Kindly refer to our Contribution Guidelines.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
