This repository contains the official implementation of our EMNLP 2025 paper: "Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG".
RAGvis is a novel, two-stage Retrieval-Augmented Generation (RAG) framework designed to automate Exploratory Data Analysis (EDA), i.e., data visualization.
RAGvis operates in two primary stages:
- Offline Knowledge Graph Semantic Enrichment: A knowledge graph is built from a large collection of EDA notebooks and enriched with structured EDA semantics. This process is guided by an LLM using an empirically developed taxonomy of EDA operations.
- Online EDA Notebook Generation: For a new, unseen dataset, RAGvis performs the following steps:
  - Retrieves relevant EDA operations from the knowledge graph.
  - Aligns these retrieved operations with the structure of the new dataset.
  - Refines the aligned operations through LLM reasoning.
  - Generates and verifies executable Python code using a self-correcting LLM coding agent.
RAGvis provides a simple API for generating code and data visualizations:

```python
from ragvis import RAGvis

ragvis = RAGvis(llm_model='gemini-2.5-pro')

eda_ops = ragvis.retrieve_and_refine_eda_ops(dataset_path='dataset.csv',
                                             num_eda_ops=10,
                                             perform_refinement=True)

code_snippets = ragvis.generate_eda_code(eda_ops)

executed_code_snippets, charts = ragvis.execute_and_fix_eda_code(code_snippets,
                                                                 charts_export_path='ragvis_charts')
```

Note that RAGvis executes LLM-generated source code on the local machine. Please make sure it is run in a safe and secure environment.
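Once the pipeline has run, the exported charts can be inspected on disk. A minimal sketch, assuming the charts are written as individual files under the `charts_export_path` directory used above:

```python
from pathlib import Path

# Directory passed as charts_export_path above (assumed layout:
# one exported chart file per generated visualization).
charts_dir = Path('ragvis_charts')

if charts_dir.is_dir():
    for chart in sorted(charts_dir.iterdir()):
        print(chart.name)
else:
    print(f'{charts_dir} not found -- run the pipeline first.')
```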
You can change system configuration options, such as the graph and embedding store endpoints, in `ragvis_config.py`.
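For illustration, such overrides might look like the fragment below; the option names here are hypothetical, so check `ragvis_config.py` for the actual ones:

```python
# Hypothetical configuration overrides -- the real option names
# and defaults live in ragvis_config.py.
GRAPH_STORE_ENDPOINT = "bolt://localhost:7687"      # hypothetical graph store endpoint
EMBEDDING_STORE_ENDPOINT = "http://localhost:6333"  # hypothetical embedding store endpoint
```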
We use uv to manage the project dependencies.

- Follow the steps HERE to install `uv` and Python.
- In the main project directory, run `uv sync` to install the required dependencies.
- Make sure your LLM API key is set as an environment variable:
  - For Gemini: `GEMINI_API_KEY`
  - For ChatGPT: `OPENAI_API_KEY`
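For example, in a bash-like shell the key can be exported for the current session (replace the placeholder with your actual key):

```shell
# Set the Gemini API key for the current shell session.
export GEMINI_API_KEY="your-api-key-here"

# Or, when using ChatGPT models:
# export OPENAI_API_KEY="your-api-key-here"
```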
We use Docker containers to set up and populate both stores. Install Docker and simply run:

```shell
cd docker && sudo docker compose up -d --build
```

This will create and run two containers for the embedding and graph stores. If you prefer to manually set up the stores, refer to the instructions HERE.
Run the inference pipeline on the Titanic dataset via `example.py`:

```shell
source .venv/bin/activate
python example.py
```

Please use the following to cite our work:
```bibtex
@inproceedings{ragvis,
    title = "Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG",
    author = "Helali, Mossad and
      Luo, Yutai and
      Ham, Tae Jun and
      Plotts, Jim and
      Chaugule, Ashwin and
      Chang, Jichuan and
      Ranganathan, Parthasarathy and
      Mansour, Essam",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    url = "https://aclanthology.org/2025.emnlp-main.836/",
    doi = "10.18653/v1/2025.emnlp-main.836",
    pages = "16547--16564",
    ISBN = "979-8-89176-332-6"
}
```
We encourage contributions and bug fixes; please don't hesitate to open a PR or create an issue if you encounter any bugs. Kindly refer to our Contribution Guidelines.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
