RAGvis: Reliable and Cost-Effective Exploratory Data Analysis via Retrieval-Augmented Generation

This repository contains the official implementation of our EMNLP 2025 paper: "Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG".

RAGvis is a two-stage Retrieval-Augmented Generation (RAG) framework that automates Exploratory Data Analysis (EDA), specifically data visualization.

RAGvis Framework Diagram


How It Works

RAGvis operates in two primary stages:

  1. Offline Knowledge Graph Semantic Enrichment: A knowledge graph is built from a large collection of EDA notebooks and enriched with structured EDA semantics. An LLM guides this process using an empirically developed taxonomy of EDA operations.

  2. Online EDA Notebook Generation: For a new, unseen dataset, RAGvis performs the following steps:

    • Retrieves relevant EDA operations from the knowledge graph.
    • Aligns these retrieved operations with the structure of the new dataset.
    • Refines the aligned operations through LLM reasoning.
    • Generates and verifies executable Python code using a self-correcting LLM coding agent.
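The retrieve-and-align steps above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `EDA_OPS` entries, the two-dimensional embeddings, and the `retrieve` and `align` helpers are hypothetical and do not reflect RAGvis's actual knowledge graph schema or code.

```python
import math

# Hypothetical stand-in for knowledge graph entries: each EDA operation has
# a toy embedding and the column type it requires. Not RAGvis's real schema.
EDA_OPS = [
    {"name": "histogram", "needs": "numeric", "vec": [1.0, 0.0]},
    {"name": "bar_count", "needs": "categorical", "vec": [0.0, 1.0]},
    {"name": "box_plot", "needs": "numeric", "vec": [0.8, 0.6]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, k):
    """Return the k operations whose embeddings are closest to the query."""
    ranked = sorted(EDA_OPS, key=lambda op: cosine(query_vec, op["vec"]), reverse=True)
    return ranked[:k]


def align(ops, column_types):
    """Pair each operation with the dataset columns of the type it requires."""
    return [
        (op["name"], col)
        for op in ops
        for col, col_type in column_types.items()
        if col_type == op["needs"]
    ]


# Toy description of a new dataset (column name -> column type).
columns = {"Age": "numeric", "Sex": "categorical"}
top_ops = retrieve(query_vec=[1.0, 0.1], k=2)
plan = align(top_ops, columns)
print(plan)
```

In this toy setup, refinement and code generation would then operate on `plan`, the list of (operation, column) pairs.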

How to Use

RAGvis provides a simple API for generating code and data visualizations:

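from ragvis import RAGvis

# Initialize RAGvis with the LLM to use.
ragvis = RAGvis(llm_model='gemini-2.5-pro')

# Retrieve EDA operations for the dataset and optionally refine them.
eda_ops = ragvis.retrieve_and_refine_eda_ops(dataset_path='dataset.csv',
                                             num_eda_ops=10,
                                             perform_refinement=True)

# Generate code for the selected EDA operations.
code_snippets = ragvis.generate_eda_code(eda_ops)

# Execute the generated code, fixing it where needed, and export the charts.
executed_code_snippets, charts = ragvis.execute_and_fix_eda_code(code_snippets,
                                                                 charts_export_path='ragvis_charts')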
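The name `execute_and_fix_eda_code` suggests an execute-then-repair loop. The following is a generic sketch of that pattern, not RAGvis's implementation: `execute_with_fixes` and `toy_fixer` are hypothetical names, and the fixer is a plain string replacement standing in for an LLM call.

```python
import traceback


def execute_with_fixes(code, fix_fn, max_attempts=3):
    """Run code; on failure, ask fix_fn(code, error) for a revised version.

    Note: exec runs the code in the current process, so this sketch is
    only suitable for code you trust.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            exec(compile(code, "<snippet>", "exec"), {})
            return code, attempt
        except Exception:
            # Hand the full traceback to the fixer so it can revise the code.
            error = traceback.format_exc()
            code = fix_fn(code, error)
    raise RuntimeError(f"snippet still failing after {max_attempts} attempts")


def toy_fixer(code, error):
    """Stand-in for an LLM call: repairs one known typo."""
    return code.replace("pritn", "print")


fixed_code, attempts = execute_with_fixes("pritn('ok')", toy_fixer)
print(attempts, fixed_code)
```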

⚠️ Warning, Code Execution!

RAGvis executes LLM-generated source code on the local machine. Make sure to run it in a safe, isolated environment.
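As a general illustration of process-level isolation, the sketch below runs a snippet in a separate Python process, inside a throwaway working directory, with a timeout. This is not how RAGvis executes code, and `run_snippet` is a hypothetical helper. It is also not a sandbox: the child process can still reach the network and the rest of the filesystem, so a container or virtual machine offers stronger protection.

```python
import subprocess
import sys
import tempfile


def run_snippet(code: str, timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Run code in a separate Python process inside a temporary directory."""
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            # -I starts Python in isolated mode (ignores PYTHON* environment
            # variables and the user site-packages directory).
            [sys.executable, "-I", "-c", code],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # raises subprocess.TimeoutExpired if exceeded
        )


result = run_snippet("print(1 + 1)")
print(result.returncode, result.stdout.strip())
```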

Configurations

System settings, such as the graph and embedding store endpoints, can be changed in ragvis_config.py.


Project Setup:

1. Setting up the code repository:

We use uv to handle the project dependencies.

  • Follow the steps HERE to install uv and Python.
  • In the main project directory, run uv sync to install the required dependencies.
  • Make sure your LLM API key is set as an environment variable:
    • For Gemini: GEMINI_API_KEY
    • For ChatGPT: OPENAI_API_KEY
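Before running the pipeline, it can help to confirm that the key is visible to Python. The check below is a minimal sketch; `missing_api_keys` is a hypothetical helper and not part of RAGvis.

```python
import os


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Use "OPENAI_API_KEY" instead when working with ChatGPT models.
missing = missing_api_keys(["GEMINI_API_KEY"])
if missing:
    print("Set these variables before running RAGvis:", ", ".join(missing))
```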

2. Setting up the graph and embedding stores:

We use Docker containers to set up and populate both stores. Install Docker and simply run:

cd docker && sudo docker compose up -d --build

This will create and run two containers for the embedding and graph stores. If you prefer to manually set up the stores, refer to the instructions HERE.

3. Running the Hello World example:

Run the inference pipeline on the Titanic dataset via example.py:

source .venv/bin/activate
python example.py

Citing Our Work

Please use the following to cite our work:

@inproceedings{ragvis,
    title = "Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG",
    author = "Helali, Mossad  and
      Luo, Yutai  and
      Ham, Tae Jun  and
      Plotts, Jim  and
      Chaugule, Ashwin  and
      Chang, Jichuan  and
      Ranganathan, Parthasarathy  and
      Mansour, Essam",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    url = "https://aclanthology.org/2025.emnlp-main.836/",
    doi = "10.18653/v1/2025.emnlp-main.836",
    pages = "16547--16564",
    ISBN = "979-8-89176-332-6"
}

Contributions

We welcome contributions and bug fixes. If you run into a bug, please open a PR or create an issue. Kindly refer to our Contribution Guidelines.


This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
