BIRD-INTERACT 1.0

⚠️ Announcement

Before you run the evaluation, note that errors can occasionally occur when Docker loads the databases because of environment inconsistencies. These errors do not terminate the process, but they do appear in the Docker logs. As a result, some databases may fail to load and end up empty, which makes the evaluation results abnormally low.
👉 We therefore strongly recommend checking the Docker logs for errors and verifying that all databases have loaded successfully before running the evaluation.
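
One way to do this check is to scan the container logs for error markers before starting the evaluation. The sketch below is a minimal example under stated assumptions: the marker strings are the standard PostgreSQL severity labels, and the container name is a parameter you must supply yourself (it is not taken from this repository).

```python
import subprocess

# PostgreSQL severity labels that usually indicate a failed statement or startup.
ERROR_MARKERS = ("ERROR:", "FATAL:", "PANIC:")


def find_error_lines(log_text: str) -> list[str]:
    """Return the log lines that contain any of the error markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in ERROR_MARKERS)
    ]


def read_container_logs(container: str) -> str:
    """Collect stdout and stderr of `docker logs <container>` as one string."""
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr


def report_load_errors(container: str) -> bool:
    """Print suspicious log lines; return True when the logs look clean."""
    errors = find_error_lines(read_container_logs(container))
    for line in errors:
        print(line)
    return not errors
```

Calling `report_load_errors("<your-postgres-container>")` prints any matching lines. A clean log does not prove that every database is populated, so it is still worth confirming that the tables exist.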

📰 News

[2025-08-26] 🚀 We're excited to announce the release of the BIRD-Interact-Full (600) set!
It's a tough one: the best LLMs achieve only a 16.33% success rate, with just 10.0% on the c-interact and a-interact portions.
👉 For more details, please visit our project website.
[2025-08-26] 📬 We'll be sending the Ground Truth & Test cases to our mailing list this week.
If you want early access, please send an email as instructed on the site for an automatic download.
[2025-08-26] 💾 On another note, we've also released a SQLite version of LiveSQLBench-Lite for easier local research.
The full LiveSQLBench-Base and -Large versions are coming soon!
[2025-08-22] Bug Fix: In the Bird-Interact-Agent code, we fixed a bug where the stored phase-1 SQL could not be executed successfully when evaluating phase-2 SQL, which lowered the Phase-2 success rate. The bug only affects tasks whose phase-1 SQL modifies the database, e.g. CREATE TABLE.

🧸 Overview

BIRD-INTERACT is an interactive text-to-SQL benchmark that re-imagines Text-to-SQL evaluation through the lens of dynamic interactions. The environment blends a hierarchical knowledge base, database documentation, and a function-driven user simulator to recreate authentic enterprise environments across full CRUD operations. It offers two rigorous test modes: (1) passive Conversational Interaction and (2) active Agentic Interaction, spanning 600 annotated tasks that include Business Intelligence (BI) queries, CRUD operations, and more, each guarded by executable test cases. Typical evaluations trigger 1,968-5,496 interaction turns between the model and the user simulator, while state-of-the-art reasoning models currently solve only ≈24% and ≈18% of tasks, underscoring the benchmark's challenge.

✅ Two Evaluation Modes

BIRD-INTERACT supports two evaluation modes as mentioned above:

c-Interact: Conversational Interaction, a passive mode in which the workflow is fixed. The code and detailed information can be found in bird_interact_conv.
a-Interact: Agentic Interaction, an embodied active mode in which the workflow is dynamic and led by the model. The code and detailed information can be found in bird_interact_agent.

🐣 Lite Version

We are releasing a lite version of BIRD-INTERACT, bird-interact-lite-exp, which includes 270 high-quality real-world tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.

🦜 Full Version

The full version of BIRD-INTERACT, bird-interact-full, is a comprehensive benchmark that includes 600 tasks for PostgreSQL. It covers a wide range of SQL operations and user queries. The full version is coming soon.

Model Performance Results on BIRD-INTERACT Lite

1. c-Interact Performance

Rank	Model Name	Normalized Reward	Level
1	o3-mini	33.04	🏆 Excellent Chat
2	GPT-4o	30.33	💎 Good Chat
3	Gemini-2.0-flash	27.41	💎 Good Chat
4	Claude-3.7-sonnet	26.60	✨ Standard
5	DeepSeek-R1	21.74	✨ Standard
6	Qwen3	20.33	⚪ Basic
7	DeepSeek-V3	15.85	⚪ Basic

2. a-Interact Performance

Rank	Model Name	Budget Parameters*	Avg Steps/Task	Avg Cost (USD)/Task	Normalized Reward	Level
1	Claude-3.7-sonnet	6/6	15.4	$0.6668	29.19	🏆 Excellent Interaction
2	o3-mini	6/6	7.8	$0.0754	21.07	💎 Good Interaction
3	DeepSeek-V3	6/6	15.6	$0.0629	19.19	💎 Good Interaction
4	Qwen3	6/6	12.5	$0.0278	18.74	✨ Standard
5	GPT-4o	6/6	15.3	$0.4594	18.37	✨ Standard
6	Gemini-2.0-flash	6/6	13.2	$0.0337	17.26	⚪ Basic
7	DeepSeek-R1	6/6	12.0	$0.0931	17.07	⚪ Basic

* Budget Parameters: Starting Budget/User Patience Budget, measured in our virtual currency, bird-coins. Refer to bird_interact_agent/README.md for more details.
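
To read the reward and cost columns together, one option is to divide the normalized reward by the average cost per task. This ratio is only an illustrative calculation, not a metric defined by the benchmark. The numbers are copied from the a-Interact table above, and the function name is hypothetical.

```python
# (model, average cost in USD per task, normalized reward) from the a-Interact table.
A_INTERACT = [
    ("Claude-3.7-sonnet", 0.6668, 29.19),
    ("o3-mini", 0.0754, 21.07),
    ("DeepSeek-V3", 0.0629, 19.19),
    ("Qwen3", 0.0278, 18.74),
    ("GPT-4o", 0.4594, 18.37),
    ("Gemini-2.0-flash", 0.0337, 17.26),
    ("DeepSeek-R1", 0.0931, 17.07),
]


def reward_per_dollar(rows):
    """Return (model, reward / cost) pairs sorted from highest to lowest ratio."""
    scored = [(name, reward / cost) for name, cost, reward in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for name, ratio in reward_per_dollar(A_INTERACT):
    print(f"{name:20s} {ratio:8.1f}")
```

A high ratio favors cheap models even when their absolute reward is lower, so it complements the ranking by normalized reward rather than replacing it.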

Interaction-Time Scaling (ITS)

Interaction-Time Scaling (ITS) refers to a model's ability to continuously increase its end performance through multi-turn interactions. When this interactive performance surpasses the model's idealized single-turn performance on a fully specified, unambiguous task, we say it satisfies the ITS law. As user patience grows and interaction turns accumulate, performance keeps improving, demonstrating that the model can sustain effective communication over extended dialogue. So far, we have found that only claude-3-7-sonnet satisfies the ITS law.
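
The definition above can be sketched as a simple check. This is one possible reading, not the benchmark's official test: it assumes performance is measured at a sequence of increasing user-patience levels, requires that it never drops, and requires that the final value exceed the single-turn reference. The function name and the example numbers are hypothetical.

```python
def satisfies_its(interactive_rewards, single_turn_reward):
    """Check one reading of the ITS law.

    interactive_rewards: end performance at increasing user-patience levels.
    single_turn_reward: idealized performance on the fully specified task.
    """
    if not interactive_rewards:
        return False
    # Performance should not drop as more interaction turns are allowed.
    non_decreasing = all(
        later >= earlier
        for earlier, later in zip(interactive_rewards, interactive_rewards[1:])
    )
    # The final interactive result must beat the single-turn reference.
    surpasses = interactive_rewards[-1] > single_turn_reward
    return non_decreasing and surpasses


# Made-up values for illustration only; they are not benchmark results.
print(satisfies_its([20.0, 25.0, 31.0], single_turn_reward=30.0))
print(satisfies_its([20.0, 25.0, 28.0], single_turn_reward=30.0))
```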

📦 Dataset Details

Dataset Description

Database: The complete PostgreSQL database can be downloaded from the Google Drive. Check the Quick Eval section for more details.
data: Each data instance contains the following main parts:
- selected_database: The name of the database.
- query: The unambiguous user query.
- amb_user_query: The user query with injected ambiguities.
- user_query_ambiguity: The ambiguities injected into the user query.
- non_critical_ambiguity: The non-critical ambiguities like order, limit, etc.
- knowledge_ambiguity: The ambiguities created by masked external knowledge.
- sol_sql: The ground truth SQL solution.
- preprocess_sql: SQL queries to run before executing the solution or prediction.
- clean_up_sql: SQL queries to run after the test cases to revert any changes made to the database.
- test_cases: A set of test cases to validate the predicted SQL.
- follow_up: The labeled follow up questions.
- external_knowledge: The external knowledge related to the specific task.
evaluation: The evaluation code is available in the ./evaluation directory.
Curated by: BIRD Team & Google Cloud
License: cc-by-sa-4.0
HuggingFace Dataset Card: bird-interact-lite
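
The field list above implies both a schema and an execution order: preprocess_sql runs before the prediction, and clean_up_sql runs after the test cases. The sketch below checks an instance for the listed fields and lays out that order. It rests on assumptions not stated in this README: each instance is a Python dict, preprocess_sql and clean_up_sql are lists of SQL strings, and the ground-truth fields (sol_sql, test_cases) are absent from the public files. The helper names are hypothetical, and the real evaluation code lives in ./evaluation.

```python
# Fields from the dataset description; ground-truth fields are kept separate
# because they are distributed on request rather than with the data.
PUBLIC_FIELDS = (
    "selected_database", "query", "amb_user_query", "user_query_ambiguity",
    "non_critical_ambiguity", "knowledge_ambiguity", "preprocess_sql",
    "clean_up_sql", "follow_up", "external_knowledge",
)
GROUND_TRUTH_FIELDS = ("sol_sql", "test_cases")


def missing_fields(instance: dict, with_ground_truth: bool = False) -> list[str]:
    """Return the expected field names that the instance does not contain."""
    expected = PUBLIC_FIELDS + (GROUND_TRUTH_FIELDS if with_ground_truth else ())
    return [name for name in expected if name not in instance]


def execution_order(instance: dict, predicted_sql: str) -> list[tuple[str, str]]:
    """List (stage, sql) pairs in the order implied by the field descriptions."""
    steps = [("preprocess", sql) for sql in instance.get("preprocess_sql", [])]
    steps.append(("prediction", predicted_sql))
    # Test cases run here, before any clean-up; their format is not modeled.
    steps.append(("test_cases", ""))
    steps += [("clean_up", sql) for sql in instance.get("clean_up_sql", [])]
    return steps
```

Running missing_fields on every instance before an evaluation is a cheap way to catch a truncated or outdated data file.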

Dataset Uses

To avoid data leakage through auto-crawling, we do not include the GT solution SQL and test cases with the data. Please email bird.bench25@gmail.com with the tag [bird-interact-lite GT&Test Cases] in the title to receive the full set, which will be sent automatically.

Folder Structure

.
├── LICENSE
├── README.md
├── bird_interact_conv
│   ├── ...
│   └── README.md
├── bird_interact_agent
│   ├── ...
│   └── README.md
├── evaluation
│   ├── docker-compose.yml
│   ├── env
│   ├── postgre_table_dumps
│   ├── run
│   └── src
├── materials
│   ├── ...
└── requirements.txt

Details on running a-interact can be found in ./bird_interact_agent/README.md, and details on running c-interact can be found in ./bird_interact_conv/README.md.

📋 Todo Lists

Release lite version, bird-interact-lite (270).
Release conversational version, bird-interact-conv.
Release agent version, bird-interact-agent.
Release Full bird-interact-full (600).
SFT / RL a User Simulator

Acknowledgement

We would like to express our sincere gratitude to Irina Saparina, Mohammadreza Pourreza, Mehdi Bouzouina, Hailong Li, Jiatong Shi, and Professor Shinji Watanabe for their fruitful discussions and valuable insights that helped improve this project.

Created By:

BIRD Team & Google Cloud
