The full version of the paper is available here.
This repository is the official implementation of "Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines".
Data preparation (DP) transforms raw data into a form suitable for downstream applications, typically by composing operations into executable pipelines. Building such pipelines is time-consuming and requires sophisticated programming skills, posing a significant barrier for non-experts. To lower this barrier, we introduce Text-to-Pipeline, a new task that translates natural language data preparation instructions into DP pipelines, and PARROT, a large-scale benchmark to support systematic evaluation. To ensure realistic DP scenarios, PARROT is built by mining transformation patterns from production pipelines and instantiating them on 23,009 real-world tables, resulting in ~18,000 tasks spanning 16 core operators. Our empirical evaluation on PARROT reveals a critical failure mode in cutting-edge LLMs: they struggle not only with multi-step compositional logic but also with semantic parameter grounding. We therefore establish a strong baseline with Pipeline-Agent, an execution-aware agent that iteratively reflects on intermediate states. While it achieves state-of-the-art performance, a significant gap remains, underscoring the deep, unsolved challenges captured by PARROT, which provides an essential, large-scale testbed for developing and evaluating the next generation of autonomous data preparation agents.
The overall architecture of the framework is illustrated below:
A high-resolution version is available for download here.
PARROT is a large-scale benchmark designed to evaluate the ability of language models to synthesize multi-step, executable data preparation pipelines from natural language instructions.
Each instance in the dataset is a 5-tuple:
(input_table, instruction, DSL program, executable_code, output_table)
- `input_table`: A real-world table sampled from diverse domains.
- `instruction`: A natural language description of the desired transformation.
- `DSL program`: A symbolic DSL program representing the multi-step pipeline.
- `executable_code`: Compiled Python (Pandas) or SQL code.
- `output_table`: The expected result of executing the pipeline.
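For illustration, a single task could look like the following Python sketch. The field names, DSL syntax, and values are hypothetical and only meant to show how the five components fit together, not the benchmark's exact schema.

```python
# Hypothetical example of one PARROT task; field names and DSL syntax are
# illustrative assumptions, not the benchmark's exact schema.
example_task = {
    "input_table": "employees.csv",   # a real-world table in the actual dataset
    "instruction": "Drop rows with a missing salary, then rename the "
                   "'emp_name' column to 'employee'.",
    "dsl_program": [                  # symbolic multi-step pipeline
        {"op": "dropna", "args": {"subset": ["salary"]}},
        {"op": "rename", "args": {"columns": {"emp_name": "employee"}}},
    ],
    "executable_code": (              # compiled Pandas backend code
        "df = df.dropna(subset=['salary'])\n"
        "df = df.rename(columns={'emp_name': 'employee'})"
    ),
    "output_table": "employees_out.csv",  # expected result of executing the pipeline
}
```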
- Realistic: Instructions are generated over real tables sourced from public datasets and forums.
- Executable: All programs are validated by executing the code and verifying output correctness.
- Multi-step: Pipelines include complex sequences such as `filter → groupby → sort → rename` (see the Pandas sketch after this list).
- Multi-backend: Supports the DSL and Pandas code, and can be extended to SQL or Spark via a modular compiler backend.
- Diverse: Covers 16 core data preparation operators (e.g., `pivot`, `explode`, `join`, `dropna`).
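As a concrete illustration of the multi-step example above, here is a minimal sketch of what `filter → groupby → sort → rename` could compile to on the Pandas backend; the table and column names are made up for demonstration and are not taken from the benchmark.

```python
import pandas as pd

# Toy input table; column names are hypothetical.
df = pd.DataFrame({
    "region": ["EU", "US", "EU", "EU"],
    "category": ["a", "a", "b", "a"],
    "price": [10.0, 12.0, 8.0, 14.0],
})

# filter -> groupby -> sort -> rename, expressed directly in Pandas.
result = (
    df[df["region"] == "EU"]                                # filter
    .groupby("category", as_index=False)["price"].mean()    # groupby + aggregate
    .sort_values("price", ascending=False)                  # sort
    .rename(columns={"price": "avg_price"})                 # rename
)
print(result)
```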
The dataset is split into train/, dev/, and test/ directories, each containing:
- `benchmark.jsonl`: Metadata including task descriptions, instructions, intents, and transformation chains.
- `csv_files.zip`: Compressed archive containing the serialized input and target tables for all tasks in CSV format.
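A minimal loading sketch is shown below. It assumes the tables in `csv_files.zip` are referenced by file name from the JSONL metadata; the exact keys inside `benchmark.jsonl` (e.g., `input_table`, `instruction`) are assumptions and may differ from the actual schema.

```python
import json
import zipfile
from pathlib import Path

import pandas as pd

split_dir = Path("datas/test")   # train/, dev/, or test/

# Unpack the serialized input/target tables once.
with zipfile.ZipFile(split_dir / "csv_files.zip") as zf:
    zf.extractall(split_dir / "csv_files")

# Iterate over task metadata; the keys below are assumed names --
# adjust them to the actual benchmark.jsonl schema.
with open(split_dir / "benchmark.jsonl") as f:
    for line in f:
        task = json.loads(line)
        input_df = pd.read_csv(split_dir / "csv_files" / task["input_table"])
        print(task["instruction"], input_df.shape)
```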
- Text-to-Pipeline generation
- Semantic parsing for data pipelines
- Benchmarking LLMs on structured data preparation
- Agent-based data management
Results are reported as Execution Accuracy (EA), Program Validity (PV), and Operator Accuracy (OA), each broken down by task difficulty (Easy / Medium / Hard / Overall).

| Model | EA Easy | EA Medium | EA Hard | EA Overall | PV Easy | PV Medium | PV Hard | PV Overall | OA Easy | OA Medium | OA Hard | OA Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot LLMs (non-reasoning) | ||||||||||||
| GPT-4o-mini | 75.17 | 59.55 | 49.79 | 62.88 | 87.92 | 73.83 | 62.34 | 76.38 | 67.08 | 70.44 | 69.6 | 69.22 |
| GPT-4o | 89.03 | 79.06 | 72.38 | 71.00 | 89.04 | 79.07 | 72.38 | 81.12 | 64.24 | 70.7 | 74.44 | 69.27 |
| Gemini-2.5-Pro | 80.09 | 62.09 | 52.72 | 66.26 | 91.5 | 77.65 | 67.78 | 80.4 | 68.75 | 64.54 | 67.72 | 66.95 |
| DeepSeek-V3 | 78.52 | 63.79 | 56.07 | 67.19 | 89.49 | 77.09 | 68.62 | 67.31 | 66.7 | 68.32 | 72.62 | 68.54 |
| Zero-shot LLMs (reasoning) | ||||||||||||
| GPT-o3-mini | 82.1 | 68.88 | 67.36 | 72.86 | 92.39 | 81.47 | 79.92 | 84.7 | 67.86 | 70.51 | 72 | 69.91 |
| GPT-o4-mini | 81.87 | 69.17 | 64.02 | 72.36 | 94.85 | 83.31 | 82.01 | 86.79 | 64.47 | 68.76 | 68.61 | 67.36 |
| DeepSeek-R1 | 77.18 | 58.84 | 41.84 | 61.81 | 89.49 | 72.7 | 55.23 | 75.09 | 67.45 | 68.06 | 67.39 | 67.75 |
| Fine-tuned LLMs | ||||||||||||
| Qwen2.5-Coder-1.5B | 67.79 | 59.41 | 50.63 | 60.59 | 77.85 | 70.58 | 64.02 | 71.79 | 67.11 | 70.33 | 66.55 | 68.65 |
| Qwen2.5-Coder-3B | 83.67 | 69.73 | 68.62 | 74.01 | 91.05 | 80.91 | 82.01 | 84.35 | 82.03 | 81.76 | 81.93 | 81.87 |
| Qwen2.5-Coder-7B | 84.12 | 69.87 | 68.20 | 74.15 | 91.5 | 82.18 | 81.59 | 85.07 | 82.96 | 82.96 | 81.59 | 82.53 |
| Model | EA Easy | EA Medium | EA Hard | EA Overall | PV Easy | PV Medium | PV Hard | PV Overall | OA Easy | OA Medium | OA Hard | OA Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tool Calling | 71.62 | 58.36 | 47.67 | 60.48 | 86.48 | 66.53 | 58.13 | 71.07 | 67.79 | 47.15 | 35.4 | 51.31 |
| Plan-and-Solving | 63.69 | 43.19 | 30.23 | 47.4 | 74.52 | 52.53 | 38.37 | 57 | 61.04 | 42.83 | 30.09 | 46.36 |
| Chain-of-Tables | 50.67 | 17.9 | 9.3 | 26.27 | 81.76 | 74.32 | 79.07 | 77.39 | - | - | - | - |
| Pipeline-Agent | ||||||||||||
| &nbsp;&nbsp;- GPT-4o-mini | 70.27 | 61.08 | 54.65 | 62.72 | 88.51 | 76.26 | 67.44 | 78.41 | 67.56 | 52.04 | 41.02 | 54.79 |
| &nbsp;&nbsp;- GPT-4o | 77.70 | 78.21 | 67.44 | 76.17 | 96.62 | 88.33 | 82.56 | 89.82 | 78.04 | 72.32 | 65.91 | 72.92 |
| &nbsp;&nbsp;- DeepSeek-V3 | 66.89 | 60.7 | 48.84 | 60.49 | 91.21 | 78.99 | 70.93 | 81.26 | 68.47 | 58.18 | 47.53 | 59.42 |
The dataset is publicly available on Hugging Face:
To install requirements:
pip install -r requirements.txt
To quickly set up and run the benchmark, follow the steps below:
Clone the repository and install the required dependencies:
git clone https://github.com/A11en0/Text-to-Pipeline-Benchmark.git
cd Text-to-Pipeline-Benchmark
pip install -r requirements.txt

Alternatively, you can set up the environment using the provided YAML file:
conda env create -f environment.yml
conda activate PARROT

Navigate to the datas directory and run the data.py script to prepare the dataset:
cd datas
python data.py

Before running any scripts, please set the PYTHONPATH to the project root directory (run this command from the repository root):

export PYTHONPATH=$(pwd)

Note: The benchmark data is not included in this repository. Please download the dataset from the Hugging Face link above if needed. Normally, you do not need to run the following command unless you want to regenerate or customize the benchmark.
To generate the benchmark:
python -m src.main --config config/default_config.yaml --n_tasks 50 --complexity 8

The following commands can be used to run each baseline model:

python src/baseline/chain_of_tables/run_reasoning.py --benchmark_subdir test --model gpt-4o-mini --api_key <YOUR_API_KEY>

python src/baseline/gpt_agent/run_agent.py --model gpt-4o --data_dir datas/test --num_tasks 10 --api_key <YOUR_API_KEY>

python src/baseline/llm_prompt/reasoning.py --benchmark_subdir test --models gpt-4o

python src/baseline/nl2pandas/run_pandas.py --model gpt-4o-mini --data_dir datas/test

bash src/baseline/nl2sql/run.sh

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this benchmark or dataset in your research, please cite the following work:
@article{paper_citation_key,
title={Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines},
author={Author Name and Collaborators},
journal={Journal Name},
year={2025},
volume={XX},
number={YY},
pages={ZZZ},
publisher={Publisher Name}
}