daVinci-Dev is a family of large language models trained for agentic software engineering.
This repo provides:
- The paper PDF: daVinci-Dev.pdf
- A high-performance data processing pipeline under pipeline/ that calls the GitHub API to construct contextually-native PR trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$
We will open-source model checkpoints on Hugging Face:
| Model | Description | Link |
|---|---|---|
| daVinci-Dev-72B | Final model (agent-native mid-training + env-native SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B |
| daVinci-Dev-32B | Final model (agent-native mid-training + env-native SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B |
| daVinci-Dev-72B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B-MT |
| daVinci-Dev-32B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B-MT |
Datasets are released through Hugging Face:
| Dataset | Description | Link |
|---|---|---|
| daVinci-Dev | Agent-native trajectories used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |
High-level composition (see the paper for details):
- Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived, Python variant)
- Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts, test-passing subset)
The directory pipeline/ contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build the contextually-native PR trajectories.
| Pipeline | Description | Link |
|---|---|---|
| daVinci-Dev Pipeline | A high-performance pipeline used to build the contextually-native PR trajectories | pipeline/ |
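To make "structured PR representation" concrete, here is a minimal sketch of how GitHub API responses might be folded into one record. It is not the code in pipeline/. The `PRRecord` and `FileChange` classes and the `build_pr_record` helper are hypothetical names chosen for illustration. The sketch assumes the inputs are the JSON objects returned by the GitHub REST API for a pull request and for its file list (fields `number`, `title`, `body`, `filename`, `status`, `patch`).

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FileChange:
    """One changed file in a PR (hypothetical schema)."""
    path: str
    status: str
    patch: str


@dataclass
class PRRecord:
    """A structured view of one PR (hypothetical schema)."""
    repo: str
    number: int
    title: str
    body: str
    files: list[FileChange] = field(default_factory=list)


def build_pr_record(
    repo: str, pr: dict[str, Any], files: list[dict[str, Any]]
) -> PRRecord:
    """Combine a PR object and its file list into a single record.

    `pr` and `files` are assumed to be already-fetched GitHub REST API
    payloads; no network call is made here.
    """
    changes = [
        FileChange(
            path=f["filename"],
            status=f.get("status", ""),
            # Binary or very large files may come without a "patch" field.
            patch=f.get("patch") or "",
        )
        for f in files
    ]
    return PRRecord(
        repo=repo,
        number=pr["number"],
        title=pr.get("title") or "",
        # The API returns null for an empty PR description.
        body=pr.get("body") or "",
        files=changes,
    )
```

Keeping the fetch step separate from the record-building step means the transformation can be tested on saved API responses without touching the network or consuming rate limit.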
The directory env_traj_utils/ provides utilities for converting and tokenizing environmentally-native trajectories:
| Script | Description |
|---|---|
| convert_trajectories.py | Convert SWE-agent trajectories to XML function calling format |
| tokenize_trajectories.py | Tokenize trajectories and filter by length |
See the env_traj_utils README for quickstart instructions on converting env-native.jsonl to formats compatible with training frameworks like SLIME.
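As a rough illustration of the "tokenize and filter by length" step, the sketch below reads JSONL lines, counts tokens, and keeps only records within a budget. It is not the logic of tokenize_trajectories.py. The function names, the `text` field, and the `num_tokens` field are hypothetical, and a whitespace splitter stands in for the model tokenizer that a real run would presumably use.

```python
import json
from typing import Callable, Iterable, Iterator


def whitespace_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer; swap in a real model tokenizer in practice."""
    return text.split()


def filter_by_length(
    lines: Iterable[str],
    max_tokens: int,
    tokenize: Callable[[str], list] = whitespace_tokenize,
    text_key: str = "text",  # hypothetical field name
) -> Iterator[dict]:
    """Yield JSONL records whose token count is at most `max_tokens`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        n = len(tokenize(record[text_key]))
        if n <= max_tokens:
            # Attach the count so downstream steps need not re-tokenize.
            yield {**record, "num_tokens": n}
```

Because `tokenize` is a parameter, the same filter can be reused with any callable that maps a string to a sequence of tokens or token IDs.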
This project is a mixed release:
- PR-derived subset: the open-source release includes only PRs from repositories detected as having a permissive license.
- Executable rollout subset: derived from SWE-rebench, licensed under CC-BY-4.0.
- daVinci-Dev models: released under the Qwen license. Users should verify the licensing status of any generated code before using it in production.
- daVinci-Dev pipeline: released under the Apache-2.0 license.
Downstream users are responsible for ensuring their usage complies with the licenses of the underlying sources.
If you use this work, please cite the daVinci-Dev paper.
@misc{zeng2026davincidevagentnativemidtrainingsoftware,
title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
year={2026},
eprint={2601.18418},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2601.18418},
}


