daVinci-Dev: Agent-native Mid-training for Software Engineering

Overview

daVinci-Dev is a family of large language models trained for agentic software engineering.

This repo provides:

The paper PDF: daVinci-Dev.pdf
A high-performance data processing pipeline under pipeline/ that calls the GitHub API to construct contextually-native PR trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$

Model Zoo

We will open-source model checkpoints on Hugging Face:

Model	Description	Link
`daVinci-Dev-72B`	Final model (agent-native mid-training + env native SFT)	https://huggingface.co/GAIR/daVinci-Dev-72B
`daVinci-Dev-32B`	Final model (agent-native mid-training + env native SFT)	https://huggingface.co/GAIR/daVinci-Dev-32B
`daVinci-Dev-72B-MT`	MT checkpoint (after agent-native mid-training, before SFT)	https://huggingface.co/GAIR/daVinci-Dev-72B-MT
`daVinci-Dev-32B-MT`	MT checkpoint (after agent-native mid-training, before SFT)	https://huggingface.co/GAIR/daVinci-Dev-32B-MT
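
As a minimal sketch of how the checkpoints in the table could be loaded once they are published, the snippet below maps model size and training stage to the repository IDs listed above and loads them with the Hugging Face `transformers` library. The helper names `repo_id` and `load_checkpoint` and the `"final"`/`"mt"` stage labels are hypothetical and are not part of this repo; dtype and device placement are left at library defaults and would likely need tuning for models of this size.

```python
# Repository IDs copied from the Model Zoo table; the keys are illustrative labels.
MODEL_REPOS = {
    ("72B", "final"): "GAIR/daVinci-Dev-72B",
    ("32B", "final"): "GAIR/daVinci-Dev-32B",
    ("72B", "mt"): "GAIR/daVinci-Dev-72B-MT",
    ("32B", "mt"): "GAIR/daVinci-Dev-32B-MT",
}


def repo_id(size: str, stage: str = "final") -> str:
    """Return the Hugging Face repository ID for a model size and stage."""
    return MODEL_REPOS[(size, stage)]


def load_checkpoint(size: str, stage: str = "final"):
    """Load tokenizer and model weights (hypothetical helper).

    transformers is imported lazily so that repo_id() can be used
    without the library installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(size, stage)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tokenizer, model
```

For example, `load_checkpoint("32B", "mt")` would fetch the mid-training checkpoint rather than the final model, assuming the checkpoint is available at that ID.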

Datasets

Datasets are released through Hugging Face:

Dataset	Description	Link
`daVinci-Dev`	Agent-native trajectories used in our training recipe (as permitted)	https://huggingface.co/datasets/GAIR/daVinci-Dev

High-level composition (see the paper for details):

Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived, Python variant)
Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts, test-passing subset)

Pipeline

The directory pipeline/ contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$.

Pipeline	Description	Link
daVinci-Dev Pipeline	A high-performance pipeline used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$	`pipeline/`
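
To illustrate what a structured PR representation might look like, here is a small sketch that assembles one record from two GitHub REST API responses: the pull request object (`GET /repos/{owner}/{repo}/pulls/{pull_number}`) and its file list (`.../pulls/{pull_number}/files`). The JSON field names (`number`, `title`, `body`, `base.sha`, `filename`, `status`, `patch`) are those returned by the GitHub API; the `PRRecord` and `FileChange` layouts, the helper names, and the `.py` suffix filter are assumptions made for illustration and may differ from what pipeline/ actually does.

```python
from dataclasses import dataclass, field

API_ROOT = "https://api.github.com"


def pr_url(owner: str, repo: str, number: int) -> str:
    """URL of the pull request object in the GitHub REST API."""
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls/{number}"


def pr_files_url(owner: str, repo: str, number: int) -> str:
    """URL of the list of files changed by the pull request."""
    return pr_url(owner, repo, number) + "/files"


@dataclass
class FileChange:  # hypothetical layout
    path: str
    status: str
    patch: str


@dataclass
class PRRecord:  # hypothetical layout
    repo: str
    number: int
    title: str
    description: str
    base_sha: str
    changes: list = field(default_factory=list)


def build_record(repo_full_name, pr_json, files_json, suffix=".py"):
    """Combine already-fetched API responses into one record.

    Keeping only files that end with `suffix` is an assumption that
    mirrors the "Python variant" of the dataset.
    """
    changes = [
        # "patch" can be absent (e.g. binary or very large files).
        FileChange(f["filename"], f["status"], f.get("patch", ""))
        for f in files_json
        if f["filename"].endswith(suffix)
    ]
    return PRRecord(
        repo=repo_full_name,
        number=pr_json["number"],
        title=pr_json["title"],
        description=pr_json.get("body") or "",  # body may be null
        base_sha=pr_json["base"]["sha"],
        changes=changes,
    )
```

The sketch deliberately separates URL construction from record building so that the second step can be exercised on cached responses without network access.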

Utilities for Environmentally-native Trajectories

The directory env_traj_utils/ provides utilities for converting environmentally-native trajectories ($\mathcal{D}^{\text{env}}$) to LLM-trainable formats:

Script	Description
`convert_trajectories.py`	Convert SWE-agent trajectories to XML function calling format
`tokenize_trajectories.py`	Tokenize trajectories and filter by length

See the env_traj_utils README for quickstart instructions on converting env-native.jsonl to formats compatible with training frameworks like SLIME.
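
The length-filtering step can be pictured with the following sketch, which reads JSONL lines and keeps trajectories whose total token count stays within a budget. It is not the code in `tokenize_trajectories.py`: the record schema (a `messages` list of objects with a `content` string) is an assumption, and whitespace splitting stands in for a real model tokenizer.

```python
import json


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated pieces.

    A real run would use the tokenizer of the model being trained.
    """
    return len(text.split())


def trajectory_length(traj: dict) -> int:
    """Total token count over all messages (assumed schema)."""
    return sum(count_tokens(m["content"]) for m in traj["messages"])


def filter_by_length(lines, max_tokens: int) -> list:
    """Parse JSONL lines and keep trajectories within the token budget."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        traj = json.loads(line)
        if trajectory_length(traj) <= max_tokens:
            kept.append(traj)
    return kept
```

Because `filter_by_length` accepts any iterable of strings, it can be applied directly to an open file handle, for instance one opened on env-native.jsonl.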

License

This project is a mixed release:

PR-derived subset: the open-source release includes only PRs from repositories detected as having a permissive license.
Executable rollout subset: derived from SWE-rebench, licensed under CC-BY-4.0.
daVinci-Dev models: released under the Qwen license. Users should verify the licensing status of any generated code before using it in production.
daVinci-Dev pipeline: released under the Apache-2.0 license.

Downstream users are responsible for ensuring their usage complies with the licenses of the underlying sources.
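
One way a permissive-license check could work is sketched below, using the `license.spdx_id` field that the GitHub REST API returns for a repository. The set of SPDX identifiers treated as permissive and the helper name `is_permissive` are assumptions for illustration; the criteria actually used for the release are not specified here.

```python
# Assumed allow-list of SPDX identifiers; not taken from this repo.
PERMISSIVE_SPDX = frozenset(
    {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}
)


def is_permissive(repo_json: dict) -> bool:
    """Return True if the repository's detected license is in the allow-list.

    `repo_json` is a repository object from the GitHub REST API. Its
    "license" field is null when no license was detected, and its
    spdx_id is "NOASSERTION" when the license could not be identified.
    """
    license_info = repo_json.get("license") or {}
    return license_info.get("spdx_id") in PERMISSIVE_SPDX
```

Repositories with a missing or unidentified license are rejected by this check, which errs on the side of excluding data.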

Citation

If you use this work, please cite the daVinci-Dev paper.

@misc{zeng2026davincidevagentnativemidtrainingsoftware,
      title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
      author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
      year={2026},
      eprint={2601.18418},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.18418},
}
