GAIR-NLP/daVinci-Dev

daVinci-Dev: Agent-native Mid-training for Software Engineering

Overview

daVinci-Dev is a family of large language models trained for agentic software engineering.

This repo provides:

  • The paper PDF: daVinci-Dev.pdf
  • A high-performance data processing pipeline under pipeline/ that calls the GitHub API to construct contextually-native PR trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$

Model Zoo

We will open-source model checkpoints on Hugging Face:

| Model | Description | Link |
| --- | --- | --- |
| daVinci-Dev-72B | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B |
| daVinci-Dev-32B | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B |
| daVinci-Dev-72B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B-MT |
| daVinci-Dev-32B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B-MT |

Datasets

Datasets are released through Hugging Face:

| Dataset | Description | Link |
| --- | --- | --- |
| daVinci-Dev | Agent-native trajectories used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |

High-level composition (see the paper for details):

  • Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived, Python variant)
  • Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts, test-passing subset)

Pipeline

The directory pipeline/ contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$.

| Pipeline | Description | Link |
| --- | --- | --- |
| daVinci-Dev Pipeline | A high-performance pipeline used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$ | pipeline/ |
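
As a rough illustration of what "calling the GitHub API and constructing a structured PR representation" can look like, here is a minimal sketch. It is not the code in pipeline/. The `PullRequestRecord` fields, the `pr_endpoints` and `build_record` helpers, and the choice to keep only `.py` files are assumptions made for this example. The URL layout and the JSON keys (`number`, `title`, `body`, `base.sha`, `filename`) follow the public GitHub REST API for pull requests.

```python
from dataclasses import dataclass, field

GITHUB_API = "https://api.github.com"


def pr_endpoints(owner: str, repo: str, number: int) -> dict:
    """Return the GitHub REST API URLs that describe a single pull request."""
    base = f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{number}"
    return {"pull": base, "files": f"{base}/files", "commits": f"{base}/commits"}


@dataclass
class PullRequestRecord:
    """Hypothetical structured view of one PR (not the repo's actual schema)."""
    repo: str
    number: int
    title: str
    body: str
    base_sha: str
    changed_files: list = field(default_factory=list)


def build_record(repo: str, pull: dict, files: list) -> PullRequestRecord:
    """Combine the 'pull' and 'files' API responses into one record."""
    return PullRequestRecord(
        repo=repo,
        number=pull["number"],
        title=pull["title"],
        body=pull.get("body") or "",  # the API returns null for an empty body
        base_sha=pull["base"]["sha"],
        # Assumption: the Python variant keeps only .py files.
        changed_files=[f["filename"] for f in files if f["filename"].endswith(".py")],
    )
```

The JSON responses would be fetched from the URLs returned by `pr_endpoints` with any HTTP client and an access token; the sketch leaves out the network layer so the parsing step can be checked offline.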

Utilities for Environmentally-native Trajectories

The directory env_traj_utils/ provides utilities for converting environmentally-native trajectories ($\mathcal{D}^{\text{env}}$) to LLM-trainable formats:

| Script | Description |
| --- | --- |
| convert_trajectories.py | Convert SWE-agent trajectories to XML function calling format |
| tokenize_trajectories.py | Tokenize trajectories and filter by length |

See the env_traj_utils README for quickstart instructions on converting env-native.jsonl to formats compatible with training frameworks like SLIME.
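
The "filter by length" step can be pictured with the sketch below. It is not tokenize_trajectories.py itself. It assumes each JSONL line holds a `messages` list whose items have a `content` string, and it uses a whitespace tokenizer as a stand-in for the model tokenizer. The schema and the function names are hypothetical.

```python
import json
from typing import Callable, Iterable, Iterator


def whitespace_tokenize(text: str) -> list:
    """Stand-in tokenizer; a real run would use the model's own tokenizer."""
    return text.split()


def trajectory_length(traj: dict, tokenize: Callable) -> int:
    """Total token count over all messages in one trajectory."""
    return sum(len(tokenize(m.get("content") or "")) for m in traj.get("messages", []))


def filter_by_length(
    lines: Iterable[str],
    max_tokens: int,
    tokenize: Callable = whitespace_tokenize,
) -> Iterator[dict]:
    """Yield parsed trajectories whose token count is at most max_tokens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        traj = json.loads(line)
        if trajectory_length(traj, tokenize) <= max_tokens:
            yield traj
```

Passing an open file handle as `lines` streams the input one trajectory at a time, so the whole file never has to fit in memory.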

License

This project is a mixed release:

  • PR-derived subset: the open-source release includes only PRs from repositories detected as having a permissive license.
  • Executable rollout subset: derived from SWE-rebench, licensed under CC-BY-4.0.
  • daVinci-Dev models: released under the Qwen license. Users should verify the licensing status of any generated code before using it in production.
  • daVinci-Dev pipeline: released under the Apache-2.0 license.

Downstream users are responsible for ensuring their usage complies with the licenses of the underlying sources.
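
For downstream users who want to apply a similar check to their own data, a permissive-license filter might look like the sketch below. The allowlist is an assumption for illustration only; this README does not state which licenses the release treats as permissive. The `license.spdx_id` field matches the repository object returned by the GitHub REST API, where `license` is null when no license is detected.

```python
# Assumed allowlist of SPDX identifiers; adjust to your own policy.
PERMISSIVE_SPDX = frozenset(
    {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}
)


def is_permissive(repo_metadata: dict) -> bool:
    """Return True if the repository's detected license is in the allowlist."""
    license_info = repo_metadata.get("license") or {}  # null when undetected
    return license_info.get("spdx_id") in PERMISSIVE_SPDX
```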

Citation

If you use this work, please cite the daVinci-Dev paper.

@misc{zeng2026davincidevagentnativemidtrainingsoftware,
      title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
      author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
      year={2026},
      eprint={2601.18418},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.18418},
}
