🚀 Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

Links: arXiv · Monolith · HuggingFace

Abstract: Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency.

While current LLMs excel at generating code that works, can we trust that generated code in real-world applications? Often, the answer is NO: research to date has largely focused on functional correctness, leaving code efficiency as a significant bottleneck.

To tackle this, we introduce Afterburner, an iterative framework that leverages reinforcement learning (RL) to teach LLMs to generate code that is not only correct but also efficient. Afterburner creates a self-improving loop that continually refines code for better performance:

  • 🔮 While SFT & DPO methods plateau, our RL approach shows continuous improvement.
  • 📈 Pass@1 boosted from 47% to 62%.
  • 🏆 The likelihood of outperforming human submissions in efficiency jumps from 31% to 45%.


Overview


In RL, there are three key components: algorithm, environment, and priors. For a long time, RL researchers focused mostly on the algorithm (e.g. REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO…) – the intellectual core of how an agent learns – while treating the environment and priors as fixed or minimal. For example, Sutton and Barto’s classical textbook is all about algorithms and almost nothing about environments or priors. However, in the era of deep RL, it became clear that environments matter a lot empirically: an algorithm’s performance is often highly specific to the environment it was developed and tested in. -- The Second Half

In this work, we introduce a novel iterative optimization framework (IOF) designed to enhance LLM-generated code efficiency through a closed-loop system of generation and evaluation: the Monolith execution environment provides empirical performance feedback, and the Afterburner model, trained on Venus, refines the code accordingly.
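To make the closed loop concrete, here is a minimal Python sketch of one test-time optimization run. All names (iterative_optimize, sandbox.run, the feedback fields) are illustrative placeholders under assumed interfaces, not the actual Afterburner or Monolith APIs:

# Minimal sketch of the iterative optimization framework (IOF).
# All names here (model.generate, sandbox.run, result fields) are
# illustrative placeholders, not the actual Afterburner interfaces.

def iterative_optimize(task, model, sandbox, max_iters=4):
    """Generate code, measure it in the sandbox, and feed the metrics back."""
    best_code, best_score = None, float("inf")
    prompt = task.description
    for _ in range(max_iters):
        code = model.generate(prompt)                # LLM proposes a solution
        result = sandbox.run(code, task.test_cases)  # execute in isolation
        if result.passed and result.integral_score < best_score:
            # Integral score: lower is better (running time x memory area).
            best_code, best_score = code, result.integral_score
        # Close the loop: append empirical feedback to the next prompt.
        prompt = (
            f"{task.description}\n\nPrevious attempt:\n{code}\n"
            f"Feedback: passed={result.passed}, "
            f"time={result.runtime_ms} ms, memory={result.memory_mb} MB. "
            "Produce a more efficient solution."
        )
    return best_code, best_score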

Step 1. Dataset (Venus)

Venus is the dataset used to train Afterburner. It extends the original Mercury dataset and currently covers six languages: Python3, C++, JavaScript, Go, Rust, and Java.
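The Python preference pairs used in the LLaMA-Factory recipe below can be inspected with the datasets library. This sketch assumes the Elfsong/Venus_DPO_Data repository and its "integral" split (as declared in that recipe) are publicly accessible:

# Peek at the Python DPO preference pairs used to train Afterburner.
# Assumes the Hugging Face repo and split listed in the recipe below
# (Elfsong/Venus_DPO_Data, split "integral") are publicly available.
from datasets import load_dataset

dpo_pairs = load_dataset("Elfsong/Venus_DPO_Data", split="integral")
example = dpo_pairs[0]

print(example["prompt"][:200])    # problem description
print(example["chosen"][:200])    # more efficient solution
print(example["rejected"][:200])  # less efficient solution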


Step 2. Environment (Monolith)

Monolith is the code execution environment for Afterburner. It supports parallel code execution for RL rollouts (isolated containers with 100% CPU affinity) and high-resolution performance measurement (10 kHz). It measures three key metrics for each task from Venus: 1) running time, 2) memory usage, and 3) integral score (the integral area of running time versus memory usage).
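As a rough illustration of the integral score (not Monolith's actual measurement code), the idea is the area under the memory-over-time curve sampled during execution, which can be approximated with the trapezoid rule:

# Illustrative computation of an "integral score": the area under the
# memory-usage-over-time curve sampled by the profiler. This is a sketch
# of the idea, not Monolith's actual measurement code.

def integral_score(timestamps_s, memory_mb):
    """Trapezoid-rule area of memory (MB) over running time (s); lower is better."""
    area = 0.0
    for (t0, m0), (t1, m1) in zip(zip(timestamps_s, memory_mb),
                                  zip(timestamps_s[1:], memory_mb[1:])):
        area += 0.5 * (m0 + m1) * (t1 - t0)
    return area

# Example: a 10 kHz profiler would yield ~10,000 samples per second;
# a tiny hand-made trace is used here for readability.
print(integral_score([0.0, 0.1, 0.2, 0.3], [5.0, 12.0, 12.0, 6.0]))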


Step 3. Algorithm (Afterburner)

We explore three optimization strategies within the IOF: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO).


SFT tends to capture superficial patterns from mimicking examples. DPO internalizes static preferences based on pairwise comparisons from offline data. In contrast, through online interaction with execution feedback, GRPO cultivates an adaptive proficiency in code efficiency optimization, which enables it to explore and exploit the solution space effectively within an iterative, test-time optimization process.
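The "group relative" part of GRPO can be sketched in a few lines: for each task, a group of sampled solutions is scored by execution feedback, and each sample's advantage is its reward standardized against the rest of the group. This is an illustration of the general GRPO idea, not the repository's training code (see the GRPO recipe below):

# Sketch of GRPO's group-relative advantage: each sampled solution is
# rewarded from execution feedback and compared against the other
# samples for the same task. Illustrative only; see grpo/ in this repo
# for the actual reward function and training script.
import statistics

def group_relative_advantages(rewards):
    """Standardize rewards within one rollout group (one task, G samples)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for the same task; higher reward = more efficient code.
rewards = [0.10, 0.45, 0.30, 0.80]
print(group_relative_advantages(rewards))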


Recipe for SFT & DPO

# Add this json to 'LLaMA-Factory/data/dataset_info.json'
"venus_python_integral_dpo": {
  "hf_hub_url": "Elfsong/Venus_DPO_Data",
  "ranking": true,
  "columns": {
    "prompt": "prompt",
    "chosen": "chosen",
    "rejected": "rejected"
  },
  "split": "integral"
},
"venus_python_integral_sft": {
  "hf_hub_url": "Elfsong/Venus_SFT_Data",
  "columns": {
    "prompt": "prompt",
    "response": "response"
  },
  "split": "integral"
} 

Recipe for GRPO

# Step 1. Data Preparation
See https://github.com/Elfsong/Afterburner/blob/main/grpo/afterburner_dataset.py

# Step 2. Reward Function
See https://github.com/Elfsong/Afterburner/blob/main/grpo/afterburner_reward_function.py

# Step 3. Training
See https://github.com/Elfsong/Afterburner/blob/main/grpo/afterburner_train.sh
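For intuition, an execution-feedback reward for code efficiency typically (1) gates on functional correctness and (2) scores efficiency against a baseline. The sketch below is only an assumed, illustrative shape of such a reward; the actual function used for training lives in grpo/afterburner_reward_function.py:

# Illustrative shape of an execution-feedback reward for code efficiency.
# Failed code earns nothing; passing code earns more as its integral
# score improves on a baseline. Not the actual reward in this repository.

def efficiency_reward(passed, integral_score, human_integral_score):
    if not passed:
        return 0.0
    # Ratio < 1 means the model's code is more efficient than the baseline.
    ratio = integral_score / max(human_integral_score, 1e-9)
    return max(0.0, min(1.0, 1.0 - 0.5 * (ratio - 1.0)))

print(efficiency_reward(True, integral_score=2.0, human_integral_score=4.0))   # better than baseline
print(efficiency_reward(True, integral_score=6.0, human_integral_score=4.0))   # worse than baseline
print(efficiency_reward(False, integral_score=1.0, human_integral_score=4.0))  # failed tests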

Step 4. Evaluation (Litmus)

Despite achieving high functional correctness (pass@1), vanilla models generate code with strikingly inferior computational efficiency compared to human solutions. While stronger (larger) models exhibit marginally better code efficiency, this is not enough to close the fundamental gap. This pervasive efficiency deficit in LLM-generated code motivates dedicated optimization frameworks, such as Afterburner, to make generated code viable for real-world applications.
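One way to read the "outperforming human submissions" numbers above is as the fraction of tasks where the generated solution passes and achieves a better (lower) integral score than the paired human solution. The sketch below is an assumed formulation for illustration, not Litmus's actual evaluation code:

# Sketch of the "beats human" efficiency metric: the fraction of tasks where
# the generated solution both passes and has a lower (better) integral score
# than the paired human submission. Illustrative, not Litmus's actual code.

def beat_human_rate(results):
    """results: list of (passed, model_integral, human_integral) tuples."""
    wins = sum(1 for passed, model_s, human_s in results
               if passed and model_s < human_s)
    return wins / len(results)

# Tiny example with three tasks.
print(beat_human_rate([(True, 2.1, 3.0), (True, 5.0, 4.2), (False, 1.0, 2.0)]))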


Citation

@article{du2025afterburner,
  title={Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization},
  author={Du, Mingzhe and Luu, Anh Tuan and Liu, Yue and Qing, Yuhao and Huang, Dong and He, Xinyi and Liu, Qian and Ma, Zejun and Ng, See-kiong},
  journal={arXiv preprint arXiv:2505.23387},
  year={2025}
}
