https://arxiv.org/html/2410.01215v1 by Yuling Shi, Songsong Wang, Chengcheng Wan, Xiaodong Gu
Multi-Granularity Debugger (MGDebugger): Closing the Last Mile of Code Generation with Hierarchical Debugging
Background:
- Large language models have made significant strides in code generation, but the pass rate is often bottlenecked by subtle errors that require human intervention
- Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity
Introduction:
- MGDebugger: a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity
- Decomposes problematic code into a hierarchical tree structure of subfunctions
- Analyzes each subfunction and iteratively resolves bugs in a bottom-up manner
Approach:
- Proposed LLM-simulated Python executor to effectively test each subfunction
- Traces code execution and tracks important variable states to pinpoint errors accurately
Results:
- Achieves an 18.9% improvement in accuracy over seed generations in HumanEval
- Repair success rate of 97.6% in HumanEvalFix
- Effectively fixes bugs across different categories and difficulty levels, demonstrating robustness and effectiveness
Large Language Models (LLMs)
- Significant advances in AI-assisted coding tasks: GPT-4 (OpenAI), LLaMA (Touvron et al.), DeepSeek-Coder (Zhu et al.)
- Trained on vast corpora of text and code, understand and generate code snippets for various programming tasks
- Proficient in tasks like code completion, bug detection, competitive programming challenges
- Generated code may contain critical errors requiring human intervention to pass tests
Code Debugging Paradigm
- New development paradigm: large models generate the code, humans fix it
- Numerous efforts made to debug LLM-generated code
- Most popular approach: reuse the LLM generator with test-case execution feedback
- This increases pass rates but treats the erroneous program as a single, monolithic set of statements
Multi-granularity Debugger (MGDebugger)
- Novel debugging method for LLM-generated code
- Hierarchical, bottom-up strategy to systematically debug code
- Decomposes code into tree structure of sub-functions for semantic unit isolation
- Each sub-function debugged progressively from most granular to higher-level compositions
- Generates test cases derived from public test cases of main function and uses LLM-based execution simulator
- Precise and flexible error identification based on failed test cases
- Uncovers and rectifies bugs traditional methods may overlook
Experimental Results:
- Significantly outperforms existing debugging methods on HumanEval (Chen et al., 2021) and HumanEvalFix (Muennighoff et al., 2023) benchmarks
- Ablation studies confirm hierarchical debugging strategy's vital role
- Effective in handling diverse bug types and varying code lengths, demonstrating robustness and adaptability.
Related Work: Code Generation with LLMs
Advancements in Code Generation:
- GPT-4 (OpenAI, 2023), Codestral (Mistral AI team, 2024), DeepSeek-Coder (Zhu et al., 2024) have advanced code generation through:
- Instruction tuning and RLHF with mixed code and natural language data
Improving Code Generation:
- Approaches focus on improving quality of generated code using planning algorithms
- Transition from outlines to detailed implementations (Zhang et al., 2022; Yao et al., 2023; Zelikman et al., 2023; Zhou et al., 2023; Zheng et al., 2023)
- Sampling multiple programs and ranking to identify best one (Chen et al., 2023a; 2022; Ni et al., 2023)
- Leveraging multi-agent collaboration frameworks to enhance code generation quality (Zhang et al., 2024; Huang et al., 2023a; Dong et al., 2024)
MGDebugger Approach:
- Targets the post-generation phase, focusing on debugging and fixing errors that inevitably arise during code generation
Related Work: Repairing LLM-Generated Code
- Code repair is a critical aspect of software development (Just et al., 2014; Gupta et al., 2020; Yasunaga & Liang, 2021)
- Two main streams of research in repairing code generated by LLMs:
  - Training models to repair code (Huang et al., 2023b; Jiang et al., 2024; Ding et al., 2024; Zheng et al., 2024; Moon et al., 2024; Kumar et al., 2024)
  - Providing external feedback to pretrained models to fix code (Jiang et al., 2023; Chen et al., 2023b; Olausson et al., 2023; Zhong et al., 2024; Hu et al., 2024)
- MGDebugger requires no task-specific retraining and instead leverages the inherent capabilities of pretrained LLMs
- It falls under the category of work that uses pretrained models to fix code by reasoning over external feedback
Recent Debugging Techniques:
- Use execution results from test cases to guide LLMs in code correction (Zhang et al., 2023; Olausson et al., 2023; Bouzenia et al., 2024; Lee et al., 2024; Xia & Zhang, 2023)
- Reflexion (Shinn et al., 2023): LLMs reflect on generated code and use memory buffer for iterative refinement
- Self-Debugging (Chen et al., 2023b): LLMs explain or dry run generated programs, known as rubber duck debugging
- LDB (Zhong et al., 2024): Segments programs into basic blocks and tracks variable values during runtime after each block to verify correctness against task description.
MGDebugger's Hierarchical Approach:
- Debugging from low-level errors to high-level flaws, ensuring a more systematic and accurate debugging process
MGDebugger Methodology Overview:
- MGDebugger: A novel bottom-up hierarchical debugging method for repairing large language model (LLM)-generated code
- Workflow of MGDebugger: Illustrated in Figure 1; includes Hierarchical Code Decomposition, Subfunction Test Case Generation, LLM-Simulated Execution, and Bottom-up Debugging
Hierarchical Code Decomposition:
- Decomposes input buggy code into a hierarchical structure of subfunctions
- Enables systematic identification and resolution of bugs at various levels of granularity
Subfunction Test Case Generation:
- Generates test cases for each subfunction from public test cases of the main function
- Executes these test cases during debugging process
LLM-Simulated Execution:
- Simulates code execution for failed test cases using a large language model
- Monitors critical variables and state changes to pinpoint causes of errors
Bottom-up Debugging:
- Fixes identified bugs in each subfunction and updates it in the hierarchical structure
- Propagates changes to dependent functions through bottom-up debugging approach
Benefits:
- Tackles different types of bugs at various levels of abstraction
- Guarantees a cohesive and systematic debugging process throughout the entire code structure.
Code Decomposition using Hierarchical Structure
Advantages:
- Improves understanding of complex functions (Jain et al., 2023; Zelikman et al., 2023)
- Enables hierarchical debugging
- Facilitates isolated testing and debugging
Process:
- Decompose a complex function f into smaller subfunctions (f_1, …, f_n)
- Organize them as a tree with the main function f_root at the top and children representing the subfunctions it directly calls
- Formally, the tree can be written recursively as T(f) = (f, {T(g) : g is a subfunction directly called by f}), rooted at f_root
Principles:
- Each subfunction is a minimal reusable unit of code with a specific purpose
- Higher-level functions compose lower-level functions to realize more complex functionality
- The structure facilitates isolated testing and debugging
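To make the tree concrete, here is a minimal sketch that builds such a structure from already-extracted top-level subfunctions. The paper's decomposition step is LLM-driven, so the `SubFnNode` class and `build_tree` helper below are illustrative rather than the authors' implementation.

```python
import ast
from dataclasses import dataclass, field

@dataclass
class SubFnNode:
    """One node of the decomposition tree: a subfunction and the subfunctions it calls."""
    name: str
    source: str
    children: list["SubFnNode"] = field(default_factory=list)

def build_tree(code: str, root_name: str) -> SubFnNode:
    """Build the call tree rooted at `root_name` from the module's top-level functions."""
    module = ast.parse(code)
    defs = {node.name: node for node in module.body if isinstance(node, ast.FunctionDef)}

    def callees(fn: ast.FunctionDef) -> list[str]:
        # Other top-level functions that `fn` calls directly (deduplicated, order preserved).
        found: list[str] = []
        for node in ast.walk(fn):
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id in defs and node.func.id != fn.name
                    and node.func.id not in found):
                found.append(node.func.id)
        return found

    def make(name: str) -> SubFnNode:
        fn = defs[name]
        return SubFnNode(name, ast.unparse(fn), [make(child) for child in callees(fn)])

    return make(root_name)
```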
Subfunction Debugging Process
Overview:
- Illustrated in Figure 2
- Initial test case generation for subfunctions
- Subsequent simulation of code execution
- Pinpointing errors accurately
Subfunction Verification:
- Generate test cases for each subfunction f_i in the tree rooted at f_root using automatic test-generation techniques (Wang et al., 2021; Schäfer et al., 2024; Liu et al., 2023a)
- Leverage the provided public test cases 𝒯_pub of the main function to derive corresponding test cases for each subfunction
- The LLM analyzes how the subfunction is used in the main function and how it contributes to the expected outputs, then infers the subfunction's inputs and outputs for each public test case
- This ensures the generated test cases reflect the intended functionality and remain contextually relevant within the constraints provided by the public test cases.
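A hedged sketch of how this derivation could be prompted; the wording below is illustrative rather than the paper's exact prompt, and `llm` stands for any text-completion callable.

```python
def derive_subfunction_tests(llm, main_code: str, sub_name: str, public_tests: list[str]) -> str:
    """Ask the LLM for assert-style test cases for one subfunction, derived from the public tests."""
    tests_text = "\n".join(public_tests)
    prompt = (
        "You are given a Python program and the public test cases of its main function.\n\n"
        f"Program:\n{main_code}\n\n"
        f"Public test cases:\n{tests_text}\n\n"
        f"Analyze how the subfunction `{sub_name}` is used by the main function on these tests "
        "and how it contributes to the expected outputs. Then infer the inputs it receives and "
        f"the outputs it must produce, and write assert-based test cases for `{sub_name}` only."
    )
    return llm(prompt)  # expected to return a string of `assert sub_name(...) == ...` lines
```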
Debugging Subfunctions with LLM-simulated Execution
Process Overview:
- Use generated test cases to debug each subfunction
- Run on test case inputs, obtain results, compare against expected outcomes
- Fix failed test cases by correcting corresponding subfunction
Debugging High-Level Functions:
- When debugging a high-level function, tracking variable values inside its lower-level subfunctions is often unnecessary
- Relying on an external debugger for this adds overhead and complexity
LLM-Simulated Code Executor:
- Prompts LLM to act as a Python interpreter and track code execution
- Eliminates need for external debugger, offers more flexible and efficient solution
- Accurately identifies errors and their context
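A minimal sketch of the simulated-execution idea, assuming the same placeholder `llm` callable as above; the prompt text is illustrative, not the paper's exact prompt.

```python
def simulate_execution(llm, sub_code: str, failing_test: str) -> str:
    """Prompt the LLM to act as a Python interpreter and trace a failing test case."""
    prompt = (
        "Act as a Python interpreter. Execute the function below on the given test case, "
        "step by step. After each statement, report the values of the important variables, "
        "then state the final return value and whether the test case passes.\n\n"
        f"Function:\n{sub_code}\n\n"
        f"Test case:\n{failing_test}"
    )
    return llm(prompt)  # the returned trace is fed to the repair step to localize the bug
```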
Bottom-up Recursive Debugging Algorithm (MGDebugger):
- Input: an LLM-generated function f and its public test cases 𝒯_pub
- Output: the debugged function f′
- Procedure:
  - If f has subfunctions, traverse the hierarchy depth-first and recursively debug each one, updating f with the corrected subfunctions
  - Generate test cases for f, execute them, and check the results against the expected outputs
  - If all tests pass, return f; otherwise, debug f using the failed-test execution information ℛ_f and return the corrected function f′
- Overall workflow: call MGDebugger on the main function f_root with the public test cases 𝒯_pub; subfunctions are debugged recursively and their fixes are propagated upward through the hierarchy
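Putting the steps together, a condensed Python sketch of the bottom-up procedure. `DebugTools` and its callables (`decompose`, `replace_subfn`, `generate_tests`, `run_tests`, `repair`) are hypothetical stand-ins for the LLM-driven operations described above, not the authors' API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DebugTools:
    """LLM-driven building blocks (hypothetical names; see the steps described above)."""
    decompose: Callable       # code -> list of (name, source) of directly called subfunctions
    replace_subfn: Callable   # (code, name, new_source) -> code with that subfunction swapped in
    generate_tests: Callable  # (code, public_tests) -> test cases for this function
    run_tests: Callable       # (code, tests) -> (all_passed: bool, failed_traces: str)
    repair: Callable          # (code, failed_traces) -> repaired code

def mg_debug(tools: DebugTools, fn_code: str, public_tests: list[str]) -> str:
    """Bottom-up hierarchical debugging: fix subfunctions first, then their caller."""
    for name, sub_source in tools.decompose(fn_code):
        fixed = mg_debug(tools, sub_source, public_tests)        # recurse depth-first
        fn_code = tools.replace_subfn(fn_code, name, fixed)      # propagate the fix upward

    tests = tools.generate_tests(fn_code, public_tests)          # derive tests for this level
    all_passed, failed_traces = tools.run_tests(fn_code, tests)  # LLM-simulated execution
    return fn_code if all_passed else tools.repair(fn_code, failed_traces)
```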
Code Generation and Debugging Experiments
- Three state-of-the-art large language models (LLMs) used for code generation: CodeQwen1.5 (7B), DeepSeek-Coder-V2-Lite (16B), and Codestral (22B)
- Three datasets used: HumanEval, MBPP, HumanEvalFix
- Metrics: Accuracy, Repair Success Rate (RSR)
- Baselines: Simple Feedback, Self-Edit, Self-Debugging (Expl.), Self-Debugging (Trace), LDB (Block/Line/Function), Reflexion, MGDebugger
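For reference, accuracy is the fraction of problems whose final program passes all tests, while the repair success rate (RSR) appears to be computed only over problems whose seed generation failed; the numbers reported below are consistent with that definition. A small bookkeeping sketch:

```python
def accuracy_and_rsr(seed_passed: list[bool], final_passed: list[bool]) -> tuple[float, float]:
    """Final accuracy, and repair success rate over problems whose seed generation failed."""
    n = len(final_passed)
    accuracy = sum(final_passed) / n
    failing = [i for i, ok in enumerate(seed_passed) if not ok]
    rsr = sum(final_passed[i] for i in failing) / len(failing) if failing else 0.0
    return accuracy, rsr

# e.g., a seed accuracy of 76.8% and a final accuracy of 94.5% give
# RSR = 17.7 / 23.2 ≈ 76.3%, matching the ablation table further below.
```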
Results on HumanEval and MBPP Datasets:
| Model | Method | HumanEval Acc. | HumanEval Δ | HumanEval RSR | MBPP Acc. | MBPP Δ | MBPP RSR |
|---|---|---|---|---|---|---|---|
| DeepSeek-Coder-V2-Lite | No-Debugging | 76.8% | N/A | 67.2% | 84.1% | N/A | 71.2% |
| DeepSeek-Coder-V2-Lite | Simple Feedback | 82.3% | +5.5% | 69.4% | 85.4% | +9.2% | 74.0% |
| DeepSeek-Coder-V2-Lite | Self-Edit | 82.9% | +6.1% | 26.3% | 84.1% | +7.9% | 33.3% |
| DeepSeek-Coder-V2-Lite | LDB (Block) | 84.1% | +7.3% | 31.6% | 79.3% | +3.1% | 12.8% |
| DeepSeek-Coder-V2-Lite | Self-Debugging (Expl.) | 87.2% | +10.4% | 44.7% | 87.8% | +11.6% | 48.7% |
| DeepSeek-Coder-V2-Lite | Self-Debugging (Trace) | 86.0% | +9.2% | 39.5% | 84.8% | +8.6% | 35.9% |
| DeepSeek-Coder-V2-Lite | Reflexion | 90.9% | +14.1% | 60.5% | 87.8% | +11.6% | 60.5% |
| MGDebugger | No-Debugging | N/A | N/A | N/A | N/A | N/A | N/A |
| MGDebugger | Simple Feedback | 87.2% | N/A | N/A | 91.5% | +15.3% | +13.4% |
| MGDebugger | Self-Edit | 85.4% | -1.0% | 38.5% | 86.0% | -1.9% | 42.5% |
| MGDebugger | LDB (Block) | 79.3% | +3.1% | 12.8% | 83.5% | +7.9% | 32.5% |
| MGDebugger | Self-Debugging (Expl.) | 87.8% | N/A | N/A | 89.6% | +14.0% | +57.5% |
| MGDebugger | Self-Debugging (Trace) | 84.1% | +8.5% | 35.0% | 82.3% | -1.7% | 27.5% |
| MGDebugger | Reflexion | 86.6% | N/A | 45.0% | 86.6% | N/A | 45.0% |
| CodeQwen1.5 | No-Debugging | 76.2% | N/A | 67.4% | 75.6% | N/A | 65.4% |
| CodeQwen1.5 | Simple Feedback | 85.4% | +9.2% | 38.5% | 88.4% | +12.8% | 52.5% |
| CodeQwen1.5 | Self-Edit | 84.1% | +7.9% | 33.3% | 86.0% | +10.4% | 42.5% |
| CodeQwen1.5 | LDB (Block) | 79.3% | +3.1% | 12.8% | 83.5% | +7.9% | 32.5% |
| CodeQwen1.5 | Self-Debugging (Expl.) | 87.8% | N/A | N/A | 89.6% | +14.0% | +57.5% |
| CodeQwen1.5 | Self-Debugging (Trace) | 84.8% | +8.6% | 35.9% | 82.3% | -2.1% | 27.5% |
| CodeQwen1.5 | Reflexion | 87.8% | N/A | 48.7% | 86.6% | N/A | 45.0% |
| Codestral | No-Debugging | 75.6% | N/A | 65.4% | 75.6% | N/A | 65.4% |
| Codestral | Simple Feedback | 88.4% | +12.8% | 71.6% | 94.5% | +18.9% | 76.8% |
| Codestral | Self-Edit | 86.0% | N/A | 42.5% | 86.0% | N/A | 30.0% |
| Codestral | LDB (Block) | 83.5% | +7.9% | 32.5% | 79.5% | -3.1% | 30.0% |
| Codestral | Self-Debugging (Expl.) | 89.6% | N/A | N/A | 94.5% | +14.0% | +39.0% |
| Codestral | Self-Debugging (Trace) | 82.3% | -3.7% | 27.5% | 78.8% | -3.6% | 22.3% |
| Codestral | Reflexion | 86.6% | N/A | N/A | 94.5% | +11.0% | +34.4% |
Key Findings:
- MGDebugger outperforms existing approaches: consistently achieves higher accuracy improvements on HumanEval (+15.3% to +18.9%) and MBPP (+11.4% to +13.4%) compared to Self-Debugging (Expl.) and Reflexion.
- Strong results across various models: demonstrates adaptability to different LLM architectures, achieving high accuracy with DeepSeek-Coder-V2-Lite (16B) and Codestral (22B).
- Impressive zero-shot performance: operates in a zero-shot setting without task-specific retraining, demonstrating inherent debugging ability of larger LLMs.
- Robust repair success rates: achieves up to 41.1% RSR on MBPP even with smaller models such as CodeQwen1.5 (7B).
Method Description:
- MGDebugger is a highly effective and scalable method for debugging LLM-generated code.
- It consistently outperforms existing debugging approaches across various models and datasets, including HumanEval and MBPP.
- The method demonstrates strong results with DeepSeek-Coder-V2-Lite (16B) and Codestral (22B), reaching an accuracy of 94.5% on the HumanEval dataset.
- MGDebugger operates in a zero-shot setting without task-specific retraining, illustrating its inherent debugging ability in larger LLMs.
Ablation Study Results for DeepSeek-Coder-V2-Lite Model
Table 2: Ablation study results for DeepSeek-Coder-V2-Lite on HumanEval and MBPP datasets.
| Method | HumanEval Acc. | Δ | RSR | MBPP Acc. | Δ | RSR |
|---|---|---|---|---|---|---|
| MGDebugger | 94.5% | +17.7% | 76.3% | 80.0% | +12.8% | 39.0% |
| w/o Hierarchical Debugging | 89.0% | +12.2% | 52.6% | 78.2% | +11.0% | 33.5% |
| w/o Simulated Execution | 90.2% | +13.4% | 61.3% | 79.2% | +12.0% | 36.6% |
| w/o Test Case Generation | 90.9% | +14.1% | 60.5% | 79.2% | +12.0% | 36.6% |
| No-Debugging | 76.8% | - | - | 67.2% | - | - |
Components and Their Impact on MGDebugger:
- Hierarchical Debugging Strategy: The most impactful component, removing it drops the repair success rate significantly from 76.3% to 52.6% for HumanEval and from 39.0% to 33.5% for MBPP.
- LLM-Simulated Execution and Test Case Generation for Subfunctions: Facilitate debugging the decomposed code, resulting in substantial improvements in accuracy and repair success rates.
Experiments on HumanEvalFix Dataset
- Assessing versatility and effectiveness of MGDebugger using HumanEvalFix dataset
- Six distinct bug categories: value misuse, missing logic, excess logic, operator misuse, variable misuse, function misuse
Results for Different Methods (repair success rates, %; backbone: DeepSeek-Coder; value misuse not reported in the per-category columns)
| Method | Missing Logic | Excess Logic | Operator Misuse | Variable Misuse | Function Misuse | Overall |
|---|---|---|---|---|---|---|
| DeepSeek-Coder-V2-Lite | 85.4 | 80.7 | 78.3 | 86.4 | 87.5 | 85.4 |
| Self-Edit | 82.3 | 62.5 | 80.7 | 84.1 | 62.5 | 82.3 |
| LDB (Block) | 81.1 | 96.0 | 74.2 | 86.4 | 62.5 | 81.1 |
| LDB (Line) | 74.4 | 52.4 | 56.5 | 54.6 | 62.5 | 74.4 |
| LDB (Function) | 76.8 | 51.2 | 87.0 | 73.9 | 59.1 | 76.8 |
| Self-Debugging (Expl.) | 78.7 | 62.5 | 69.6 | 77.3 | 78.7 | 74.4 |
| Self-Debugging (Trace) | 79.3 | 75.0 | 80.6 | 69.6 | 70.5 | 73.2 |
| Reflexion | 81.5 | 100.0 | 80.6 | 91.3 | 75.0 | 81.7 |
| MGDebugger | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 97.6 |
Conclusion:
- MGDebugger outperforms other methods with significantly higher overall repair success rates in the HumanEvalFix experiments.
- Achieves remarkable repair success rate of 97.6% using DeepSeek-Coder, with 100% success rates in all bug categories except for value misuse.
- Strong advantage in debugging bottom-level bugs like missing logic and excess logic compared to other methods due to hierarchical decomposition.
MGDebugger's Performance Compared to Other Methods
- Refer to Figure 3: Repair success rate of different methods on code snippets of varying lengths using DeepSeek-Coder.
- MGDebugger outperforms other methods in long codes.
- Figure 4: Impact of debugging attempts on cumulative repair success rates for MGDebugger and other methods on HumanEvalFix with DeepSeek-Coder.
- MGDebugger improves with more debug attempts and achieves highest success rate.
Assessing Versatility of MGDebugger in Debugging Code of Different Lengths
- Code length: Correlates with complexity and debugging challenges.
- Problems are split into three equal-sized groups (short, medium, and long) by code length on the HumanEvalFix dataset.
- Results are presented in Figure 3.
- As code length increases, most methods perform poorly due to increased complexity.
- MGDebugger consistently outperforms other methods across different code lengths, especially in longer and more complex snippets.
- Demonstrates scalability and robustness of MGDebugger for handling varying lengths and complexities.
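A small sketch of how such a three-way split could be made, assuming the groups are formed by sorting problems by line count and cutting into equal-sized bins (the `buggy_code` key is illustrative; the exact grouping criterion is not spelled out here).

```python
def split_by_length(problems: list[dict]) -> dict[str, list[dict]]:
    """Partition problems into short/medium/long groups of roughly equal size."""
    ranked = sorted(problems, key=lambda p: len(p["buggy_code"].splitlines()))
    k = len(ranked) // 3
    return {"short": ranked[:k], "medium": ranked[k:2 * k], "long": ranked[2 * k:]}
```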
Additional Results on Other Datasets:
- Available in Appendix D of the paper.
- Consistently outperforms other methods across different code lengths on those datasets as well.
Impact of Debug Attempts
MGDebugger's Ability:
- Improves over successive LLM debugging iterations
- Achieves the highest cumulative RSR among all methods on the HumanEvalFix dataset with DeepSeek-Coder
- Outperforms other methods from the first attempt and continues to improve with additional attempts
Comparison to Other Methods:
- Most methods plateau after a few debug attempts
- MGDebugger and Reflexion continue to improve with more iterations
- Highlights MGDebugger's potential for iterative and comprehensive debugging
- Promising solution for complex and challenging code repair tasks.
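A sketch of how the cumulative repair success rate per attempt could be tallied, assuming a problem counts as repaired at attempt k if any of its first k attempts produced a passing program (the data layout is illustrative).

```python
def cumulative_rsr(attempt_passed: list[list[bool]]) -> list[float]:
    """attempt_passed[i][k] is True iff problem i passes after debug attempt k (0-indexed).
    Returns, for each k, the fraction of problems repaired within the first k + 1 attempts."""
    n = len(attempt_passed)
    max_attempts = max(len(attempts) for attempts in attempt_passed)
    return [sum(any(attempts[: k + 1]) for attempts in attempt_passed) / n
            for k in range(max_attempts)]
```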
MGDebugger vs Baseline Methods in Code Debugging
Key Findings:
- MGDebugger successfully identifies and corrects bugs, unlike baseline methods that introduce new bugs
- This is demonstrated through debugging code snippets from HumanEvalFix dataset using DeepSeek-Coder as the backbone LLM
- The original buggy solution used incorrect Collatz update logic (n = n × 2 + 1 instead of n = n × 3 + 1 for odd n)
- Baseline methods corrected this update but still missed the final "1" of the Collatz sequence, because they filtered odd numbers first and appended the current value to the results only after updating n
- MGDebugger decomposed problem into distinct subfunctions: sequence generation and odd number filtering, ensuring comprehensive error correction
- MGDebugger's ability to handle complex problems more effectively and restructure code for clarity and correctness demonstrates its potential in improving LLM-generated code quality.
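A hedged reconstruction of the kind of decomposition described above, assuming the task is HumanEval's get_odd_collatz (return the sorted odd numbers of the Collatz sequence starting at n); the helper name is illustrative.

```python
def generate_collatz_sequence(n: int) -> list[int]:
    """Full Collatz sequence starting at n, including the terminal 1."""
    seq = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1   # the seed bug used 2 * n + 1 here
        seq.append(n)
    return seq

def get_odd_collatz(n: int) -> list[int]:
    """Sorted odd numbers appearing in the Collatz sequence of n."""
    return sorted(x for x in generate_collatz_sequence(n) if x % 2 == 1)

assert get_odd_collatz(5) == [1, 5]   # sequence: 5 -> 16 -> 8 -> 4 -> 2 -> 1
```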
Conclusion
MGDebugger:
- Novel hierarchical code debugging framework
- Fixes bugs at multiple levels of granularity
- Decomposes complex code into a hierarchical structure
- Generates targeted test cases and employs LLM-simulated execution
- Identifies and fixes bugs ranging from syntax errors to logical flaws in a bottom-up manner
Performance:
- Superior performance over existing methods, particularly in handling complex logical errors and longer code snippets
Future Work:
- Handling more complex bugs and code structures: extending MGDebugger to multi-file projects and codebases with multiple dependencies.
- Combining hierarchical generation and debugging: building on the foundation of Parsel (Zelikman et al., 2023) toward end-to-end code generation and debugging systems.
- Integrating MGDebugger into self-training systems: Correcting outputs from base models, then retraining the base models with the corrected data to potentially improve performance iteratively (Gulcehre et al., 2023).