This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

[ICLR 2024]: Is Self-Repair a Silver Bullet for Code Generation?

This is the accompanying repository for the paper Is Self-Repair a Silver Bullet for Code Generation?, presented at the Twelfth International Conference on Learning Representations (Vienna, May 2024). It contains the source code used to run the experiments, the resulting data, and scripts to replicate the data analysis and figures from the paper.

To install the libraries needed to run the code and analysis scripts, you can use pip install -r requirements.txt.

TL;DR: Replicating the Figures

All figures in the paper can be replicated by running cd paper && make figures. This uses pre-computed results of the data analysis and places the figures in paper/figures/. If you instead want to redo all of the data analysis from scratch, run cd paper && APPS_DIR=<path to my APPS directory> make all; note that this requires having APPS installed locally.

N.B.: This repository does not contain the data collected during the human study, due to IRB policies.

Bibtex Citation

@inproceedings{olausson2024repair,
	title        = {Is Self-Repair a Silver Bullet for Code Generation?},
	author       = {Theo X. Olausson and Jeevana Priya Inala and Chenglong Wang and Jianfeng Gao and Armando Solar-Lezama},
	year         = 2024,
	booktitle    = {International Conference on Learning Representations (ICLR)}
}

A Note on HumanEval

Note: the below only applies if you want to use this code base to run new self-repair experiments on HumanEval yourself. You do not need to worry about this if you are merely interested in replicating the figures and results from this paper.

This code base uses a modified version of HumanEval, in which it is easier to extract error messages from failed assertions. This can be downloaded from people.csail.mit.edu/theoxo/data/HumanEval_with_assertion_messages.jsonl.gz.gpg; you can then decrypt it with gpg -d using the password theoxoiclr2024 and unpack it with gunzip, after which it can be used as a drop-in replacement for HumanEval.jsonl in your local installation of HumanEval.
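
As a quick sanity check before swapping the file in, the decrypted archive can be read directly from Python without unpacking it first. This is a minimal sketch, not part of the repository: the helper name load_jsonl_gz is hypothetical, and it assumes the file has already been decrypted with gpg -d to a local .jsonl.gz file.

```python
import gzip
import json
from typing import List


def load_jsonl_gz(path: str) -> List[dict]:
    """Parse a gzip-compressed .jsonl file into a list of dicts, one per line.

    Hypothetical helper: assumes `path` points at the already-decrypted
    .jsonl.gz file, and that blank lines (if any) can be skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Calling len() on the result, or inspecting the keys of the first entry, gives a rough check that the decrypted file is intact before it replaces HumanEval.jsonl.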

A Note on APPS

Note: the below only applies if you want to use this code base to run new self-repair experiments on APPS yourself. You do not need to worry about this if you are merely interested in replicating the figures and results from this paper.

Due to dependencies on an internal project, one function (exec_sample) has been left unimplemented in src/apps/apps.py. If you want to make use of the APPS part of the source code, you must implement this function; see the doc-string for pointers.

Repository Structure

  • src/: source code used to run the experiments.
    • apps/: source code for experiments on APPS.
    • humaneval/: source code for experiments on HumanEval.
  • paper/: data and scripts used to analyze and plot the results of the experiments.
    • Makefile: makefile to reproduce the figures (make figures), run the analysis scripts (make analysis), or both (make all).
    • analysis/sample-and-estimate.py: Python script to generate bootstrapped estimates of pass rates at various budgets.
    • data/:
      • calculate-token-counts.py: Python script to add counts for how many tokens were used to generate the programs/feedback/repairs. Used for pass@t metrics in Appendix A.
      • apps/: data from APPS experiments, with bash scripts to analyze the data and plot the results.
      • humaneval/: data from HumanEval experiments, with bash scripts to analyze the data and plot the results.
    • plotting/: Python scripts to generate the types of figures used in the paper.
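
To give a feel for what a bootstrapped pass-rate estimate at a given budget involves, here is a minimal sketch. It is not the code in analysis/sample-and-estimate.py, which may work quite differently; the function name, its parameters, and the input layout (a dict mapping each task to a non-empty list of pass/fail outcomes) are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple


def bootstrap_pass_rate(
    outcomes: Dict[str, List[bool]],
    budget: int,
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Estimate the fraction of tasks solved when drawing `budget` samples each.

    Hypothetical helper. For every resample, draw `budget` outcomes per task
    with replacement and count the task as solved if any draw passed.
    Returns (mean, 2.5th percentile, 97.5th percentile) over all resamples.
    Assumes every task has at least one recorded outcome.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        solved = [
            any(rng.choices(results, k=budget))  # sampling with replacement
            for results in outcomes.values()
        ]
        estimates.append(sum(solved) / len(solved))
    estimates.sort()
    lower = estimates[int(0.025 * (n_resamples - 1))]
    upper = estimates[int(0.975 * (n_resamples - 1))]
    return sum(estimates) / len(estimates), lower, upper
```

With one task that always passes and one that never does, for example bootstrap_pass_rate({"a": [True] * 5, "b": [False] * 5}, budget=3), every resample solves exactly half the tasks, so the function returns (0.5, 0.5, 0.5).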

Data Format

The data generated by the models can be found by decompressing the tarballs paper/data/apps/apps-data.tar.bz2 and paper/data/humaneval/humaneval-data.tar.bz2. Data files are in .jsonl format: each line is a valid JSON serialization of one record. Each record contains the following fields:

  • prob_path/task_id (for APPS and HumanEval, respectively): the identifier for the particular problem/task.
  • completions: a list of objects, each with the following fields:
    • original_completion: the completion before any processing
    • executed_completion: the completion after processing/execution
    • tokens_generated: the total number of tokens for the (executed) completion
    • binary: (poorly named) boolean flag for whether this completion passed the tests or not
    • fault: passed if the completion passed, otherwise the error message received
    • errors: a list of execution results for each unit test, in order (APPS only)
    • repairs: if the completion passed, null. Otherwise, a list of items much like the completions, except also equipped with an explanation field (which, in the case of modelX+modelY results, is generated separately by modelX). Note that for repairs, tokens_generated counts both the program and the explanation (all text preceding it).
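
The fields above can be read with a few lines of Python. The following is a minimal sketch rather than code from the repository: the helper names iter_records and summarize are hypothetical, and it assumes that each repair item carries a binary flag with the same meaning as on completions (the description says repairs are "much like the completions", but this has not been checked against the data).

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def summarize(record: dict) -> dict:
    """Count initial passes and successful repairs for a single task record.

    Assumption: repair items expose a `binary` pass/fail flag, as
    completions do. `repairs` is null (None) for completions that passed.
    """
    completions = record["completions"]
    passed = sum(1 for c in completions if c["binary"])
    repaired = sum(
        1
        for c in completions
        if not c["binary"] and any(r["binary"] for r in (c["repairs"] or []))
    )
    # HumanEval records use task_id; APPS records use prob_path.
    task = record.get("task_id", record.get("prob_path"))
    return {"task": task, "n": len(completions), "passed": passed, "repaired": repaired}
```

Mapping summarize over iter_records(...) for one of the extracted data files gives a per-task count of completions that passed outright and of failed completions with at least one passing repair.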

In addition to these tarballs, there are also tarballs with the -raw suffix. These are identical to the above, but also contain some auxiliary fields which are irrelevant for the final analysis but were used during debugging and while running these large experiments. Any auxiliary fields present in these raw tarballs should be considered legacy and possibly inaccurate.