We set up a task where, given a specification, the goal is to produce an implementation of it. Specifically, we are interested in converting library specifications into implementations (i.e., repositories). Below, we lay out the steps to create a spec2repo example and evaluate it with the SWE-bench framework.
First, install the required packages:
pip install -r requirements.txt
Provide the following information for each repository in a YAML file:
repos.yml
0: # just an index
  name: [repository_name] # in the form of {organization_name}/{library_name}
  commit: [commit_sha]
  tag: [version_tag]
  setup:
    - [command_1]
    - [command_2]
    - ...
There are two options for specifying the version of the library: provide either a specific commit or a specific tag, but not both at the same time. Finally, include the commands that set up the library from a local repository.
For example, to create an example for msiemens/tinydb at version 4.8.0:
repos.yml
0:
  name: "msiemens/tinydb"
  commit: null
  tag: "v4.8.0"
  setup:
    - "python -m pip install --upgrade pip twine"
    - "pip install poetry"
    - "poetry install"
We are now ready to generate the dataset. Before that, add your GitHub token to the environment:
export GITHUB_TOKEN=[github_token]
Now run:
python create-data/build_dataset.py repos.yml --hf_name wentingzhao/spec2repo
where repos.yml is the file we specified above, and wentingzhao/spec2repo is where you want to upload the dataset on Hugging Face.
This command produces the base commit (with function bodies removed), the gold patch that passes all unit tests, and all test function names.
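To sanity-check the upload, you can load the dataset back from the Hub and inspect an example. A minimal sketch, assuming the datasets library; the exact column names depend on the build script:
from datasets import load_dataset

# pull the generated dataset back down and inspect its fields
ds = load_dataset("wentingzhao/spec2repo", split="train")
print(ds.column_names)  # e.g., base commit, gold patch, test names
print(ds[0])            # the first spec2repo example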
Note that this script will create a fork of the library. The fork will be created under the spec2repo organization. You can change the organization to somewhere else, but if you want to create a fork under spec2repo, please contact Wenting Zhao to be added to the organization.
Now that the dataset has been generated, we move on to using SWE-bench to perform an evaluation. First, follow the instructions in the Docker setup guide to install Docker on your machine. If you're setting up on Linux, we recommend following the post-installation steps as well.
To install SWE-bench:
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
Now, let's add a YAML configuration file that builds a Docker environment for the library:
configs/specs.yml
spec2repo/tinydb:
  "1.0":
    python: 3.11
    install: "python -m pip install --upgrade pip twine; pip install poetry; poetry install"
    test_cmd: "pytest"
To adapt this for your own library, leave the "1.0" unchanged, specify the Python version with python, how to locally build the library with install, and how to run tests with test_cmd.
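For reference, these fields can be read back with PyYAML. A minimal sketch, assuming the layout above:
import yaml

# load the per-library spec and pull out the three required fields
with open("configs/specs.yml") as f:
    specs = yaml.safe_load(f)

spec = specs["spec2repo/tinydb"]["1.0"]
print(spec["python"])    # Python version, e.g. 3.11
print(spec["install"])   # commands to build the library locally
print(spec["test_cmd"])  # command to run the test suite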
You also need to write your own function to process the test logs. Please add your function to configs/log_parsers.py. The function should take in the log content as a string and return a dictionary that maps each test function to its test status, such as passed or failed. After that, update the global variable ADD_MAP_REPO_TO_PARSER.
configs/log_parsers.py
import re

# TestStatus is assumed to come from the SWE-bench harness constants
from swebench.harness.constants import TestStatus


def parse_log_tinydb(log: str) -> dict[str, str]:
    """
    Parser for test logs generated with the TinyDB framework

    Args:
        log (str): log content
    Returns:
        dict: test case to test status mapping
    """
    test_status_map = {}
    # matches pytest verbose lines such as "tests/test_x.py::test_y PASSED [ 50%]"
    pattern = r"^(.*\/.*)::(.*)\s+\w+\s+\[\s*(\d+%)\]$"
    for line in log.split("\n"):
        line = line.strip()
        m = re.match(pattern, line)
        if m:
            # first two whitespace-separated tokens are the test id and its status
            test, value = line.split()[:2]
            if value == "PASSED":
                test_status_map[test] = TestStatus.PASSED.value
            else:
                test_status_map[test] = TestStatus.FAILED.value
    return test_status_map
ADD_MAP_REPO_TO_PARSER = {
"spec2repo/tinydb": parse_log_tinydb,
}
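You can check the parser against a small sample before running a full evaluation. A minimal sketch; the sample lines mimic pytest's verbose output:
# feed the parser two fake pytest lines and inspect the mapping
sample_log = """
tests/test_tinydb.py::test_insert PASSED [ 50%]
tests/test_tinydb.py::test_remove FAILED [100%]
"""
print(parse_log_tinydb(sample_log))
# expected: {'tests/test_tinydb.py::test_insert': 'PASSED',
#            'tests/test_tinydb.py::test_remove': 'FAILED'}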
Finally, run the evaluation for the created example using the gold patch with the following script:
python run.py \
--dataset_name wentingzhao/spec2repo \
--split train \
--max_workers 2 \
--predictions_path 'gold' \
--instance_ids spec2repo__tinydb-01 \
--run_id validate-gold \
--spec_config configs/specs.yml
A simple baseline evaluation can be described like this:
def run_baseline(base_model, agent, prompt, context, target, edit_history):
    # returns (test_results, error_message)
    pass
Input
- base_model: base LLM, e.g. gpt-4o, claude-3-5-sonnet-20240620
- agent: agent, e.g. aider, opendevin, None
- prompt: the prompt/instruction given to the agent/base_model
- context: there are 3 types of context
  - context-type-1: reference doc/pdf/website
  - context-type-2: unit tests that the target will be tested with
  - context-type-3: repo info
    - skeleton of the repo (filenames under each dir)
    - function stubs
    - function names in each file (granularity needs to be specified)
- target: target function or file for the agent or base_model to complete
- edit_history: entire edit history; each entry contains the previous implementation, the updated implementation, and the corresponding error message

Output
- test_results: WIP
- error_message: WIP
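To make the interface concrete, here is one way a single iteration of the baseline loop could look. A hedged sketch: call_agent, apply_edit, and run_tests are hypothetical placeholders, not functions in this repo:
def run_baseline(base_model, agent, prompt, context, target, edit_history):
    # hypothetical sketch of one edit-and-test iteration
    implementation = call_agent(agent, base_model, prompt, context, target)
    apply_edit(target, implementation)
    test_results, error_message = run_tests(target)
    # record this attempt so the next iteration can see past errors
    edit_history.append({
        "previous": edit_history[-1]["updated"] if edit_history else None,
        "updated": implementation,
        "error": error_message,
    })
    return test_results, error_message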
There are mainly 3 axes:
- different base_model
- different agent
- different context

The current priority is to run gpt-4o + aider with a certain context to get the first baseline result.