Two contracts: what you feed Repo2RLEnv (input) and what comes out (output). Same input shape across every pipeline; pipeline-specific knobs go under pipeline.options (the "kwargs"). Output is pure Harbor with a namespaced [metadata.repo2env] extension.
The single root model is src/repo2rlenv/spec/input.py:GenerationInput. The CLI is a thin shim that builds this object from flags and/or a YAML/TOML config file.
```python
class GenerationInput(BaseModel):
    spec_version: Literal["0.1.0"] = "0.1.0"
    repo: RepoSpec                        # source repository
    pipeline: PipelineSpec                # synthesis method + kwargs
    llm: LLMSpec                          # model driving synthesis
    output: OutputSpec                    # where the dataset lands
    qa: QASpec = QASpec()                 # quality gate (default: diff_parse only for lite)
    sandbox: SandboxSpec = SandboxSpec()  # execution backend (default: none for lite)
    auth: AuthSpec = AuthSpec()           # secret references, resolved from env
```

| Model | Required fields | Notes |
|---|---|---|
| RepoSpec | url | access ∈ {public, private, auto}, optional auth_token_env, ref defaults to HEAD |
| PipelineSpec | name, options | name is an enum (see pipelines/); options is validated against the named pipeline's Options model with extra="forbid" |
| LLMSpec | provider, model | provider/model resolves to a LiteLLM identifier; supports endpoint for self-hosted vLLM/Ollama |
| OutputSpec | destination, org, dataset_name | destination is a local path; publish separately via repo2rlenv push |
| QASpec | (none) | Defaults to [diff_parse] for the lite path; full pipelines opt into [determinism, oracle_consistency, llm_judge, false_negative] |
| SandboxSpec | (none) | See "Sandbox model" below: none for lite, harbor for full pipelines (delegates), local/e2b for lite consumer-side runners |
| AuthSpec | (none) | Names of env vars only; values never stored |
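The extra="forbid" behaviour on pipeline options can be illustrated with a stdlib-only stand-in. This is a sketch, not the project's pydantic model: `PrDiffOptions`, `parse_options`, and the defaults are hypothetical, and only the field names `limit` and `skip_drafts` come from the configuration example in this document.

```python
from dataclasses import dataclass, fields


@dataclass
class PrDiffOptions:
    """Hypothetical options model for the pr_diff pipeline."""
    limit: int = 10           # assumed default, not taken from the source
    skip_drafts: bool = True  # assumed default, not taken from the source


def parse_options(model, raw: dict):
    """Build `model` from `raw`, rejecting keys the model does not declare."""
    allowed = {f.name for f in fields(model)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        # Mirrors the effect of extra="forbid": a typo fails loudly
        # instead of being silently ignored.
        raise ValueError(f"unknown option(s) for {model.__name__}: {unknown}")
    return model(**raw)


opts = parse_options(PrDiffOptions, {"limit": 5})
```

Rejecting unknown keys means a misspelled option such as `limt` surfaces as an error at config-load time rather than as a silently ignored setting.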
Flag form:
```bash
# Generate locally, then push as two explicit steps
repo2rlenv generate \
  --repo huggingface/trl \
  --pipeline pr_diff \
  --pipeline-opt limit=5 \
  --llm anthropic/claude-sonnet-4-6 \
  --out ./datasets/trl-r2e-v0-1
repo2rlenv push ./datasets/trl-r2e-v0-1 <your-org>/trl-r2e-v0-1
```

Config-file form: --config <path> accepts YAML or TOML, format auto-detected by extension. CLI flags override file fields:
```yaml
spec_version: "0.1.0"
repo:
  url: "huggingface/trl"
  access: "auto"
pipeline:
  name: "pr_diff"
  options:
    limit: 5
    skip_drafts: true
llm:
  provider: "anthropic"
  model: "claude-sonnet-4-6"
output:
  destination: "./datasets/trl-r2e-v0-1"
  org: "<your-org>"
  dataset_name: "trl-r2e-v0-1"
  visibility: "public"
```

Drop this into a file (e.g. repo2rlenv.config.yaml) and run with --config <path>. Publishing is a separate step via repo2rlenv push.
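The rule that CLI flags override file fields can be sketched as a recursive merge. The CLI's actual precedence logic is not shown in this document, so treat the key-by-key merging of nested fields as an assumption; `merge_overrides` is a hypothetical helper.

```python
from copy import deepcopy


def merge_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested tables merge key by key so sibling fields survive.
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


file_config = {
    "pipeline": {"name": "pr_diff", "options": {"limit": 5, "skip_drafts": True}},
    "llm": {"provider": "anthropic", "model": "claude-sonnet-4-6"},
}
# Equivalent of passing `--pipeline-opt limit=20` on the command line.
cli_overrides = {"pipeline": {"options": {"limit": 20}}}

effective = merge_overrides(file_config, cli_overrides)
```

Merging per key rather than replacing whole tables keeps `skip_drafts` from the file while `limit` takes the command-line value.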
Every pipeline emits standard Harbor task directories. Repo2RLEnv-specific provenance goes into a namespaced subtable inside Harbor's existing [metadata].
```
<dataset>/<task_id>/
├── task.toml               # Harbor-native + [metadata.repo2env]
├── instruction.md          # natural-language prompt
├── solution/patch.diff     # oracle (lite), for diff_similarity scoring
├── environment/Dockerfile  # OPTIONAL: only emitted by sandbox-required pipelines
└── tests/test.sh           # OPTIONAL: only emitted by sandbox-required pipelines
```
For the lite pipeline (pr_diff), only the first three exist. No Docker, no test script — verification is purely diff-similarity against the oracle.
```toml
version = "1.0"

[task]
name = "huggingface__trl-5705"
org = "<your-org>"
description = "..."

[metadata]
difficulty = "medium"
category = "bugfix"

[metadata.repo2env]
spec_version = "0.2.0"
pipeline = "pr_diff"
pipeline_version = "0.1.0"
repo = "huggingface/trl"
ref = "f39373edcd7a..."  # base commit SHA
reference = "https://github.com/huggingface/trl/pull/5705"
source_access = "public"
built_at = "2026-05-06T..."
synthesis_llm = "anthropic/claude-sonnet-4-6"
content_hash = "sha256:..."
reward_kinds = ["diff_similarity"]

[metadata.repo2env.pr_diff]
pr_merged_at = "2026-05-05T13:46:07Z"
diff_format = "unified"
context_files = ["trl/trainer/dpo_trainer.py", ...]

# v0.2.0+ only: sandbox-required tasks (pr_runtime / mutation_bugs / ...)
# carry this subtable so consumers know exactly what they're getting.
[metadata.repo2env.reproducibility]
mode = "registry"  # registry | inline_dockerfile | local_only
image_ref = "ghcr.io/huggingface/r2e-bootstrap-pallets-click@sha256:..."
image_tag = "ghcr.io/huggingface/r2e-bootstrap-pallets-click:a1b2c3d4e5f6-7d8e9f01"
image_visibility = "public"  # public | private | unknown
pushed_at = "2026-05-19T11:30:00Z"
pushed_by = "huggingface"
# Inline-mode-only fields (omitted in registry mode):
# inline_recipe_sha256 = "sha256:..."
# inline_recipe_lines = 47
# inline_recipe_source = "agent_replay"  # or "user_dockerfile"
# fallback_reason = "no working registry credentials (ghcr.io: L2 auth failed)"

[agent]
timeout_sec = 1800.0

[verifier]
timeout_sec = 300.0
```

Each pipeline writes its own subtable under [metadata.repo2env.<name>] carrying provenance specific to how the task was made. See the per-pipeline docs for the schema.
When pushed via repo2rlenv push, the dataset on the Hub looks like:
```
huggingface.co/datasets/<owner>/<name>/
├── README.md      # auto-generated dataset card
├── registry.json  # Harbor's legacy registry format, pinned to a commit SHA
└── tasks/
    └── <task_id>/...
```
registry.json lets any Harbor consumer pull tasks directly:
```bash
harbor download <dataset-name> \
  --registry-url https://huggingface.co/datasets/<owner>/<name>/resolve/main/registry.json
```

Implementation: src/repo2rlenv/hub.py:push_to_hub.
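For scripted consumers, the registry URL and the download invocation shown above can be assembled programmatically. This is a sketch: `registry_url` and `download_command` are hypothetical helpers, and the `revision` parameter assumes the Hub's usual resolve/<revision>/ path layout.

```python
def registry_url(owner: str, name: str, revision: str = "main") -> str:
    """Build the registry.json URL for a pushed dataset."""
    return (
        f"https://huggingface.co/datasets/{owner}/{name}"
        f"/resolve/{revision}/registry.json"
    )


def download_command(dataset_name: str, owner: str, name: str) -> list[str]:
    """Argument list for the `harbor download` invocation shown above."""
    return [
        "harbor", "download", dataset_name,
        "--registry-url", registry_url(owner, name),
    ]
```

Returning an argument list rather than a shell string lets the caller pass it to `subprocess.run` without quoting concerns.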
Repo2RLEnv has no sandbox abstraction of its own. Generation-time execution and consumption-time execution both go through external tools:
| Phase | Pipeline class | What runs the code |
|---|---|---|
| Generation | Lite (text-only, e.g. pr_diff) | Nothing; pure text manipulation |
| Generation | Full (pr_runtime, mutation_bugs, etc.) | Harbor's sandbox layer (harbor invoked under the hood) |
| Consumption | Lite | from repo2rlenv.reward import calculate_diff_similarity_reward (pure Python, no sandbox) |
| Consumption | Full | harbor run -d <dataset> -e <modal\|daytona\|e2b\|local\|runloop> ... |
SandboxSpec exists to describe what the pipeline needs (provider, GPU, network), and at gen-time we lower it onto Harbor's flags. We don't ship a parallel runner. This keeps the surface area small — Harbor already handles GPU, multi-container, parallelism, and provider auth.
```python
class GPUSpec(BaseModel):
    count: int = 1
    kind: Literal["any", "a10g", "a100", "h100", "l4", "t4"] = "any"
```

GPU is only meaningful for sandbox-required pipelines on ML repos. For example, mining huggingface/trl with full pr_runtime will skip most interesting PRs unless the verifier sandbox has a GPU, because the trainer tests require CUDA.
Lite pipelines never use this field. When set on a harbor-provider sandbox, we pass it through to the Harbor backend's GPU config (Modal A100 / H100 / etc.).
The reward_kinds key in [metadata.repo2env] is a list naming the reward types this task supports. Two are defined for v0.1:
| Kind | What it is | Where the oracle lives |
|---|---|---|
| diff_similarity | SWE-RL-style sequence similarity between predicted and oracle unified diffs (returns float ∈ [0,1]) | solution/patch.diff |
| test_execution | Shell verifier writes a float to /logs/verifier/reward.txt | tests/test.sh |
A task may emit both. The lite pipeline emits only diff_similarity; full sandbox-required pipelines emit test_execution (and may also emit diff_similarity if they capture the oracle as a diff).
The diff-similarity reward function is implemented at src/repo2rlenv/reward.py:calculate_diff_similarity_reward — pure stdlib (difflib.SequenceMatcher), Apache-2.0, no SWE-RL CC-BY-NC code vendored.
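A minimal sketch of a SequenceMatcher-based diff reward follows. It is not the project's `calculate_diff_similarity_reward`, whose exact signature and preprocessing are not shown here; the function name and the empty-input rule are assumptions.

```python
from difflib import SequenceMatcher


def diff_similarity(predicted: str, oracle: str) -> float:
    """Similarity of two unified diffs as a float in [0, 1]."""
    if not predicted or not oracle:
        # Assumption: an empty prediction or oracle earns no reward.
        return 0.0
    # autojunk=False disables the popularity heuristic, which can skew
    # ratios on long inputs with many repeated characters.
    return SequenceMatcher(None, predicted, oracle, autojunk=False).ratio()
```

`ratio()` returns 2·M/T, where M is the number of matched characters and T is the combined length of both inputs, so identical diffs score 1.0 and diffs with nothing in common score 0.0.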
Sandbox-required tasks (pr_runtime, mutation_bugs, …) ship an environment/Dockerfile whose FROM <ref> line points at the bootstrap image — the working Docker environment for the source repo. At generate time the ref is local/r2e-bootstrap/... (un-pullable from any other machine). repo2rlenv push rewrites it in-place to one of two reproducible forms:
| Mode | FROM ref | Reproducibility |
|---|---|---|
| registry | ghcr.io/<owner>/r2e-bootstrap-<slug>@sha256:... (or ECR / ACR / GCP AR / Docker Hub equivalent) | Bit-exact: registry digest is immutable |
| inline_dockerfile | full apt-get / pip / ... recipe baked into environment/Dockerfile, no FROM <registry> reference | Recipe-level: assumes mirrors stay stable; rebuilds from scratch on every harbor run |
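The registry-mode rewrite amounts to swapping the local bootstrap ref on the FROM line for a pullable one. The sketch below assumes a single FROM line whose ref begins with local/r2e-bootstrap/; `rewrite_from` is a hypothetical helper, and the real push command may handle more cases (multi-stage builds, `--platform` flags).

```python
import re

# Assumption: the bootstrap image is referenced by one FROM line whose ref
# starts with "local/r2e-bootstrap/". Anything after the ref (for example
# "AS base") is preserved.
_LOCAL_FROM = re.compile(r"^(FROM\s+)local/r2e-bootstrap/\S+(.*)$", re.MULTILINE)


def rewrite_from(dockerfile: str, registry_ref: str) -> str:
    """Point the local bootstrap FROM line at a pullable registry ref."""
    new_text, count = _LOCAL_FROM.subn(
        # A function replacement avoids backslash-escape surprises in the ref.
        lambda m: f"{m.group(1)}{registry_ref}{m.group(2)}",
        dockerfile,
    )
    if count != 1:
        raise ValueError(
            f"expected exactly one local bootstrap FROM line, found {count}"
        )
    return new_text
```

Failing when the count is not exactly one keeps a task from being published with a FROM line that still points at an image only the generating machine can resolve.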
The mode that was chosen is recorded in [metadata.repo2env.reproducibility] (see the task.toml example above), along with pushed_at, pushed_by, and — for inline mode — inline_recipe_source ∈ {user_dockerfile, agent_replay}.
repo2rlenv push decides which mode to use by running the OCI Distribution Spec L1–L4 probe protocol against every registry it finds credentials for in ~/.docker/config.json. The probe never pushes anything to the user's registry — it confirms reachability + auth + read + write by starting a blob-upload session and immediately cancelling it. Run repo2rlenv push --check-auth to see the probe output for your machine.
Push flags:
| Flag | Behaviour |
|---|---|
| (default) | Auto-detect a verified registry; fall back to inline-Dockerfile mode with a warning if none |
| --image-registry <prefix> | Force a specific registry (e.g. ghcr.io/myorg); probed for write access before push |
| --inline-dockerfile | Skip image push; bake the recipe into each task. Recipe-level reproducibility. |
| --require-registry | Hard-fail if no verified registry is available (CI / launch mode). No silent fallback. |
| --skip-image-push | Rewrite tasks against a remote ref that already exists at the registry. No docker push. |
| --image-visibility public\|private\|inherit | Visibility for the pushed image (GHCR auto-flips via the GitHub API). Default: match dataset. |
| --check-auth | Probe every detected registry and exit. --fast skips L3/L4; --json for CI. |
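The mode selection implied by the flag table can be sketched as a pure function. `choose_mode` and `NoVerifiedRegistry` are hypothetical names; treating a forced registry that failed the probe as an error, and letting `--inline-dockerfile` take precedence over other flags, are assumptions the table does not spell out.

```python
from typing import Optional


class NoVerifiedRegistry(RuntimeError):
    """Raised when a registry is required but none passed the probe."""


def choose_mode(
    verified_registries: list[str],
    *,
    image_registry: Optional[str] = None,
    inline_dockerfile: bool = False,
    require_registry: bool = False,
) -> tuple[str, Optional[str]]:
    """Return (mode, registry_prefix) following the flag table."""
    if inline_dockerfile:
        return ("inline_dockerfile", None)
    if image_registry is not None:
        if image_registry in verified_registries:
            return ("registry", image_registry)
        # Assumption: a forced registry without write access is an error.
        raise NoVerifiedRegistry(f"{image_registry} failed the write probe")
    if verified_registries:
        return ("registry", verified_registries[0])
    if require_registry:
        raise NoVerifiedRegistry("no verified registry available")
    # Default fallback; the real CLI is described as emitting a warning here.
    return ("inline_dockerfile", None)
```

Keeping the decision separate from the probing and pushing code makes the fallback rules easy to exercise without network access.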
A task or dataset is conformant to v0.1 if and only if:
- repo2rlenv validate <path> exits 0
- task.toml is valid TOML and contains [task].name
- The named [metadata.repo2env.pipeline] matches a registered pipeline
- solution/patch.diff exists and is non-empty (lite pipelines)
- For sandbox-required pipelines: environment/Dockerfile and tests/test.sh exist
v0.2 adds:
- Sandbox-required tasks carry [metadata.repo2env.reproducibility] with mode ∈ {registry, inline_dockerfile, local_only}. local_only is pre-publication: these tasks are NOT considered reproducible by external consumers.
Pre-1.0 is a moving target — minor bumps may break readers. After 1.0 we honor strict SemVer (additive minors, breaking majors only). Each released spec version freezes its JSON Schema at a stable URL.
v0.2.0 adds [metadata.repo2env.reproducibility]. This is an additive change: v0.1.0 readers ignore the new subtable, and pre-v0.2 datasets that lack it pass validation unchanged but aren't portable across machines.