27 changes: 27 additions & 0 deletions .cursor/BUGBOT.md
@@ -24,3 +24,30 @@ Any PR that adds or removes an environment from the `environments/` folder must
- Update the "What to look at for each pattern" section if applicable

If an environment is added or removed without a corresponding `environments/README.md` update, request that the author add the necessary changes.

## Skills Updates

Any PR that changes user-facing Prime or Verifiers workflows for environment development, browsing, review, evaluation, GEPA optimization, or RL training must update the corresponding skills under `skills/`.

This includes changes to command contracts, defaults, or behavior in:

- `docs/overview.md`
- `docs/environments.md`
- `docs/evaluation.md`
- `docs/training.md`
- `docs/faqs.md`
- `docs/prime_cli_verifiers_unification_design.md`
- `verifiers/scripts/*.py`
- `verifiers/cli/plugins/prime.py`

When these files change, verify and update any affected skill files:

- `skills/create-environments/SKILL.md`
- `skills/browse-environments/SKILL.md`
- `skills/review-environments/SKILL.md`
- `skills/evaluate-environments/SKILL.md`
- `skills/optimize-with-environments/SKILL.md`
- `skills/train-with-environments/SKILL.md`
- `skills/brainstorm/SKILL.md`

If workflow-relevant changes are detected without matching skill updates, request that the author update the impacted skills before merge.
10 changes: 7 additions & 3 deletions README.md
@@ -73,8 +73,12 @@ prime lab setup
This sets up a Python project if needed (with `uv init`), installs `verifiers` (with `uv add verifiers`), creates the recommended workspace structure, and downloads useful starter files:
```
configs/
-├── endpoints.py # OpenAI-compatible API endpoint configuration
-└── lab/ # Example configs for Hosted Training
+├── endpoints.toml # OpenAI-compatible API endpoint configuration
+├── rl/ # Example configs for Hosted Training
+├── eval/ # Example multi-environment eval configs
+└── gepa/ # Example configs for prompt optimization
+.prime/
+└── skills/ # Bundled workflow skills for create/browse/review/eval/GEPA/train/brainstorm
environments/
└── AGENTS.md # Documentation for AI coding agents
AGENTS.md # Top-level documentation for AI coding agents
@@ -136,7 +140,7 @@ To run a local evaluation with any OpenAI-compatible model, do:
```bash
prime eval run my-env -m gpt-5-nano # run and save eval results locally
```
-Evaluations use [Prime Inference](https://docs.primeintellect.ai/inference/overview) by default; configure your own API endpoints in `./configs/endpoints.py`.
+Evaluations use [Prime Inference](https://docs.primeintellect.ai/inference/overview) by default; configure your own API endpoints in `./configs/endpoints.toml`.

View local evaluation results in the terminal UI:
```bash
2 changes: 2 additions & 0 deletions assets/agents/end_user_best_practices.md
@@ -2,6 +2,8 @@

Use this guidance in projects created via `prime lab setup`.

- Treat `.prime/skills/` as the canonical skill entrypoint in Lab workspaces. Use the bundled skills first for create/browse/review/eval/GEPA/train/brainstorm workflows before ad hoc approaches.
- Keep endpoint aliases in `./configs/endpoints.toml` and use `endpoint_id`/model shortcuts in commands and configs.
- Use the documented workspace flow: `prime env init` → `prime env install` → `prime eval run`.
- Keep each environment self-contained under `environments/<env_name>/` with `pyproject.toml`, implementation, and README.
- Document required environment variables in README and validate missing keys early with `vf.ensure_keys(...)`.
2 changes: 2 additions & 0 deletions assets/lab/AGENTS.md
@@ -17,6 +17,8 @@ These points are direct restatements of Verifiers docs so agents can follow the

Use this guidance in projects created via `prime lab setup`.

- Treat `.prime/skills/` as the canonical skill entrypoint in Lab workspaces. Use the bundled skills first for create/browse/review/eval/GEPA/train/brainstorm workflows before ad hoc approaches.
- Keep endpoint aliases in `./configs/endpoints.toml` and use `endpoint_id`/model shortcuts in commands and configs.
- Use the documented workspace flow: `prime env init` → `prime env install` → `prime eval run`.
- Keep each environment self-contained under `environments/<env_name>/` with `pyproject.toml`, implementation, and README.
- Document required environment variables in README and validate missing keys early with `vf.ensure_keys(...)`.
27 changes: 8 additions & 19 deletions docs/evaluation.md
@@ -66,34 +66,23 @@ prime eval run my-env -x '{"max_turns": 20}'
| `--model` | `-m` | `openai/gpt-4.1-mini` | Model name or endpoint alias |
| `--api-base-url` | `-b` | `https://api.pinference.ai/api/v1` | API base URL |
| `--api-key-var` | `-k` | `PRIME_API_KEY` | Environment variable containing API key |
-| `--endpoints-path` | `-e` | `./configs/endpoints.toml` | Path to endpoints registry (`.toml` preferred, `.py` supported) |
+| `--endpoints-path` | `-e` | `./configs/endpoints.toml` | Path to TOML endpoints registry |
| `--header` | — | — | Extra HTTP header (`Name: Value`), repeatable |

-For convenience, define model endpoints in `./configs/endpoints.toml` (or `./configs/endpoints.py`) to avoid repeating URL and key flags.
-
-```python
-ENDPOINTS = {
-    "gpt-4.1-mini": {
-        "model": "gpt-4.1-mini",
-        "url": "https://api.openai.com/v1",
-        "key": "OPENAI_API_KEY",
-    },
-    "qwen3-235b-i": {
-        "model": "qwen/qwen3-235b-a22b-instruct-2507",
-        "url": "https://api.pinference.ai/api/v1",
-        "key": "PRIME_API_KEY",
-    },
-}
-```
-
-Equivalent TOML format:
+For convenience, define model endpoints in `./configs/endpoints.toml` to avoid repeating URL and key flags.

```toml
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

+[[endpoint]]
+endpoint_id = "qwen3-235b-i"
+model = "qwen/qwen3-235b-a22b-instruct-2507"
+url = "https://api.pinference.ai/api/v1"
+key = "PRIME_API_KEY"
```

To define equivalent replicas, add multiple `[[endpoint]]` entries with the same `endpoint_id`.
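A minimal sketch of that replica pattern, reusing the alias above (the second provider URL and its key variable are illustrative placeholders, not real endpoints):

```toml
[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://example-provider.invalid/v1"
key = "OTHER_PROVIDER_API_KEY"
```

Both entries share the same `endpoint_id`, so the registry treats them as interchangeable replicas of one model alias.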
2 changes: 2 additions & 0 deletions docs/overview.md
@@ -35,6 +35,8 @@ configs/
├── rl/ # Example configs for Hosted Training
├── eval/ # Example multi-environment eval configs
└── gepa/ # Example configs for prompt optimization
.prime/
└── skills/ # Bundled workflow skills for create/browse/review/eval/GEPA/train/brainstorm
environments/
└── AGENTS.md # Documentation for AI coding agents
AGENTS.md # Top-level documentation for AI coding agents
2 changes: 1 addition & 1 deletion docs/training.md
@@ -33,7 +33,7 @@ Use the `prime lab setup` script to download example configuration files for Hosted
prime lab setup
```

-This will download example TOML configs for Hosted Training into `configs/rl/`, example eval configs into `configs/eval/`, along with `endpoints.toml` and GEPA starter configs in `configs/gepa/`:
+This will download example TOML configs for Hosted Training into `configs/rl/`, example eval configs into `configs/eval/`, along with `configs/endpoints.toml` and GEPA starter configs in `configs/gepa/`:

```
configs/
54 changes: 54 additions & 0 deletions skills/brainstorm/SKILL.md
@@ -0,0 +1,54 @@
---
name: brainstorm
description: Run interactive brainstorming across verifiers environments, evaluations, GEPA, and RL training. Use when the user wants ideation, literature scanning, concept teaching, roadmap planning, or research program design grounded in local CLI sources, verifiers, and RL trainer code.
---

# Brainstorm

## Goal
Run structured, interactive ideation that turns ambiguous research goals into concrete environment and evaluation plans.

## Interaction Style
1. Drive an iterative conversation, not a one-shot dump.
2. Ask focused clarifying questions before proposing large plans.
3. Keep suggestions toolchain-native: CLI, verifiers, and RL trainer workflows.

## Discovery Workflow
1. Clarify objective, model family, budget, and timeline.
2. Map objective to workflow levers:
- environment creation or migration
- benchmark/eval design
- GEPA prompt optimization
- RL training
3. Build a short option set, then deepen only selected options.
4. Nudge model-family intent explicitly:
- Instruct-first exploration defaults: `gpt-4.1` series, `qwen3` instruct series.
- Reasoning-first exploration defaults: `gpt-5` series, `qwen3` thinking series, `glm` series.
- Recommend endpoint aliases in `configs/endpoints.toml` for repeatable experiments.

## Required Grounding Sources
1. Read local source before proposing workflows:
- `~/dev/prime-cli`
- `~/dev/prime-rl` (clone to `/tmp` only if needed)
- current verifiers workspace docs/configs
2. For literature and external eval ideas, browse web sources and prioritize mid-2025 onward unless the user asks otherwise.
3. Include dates when discussing recent papers or benchmarks.

## Concept Teaching Mode
When asked to explain RL or environment concepts:
1. Anchor explanations in prime-rl and verifiers terminology.
2. Use concrete config and rollout examples.
3. Distinguish binary-reward and continuous-reward training implications.
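The binary-versus-continuous distinction in step 3 can be shown with a self-contained sketch. This is plain Python for teaching purposes, not the verifiers rubric API:

```python
import difflib

def binary_reward(completion: str, answer: str) -> float:
    # All-or-nothing: an exact match scores 1.0, anything else 0.0.
    # Sparse signal; RL only learns from the rare exact hits.
    return 1.0 if completion.strip() == answer.strip() else 0.0

def continuous_reward(completion: str, answer: str) -> float:
    # Graded similarity in [0, 1]: partial credit yields a denser
    # training signal and smoother learning curves.
    return difflib.SequenceMatcher(
        None, completion.strip(), answer.strip()
    ).ratio()

print(binary_reward("42", "42"))                # 1.0
print(binary_reward("the answer is 42", "42"))  # 0.0
print(continuous_reward("the answer is 42", "42"))
```

A near-miss completion scores 0.0 under the binary rule but earns partial credit under the continuous one, which is exactly the training implication worth teaching.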

## Planning Output Format
Produce:
1. Problem framing and assumptions.
2. Candidate environment or eval ideas, ranked by expected value and implementation effort.
3. Experiment plan with milestones, metrics, and go/no-go gates.
4. Risks, dependencies, and required decisions from the user.
5. Distribution plan for mature environments: recommend Hub push after smoke-test stability and ask whether visibility should be `PUBLIC` or `PRIVATE`.

## Quality Guardrails
1. Do not make hidden assumptions about benchmark prompt formatting or scoring contracts.
2. Flag platform limitations clearly and pause for user direction when blocked.
3. Prefer official first-party capabilities before suggesting custom third-party tooling.
67 changes: 67 additions & 0 deletions skills/browse-environments/SKILL.md
@@ -0,0 +1,67 @@
---
name: browse-environments
description: Discover and inspect verifiers environments through the Prime ecosystem. Use when asked to find environments on the Hub, compare options, inspect metadata, check action status, pull local copies for inspection, or choose environment starting points before evaluation, training, or migration work.
---

# Browse Environments

## Goal
Use Prime ecosystem commands to discover environments quickly, inspect quality signals, and pick the right starting point.

## Primary Discovery Workflow
1. List candidate environments:
```bash
prime env list --search "math" --sort stars --show-actions
```
2. Narrow results with owner, tags, mine, or starred filters:
```bash
prime env list --owner primeintellect --tag tools --tag sandbox
prime env list --mine
prime env list --starred
```
3. Inspect details for shortlisted candidates:
```bash
prime env info owner/name
prime env status owner/name
```
4. Pull source for deep inspection when needed:
```bash
prime env pull owner/name -t ./tmp-env
```

## Compare Candidates
For each candidate, collect:
1. Task type and horizon: single-turn, multi-turn, tool, sandbox.
2. Reward type: binary, continuous, judge-based, mixed.
3. Dependencies and secrets requirements.
4. Latest action status and version signal.
5. Fit to user goal: eval-only, GEPA, RL, or benchmark migration.

## Endpoint And Model Selection Nudge
1. Encourage users to configure endpoint aliases in `configs/endpoints.toml` before comparison evals.
2. Ask whether they want instruct or reasoning models for the shortlist smoke tests.
3. Instruct go-tos: `gpt-4.1` series, `qwen3` instruct series.
4. Reasoning go-tos: `gpt-5` series, `qwen3` thinking series, `glm` series.

## Prefer Official Ecosystem Paths
1. Prefer Hub and Prime CLI workflows before manual third-party setup.
2. Use install + smoke eval to validate real usability:
```bash
prime env install owner/name
prime eval run name -m gpt-4.1-mini -n 5
```
3. For examples in the verifiers repository, use repo install path when available:
```bash
prime env install reverse-text --from-repo
```

## Anti-Patterns
1. Do not recommend building from scratch if a strong ecosystem option exists.
2. Do not rely on README claims without running at least one quick eval.
3. Do not hide incompatibilities or missing dependencies.

## Output Format
Return:
1. Ranked shortlist with one-line rationale per environment.
2. Exact commands to install and run each shortlisted option.
3. Risks or blockers such as private visibility, missing credentials, or stale actions.
107 changes: 107 additions & 0 deletions skills/create-environments/SKILL.md
@@ -0,0 +1,107 @@
---
name: create-environments
description: Create or migrate verifiers environments for the Prime Lab ecosystem. Use when asked to build a new environment from scratch, port an eval or benchmark from papers or other libraries, start from an environment on the Hub, or convert existing tasks into a package that exposes load_environment and installs cleanly with prime env install.
---

# Create Environments

## Goal
Build production-quality verifiers environments that work immediately in the Prime ecosystem: install, load, evaluate, and train without hidden setup.

## Start With Ecosystem Paths
1. Prefer ecosystem-native setup before custom scaffolding.
2. Use this default loop:
```bash
prime env init my-env
prime env install my-env
prime eval run my-env -m gpt-4.1-mini -n 5
```
3. Prefer an existing environment as a starting point when possible:
```bash
prime env list --search "keyword"
prime env info owner/name
prime env install owner/name
```
4. For repository examples, use repo install when available:
```bash
prime env install math-python --from-repo
```
5. Encourage users to keep endpoint aliases in `configs/endpoints.toml` so smoke tests can switch models quickly.
6. Ask users whether they want instruct or reasoning models for validation.
7. Instruct-first smoke choices: `gpt-4.1` series, `qwen3` instruct series.
8. Reasoning validation choices: `gpt-5` series, `qwen3` thinking series, `glm` series.

## Build Modes

### 1. Build From Scratch
1. Define task contract first: prompt shape, allowed tools, stop conditions, rubric outputs, metrics.
2. Select the smallest correct base class:
- `SingleTurnEnv` for one-response tasks.
- `MultiTurnEnv` for custom interaction loops.
- `ToolEnv` or `MCPEnv` for stateless tools.
- `StatefulToolEnv` for per-rollout resources.
3. Implement `load_environment(...) -> vf.Environment` with explicit arguments.
4. Add `pyproject.toml` defaults in `[tool.verifiers.eval]` only when stable.
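Step 4 might look like the fragment below. The key names here are illustrative assumptions mirroring the `-m`/`-n`/`-r` eval flags, so verify them against the verifiers docs before relying on them:

```toml
[tool.verifiers.eval]
# Hypothetical default keys -- confirm against the verifiers docs.
model = "gpt-4.1-mini"
num_examples = 50
rollouts_per_example = 1
```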

### 2. Port From Another Library, Project, or Paper
1. Create a strict source-to-target mapping before coding:
- dataset rows and splits
- prompt rendering and role ordering
- tool I/O schema and stop logic
- scoring math and aggregation
- pass/fail thresholds and special cases
2. Preserve one-to-one logical equivalence for what the model sees and what gets scored.
3. Never invent unresolved formatting decisions. Ask the user to decide explicitly.
4. Benchmark runtime and remove avoidable bottlenecks before handoff.
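The equivalence requirement in steps 1 and 2 can be spot-checked with a small parity harness. Here `source_score` and `ported_score` are hypothetical stand-ins for the reference implementation's scorer and the ported environment's rubric; swap in the real functions:

```python
def source_score(row: dict) -> float:
    # Placeholder for the original benchmark's scoring rule.
    return 1.0 if row["prediction"].strip() == row["answer"].strip() else 0.0

def ported_score(row: dict) -> float:
    # Placeholder for the ported environment's scoring rule.
    return 1.0 if row["prediction"].strip() == row["answer"].strip() else 0.0

def parity_mismatches(rows: list[dict], tol: float = 1e-9) -> list[dict]:
    # Return every row where the two scorers disagree beyond tolerance.
    return [r for r in rows if abs(source_score(r) - ported_score(r)) > tol]

rows = [
    {"prediction": "42", "answer": "42"},
    {"prediction": "41", "answer": "42"},
]
assert parity_mismatches(rows) == []  # scoring agrees row-for-row
```

Running this over a representative sample of dataset rows before handoff catches silent drift in scoring math or aggregation.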

### 3. Start From Hub Environment
1. Install or pull the closest baseline:
```bash
prime env install owner/name
prime env pull owner/name -t ./tmp-env
```
2. Keep proven interfaces stable unless a migration is deliberate and explicit.
3. Re-run smoke evals after each major change.

## Non-Negotiable Quality Rules
1. Use deterministic, well-defined reward checks or LLM judges.
2. Avoid best-effort deterministic heuristics such as keyword-style checks except as an explicit last resort with user sign-off.
3. Make environments self-contained after install. Do not require users to run background servers before `load_environment()`.
4. Manage external resources inside the environment lifecycle.
5. Validate required secrets in `load_environment()` via `vf.ensure_keys(...)`.
6. Surface feature limits directly. Do not ship hacky workarounds without explicit user approval.

## Verification Gate
Run these before claiming completion:
```bash
prime env install my-env
prime eval run my-env -m gpt-4.1-mini -n 5
prime eval run my-env -m gpt-4.1-mini -n 50 -r 1 -s
```
If multi-turn or tool-heavy, also run with higher rollouts:
```bash
prime eval run my-env -m gpt-4.1-mini -n 30 -r 3 -s
```

## Publish Gate Before Large Evals Or Training
1. After smoke tests pass and behavior is stable, recommend pushing to Hub before large evals or RL training.
2. Ask the user explicitly whether visibility should be `PUBLIC` or `PRIVATE`.
3. Use:
```bash
prime env push --path ./environments/my_env --visibility PUBLIC
```
or
```bash
prime env push --path ./environments/my_env --visibility PRIVATE
```
4. For hosted or large-scale workflows, prefer running with the Hub slug after push:
```bash
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
```

## Deliverable Format
Report:
1. Environment ID and path.
2. Exact install and eval commands used.
3. Port-equivalence notes if migrated.
4. Any unresolved user decisions that block strict fidelity.