This repository demonstrates the layout and code artifacts that the Osmosis GitHub integration discovers and syncs. It bundles a FastMCP tool server, a numeric reward function, and a rubric-based scorer so you can exercise every integration surface locally before connecting to Osmosis.
- Python 3.10+ (3.12 recommended, matching `pyproject.toml`)
- Install dependencies: `pip install "osmosis-ai[mcp]"` (or `uv add "osmosis-ai[mcp]"`)
```bash
# Optional: create a virtual environment first
pip install "osmosis-ai[mcp]"   # install dependencies

# Evaluate MCP tools against the sample dataset (requires OPENAI_API_KEY)
osmosis eval --mcp ./mcp -d test_data.jsonl \
  --eval-fn reward_fn.compute_reward:numbers_match_reward \
  --model my-finetuned-model \
  --base-url http://localhost:1234/v1

# Or test without eval functions
osmosis test --mcp ./mcp -d test_data.jsonl --model openai/gpt-5-mini

# Start the MCP server directly (for other use cases)
python mcp/main.py &      # start the FastMCP server on 0.0.0.0:8080
python mcp/test/test.py   # list the published tools

# Run the rubric examples
./scripts/run_reward_rubric_openai.sh     # requires OPENAI_API_KEY
./scripts/run_reward_rubric_anthropic.sh  # requires ANTHROPIC_API_KEY
```

Stop the MCP server with Ctrl+C when you are done, or pass `--host`/`--port` if you need to bind to a different interface.
```
osmosis-git-sync-example/
├── mcp/
│   ├── main.py
│   ├── server/
│   │   ├── __init__.py
│   │   └── mcp_server.py
│   ├── tools/
│   │   ├── __init__.py
│   │   └── math.py
│   └── test/
│       └── test.py
├── reward_fn/
│   └── compute_reward.py
├── reward_rubric/
│   ├── reward_rubric_anthropic.py
│   ├── reward_rubric_openai.py
│   └── reward_rubric_xai.py
├── .github/
│   └── workflows/
│       └── reward_rubric.yml
├── scripts/
│   ├── run_reward_rubric_anthropic.sh
│   └── run_reward_rubric_openai.sh
├── test_data.jsonl        ← sample dataset for osmosis eval/test
├── LICENSE.md
├── pyproject.toml
├── uv.lock
└── README.md
```
- `main.py` starts the FastMCP HTTP transport (`python mcp/main.py`) and accepts `--host`/`--port`.
- `server/mcp_server.py` instantiates `FastMCP("OsmosisTools")` and exposes a `/health` route.
- `tools/__init__.py` exposes every `.py` module in the folder via `__all__`, so `from tools import *` eagerly loads each tool module.
- Tool modules: `math.multiply(first_val, second_val)` multiplies two numbers and rounds to four decimals.
- `test/test.py` shows how to connect with `fastmcp.Client` and list the published tools.
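The core of the `multiply` tool can be sketched in plain Python (the `@mcp.tool()` registration that `mcp/tools/math.py` would carry is omitted here so the snippet stays self-contained):

```python
def multiply(first_val: float, second_val: float) -> float:
    """Multiply two numbers and round the product to four decimal places."""
    # In the repository this function would be decorated with @mcp.tool().
    return round(first_val * second_val, 4)

print(multiply(3.14159, 2))  # → 6.2832
```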
- `compute_reward.py` defines `numbers_match_reward(...)` decorated with `@osmosis_reward`.
- `extract_solution` grabs the first numeric token that follows a markdown-style `####` heading and returns it as text.
- The reward converts the extracted token and ground truth to floats, awarding `1.0` when they match within `1e-7` and `0.0` otherwise (including extraction failures).
This folder contains simplified, provider-specific rubric scoring examples that demonstrate how to use `@osmosis_rubric` with different LLM providers:
- `reward_rubric_anthropic.py` – Uses Anthropic's Claude (`claude-sonnet-4-5-20250929`) for rubric evaluation. Requires the `ANTHROPIC_API_KEY` environment variable.
- `reward_rubric_openai.py` – Uses OpenAI's GPT (`gpt-5-mini`) for rubric evaluation. Requires the `OPENAI_API_KEY` environment variable.
- `reward_rubric_xai.py` – Uses xAI's Grok (`grok-4-fast-non-reasoning`) for rubric evaluation. Requires the `XAI_API_KEY` environment variable.
All files:

- Define an `@osmosis_rubric`-decorated function that delegates scoring to `osmosis_ai.evaluate_rubric`
- Use hardcoded rubric text, score ranges (0.0-1.0), and model configurations for simplicity
- Can be imported and called directly in your Python code or executed as standalone modules
- Accept `solution_str` (the text to evaluate), `ground_truth` (reference answer), and `extra_info` (metadata dictionary)
`reward_rubric.yml` runs the rubric scorers in GitHub Actions whenever files in `reward_rubric/` change on a push or pull request. The job installs the package, injects API keys via secrets, and executes both rubric scripts so reviewers can see automated scores from multiple providers.
- `run_reward_rubric_openai.sh` executes the OpenAI-based rubric scorer. Ensure `OPENAI_API_KEY` is available in the environment before executing.
- `run_reward_rubric_anthropic.sh` executes the Anthropic-based rubric scorer. Ensure `ANTHROPIC_API_KEY` is available in the environment before executing.
```bash
# Local Rollout (MCP tools) - recommended for this repo
pip install "osmosis-ai[mcp]"

# Or, if you use uv
uv add "osmosis-ai[mcp]"

# Other install extras:
# pip install osmosis-ai            # Core SDK only
# pip install "osmosis-ai[server]"  # FastAPI server for Remote Rollout
# pip install "osmosis-ai[full]"    # All features
```

```bash
# Default: 0.0.0.0:8080
python mcp/main.py

# Custom host/port
python mcp/main.py --host 127.0.0.1 --port 3000

# Health check
curl http://localhost:8080/health
```

```bash
python mcp/test/test.py
```

The script connects to http://0.0.0.0:8080/mcp, confirms the session, and lists the registered tools.
Make sure dependencies are installed (see Installing dependencies).
```bash
export OPENAI_API_KEY=sk-your-key
./scripts/run_reward_rubric_openai.sh
```

Or call the function directly in Python:

```python
from reward_rubric.reward_rubric_openai import compute_rubric_score_openai

score = compute_rubric_score_openai(
    solution_str="The predicted value is 42",
    ground_truth="42",
    extra_info={"metadata": {"context": "test"}}
)
print(f"Score: {score}")
```

```bash
export ANTHROPIC_API_KEY=sk-ant-your-key
./scripts/run_reward_rubric_anthropic.sh
```

Or call the function directly in Python:

```python
from reward_rubric.reward_rubric_anthropic import compute_rubric_score_anthropic

score = compute_rubric_score_anthropic(
    solution_str="The predicted value is 42",
    ground_truth="42",
    extra_info={"metadata": {"context": "test"}}
)
print(f"Score: {score}")
```

Both scripts evaluate whether the solution matches the ground truth and return a score between 0.0 and 1.0.
Git-sync users can evaluate and test their MCP tools locally using `--mcp`, without writing a `RolloutAgentLoop`. The SDK loads all `@mcp.tool()` functions from the `mcp/` directory and runs a standard agent loop automatically.
```bash
pip install "osmosis-ai[mcp]"
```

`osmosis eval` runs your MCP tools against a dataset, scores each run with eval functions, and reports aggregated metrics (mean, std, pass@k).
```bash
# Basic eval - uses reward_fn/compute_reward.py as the eval function
osmosis eval \
  --mcp ./mcp \
  -d test_data.jsonl \
  --eval-fn reward_fn.compute_reward:numbers_match_reward \
  --model openai/gpt-5-mini

# Eval against a trained model endpoint
osmosis eval \
  --mcp ./mcp \
  -d test_data.jsonl \
  --eval-fn reward_fn.compute_reward:numbers_match_reward \
  --model my-finetuned-model \
  --base-url http://localhost:8000/v1

# Compare trained model vs GPT-5-mini baseline (win/loss/tie report)
osmosis eval \
  --mcp ./mcp \
  -d test_data.jsonl \
  --eval-fn reward_fn.compute_reward:numbers_match_reward \
  --model my-finetuned-model --base-url http://localhost:8000/v1 \
  --baseline-model openai/gpt-5-mini

# Compare two serving endpoints
osmosis eval \
  --mcp ./mcp \
  -d test_data.jsonl \
  --eval-fn reward_fn.compute_reward:numbers_match_reward \
  --model my-model-v2 --base-url http://localhost:8000/v1 \
  --baseline-model my-model-v1 --baseline-base-url http://localhost:8001/v1

# pass@5 analysis with concurrent execution
osmosis eval \
  --mcp ./mcp \
  -d test_data.jsonl \
  --eval-fn reward_fn.compute_reward:numbers_match_reward \
  --model openai/gpt-5-mini \
  --n 5 --batch-size 5
```
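pass@k is commonly computed with the unbiased estimator from the HumanEval paper: given `n` attempts per row of which `c` passed, the probability that at least one of `k` sampled attempts passes. A sketch of that estimator, assuming (not confirmed from the SDK source) that `osmosis eval` aggregates this way:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    without replacement from n attempts (c passing) is a pass."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 attempts per row, 2 passed
print(round(pass_at_k(5, 2, 2), 3))  # → 0.7
```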
```bash
# Save results to JSON
osmosis eval \
  --mcp ./mcp \
  -d test_data.jsonl \
  --eval-fn reward_fn.compute_reward:numbers_match_reward \
  --model openai/gpt-5-mini \
  -o eval_results.json
```

`osmosis test` runs your MCP tools against a dataset and reports per-row pass/fail, token usage, and latency, which is useful for validating agent behavior before training.
```bash
# Basic test
osmosis test \
  --mcp ./mcp \
  -d test_data.jsonl \
  --model openai/gpt-5-mini

# Interactive debugging - step through each LLM call
osmosis test \
  --mcp ./mcp \
  -d test_data.jsonl \
  --model openai/gpt-5-mini \
  --interactive

# Jump to a specific row in interactive mode
osmosis test \
  --mcp ./mcp \
  -d test_data.jsonl \
  --model openai/gpt-5-mini \
  --interactive --row 3

# Test a subset of rows
osmosis test \
  --mcp ./mcp \
  -d test_data.jsonl \
  --model openai/gpt-5-mini \
  --limit 2 --offset 1

# Save results to JSON
osmosis test \
  --mcp ./mcp \
  -d test_data.jsonl \
  --model openai/gpt-5-mini \
  -o test_results.json
```

When you pass `--mcp ./mcp`, the SDK:
- Imports `mcp/main.py`, which triggers all `@mcp.tool()` registrations
- Discovers registered tools (e.g. `multiply`) and converts them to OpenAI function-calling schemas
- Runs a built-in agent loop that calls the LLM, executes tool calls against your MCP functions, and repeats until the LLM stops calling tools or `--max-turns` is reached
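The loop described above can be sketched with a stubbed LLM. This is a minimal illustration of the control flow, not the SDK's actual implementation; `run_agent_loop`, `call_llm`, and the message shapes are all invented for this example:

```python
def run_agent_loop(call_llm, tools, user_prompt, max_turns=10):
    """Minimal tool-calling loop: query the LLM, execute any requested
    tool calls, feed results back, and stop when no tools are requested."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = call_llm(messages)  # {"content": ..., "tool_calls": [...]}
        messages.append({"role": "assistant", **reply})
        if not reply.get("tool_calls"):
            return reply["content"]  # no more tool calls: final answer
        for call in reply["tool_calls"]:
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})
    return None  # max turns exhausted

# Stubbed LLM: requests multiply once, then answers with the tool result
def fake_llm(messages):
    if messages[-1]["role"] == "tool":
        return {"content": f"The product is {messages[-1]['content']}",
                "tool_calls": []}
    return {"content": None,
            "tool_calls": [{"name": "multiply",
                            "args": {"first_val": 6, "second_val": 7}}]}

tools = {"multiply": lambda first_val, second_val: round(first_val * second_val, 4)}
print(run_agent_loop(fake_llm, tools, "What is 6 x 7?"))  # → The product is 42
```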
This means you can iterate on your tools and reward functions locally, then push to GitHub for Osmosis to sync; no `RolloutAgentLoop` code is needed.
Note: `--mcp` and `-m`/`--module` are mutually exclusive. Use `--mcp` for git-sync projects; use `-m` for remote-rollout projects that implement `RolloutAgentLoop`.
Models can be specified in two formats:

- Simple: `gpt-5-mini` (auto-prefixed to `openai/gpt-5-mini`)
- LiteLLM format: `provider/model` (e.g., `anthropic/claude-sonnet-4-5`)
See LiteLLM Providers for the full list of supported providers.
The `-d`/`--dataset` flag accepts three formats:

- Parquet (recommended): compact and fast for large datasets
- JSONL: one JSON object per line (used in this repo)
- CSV: comma-separated values with a header row

Each row must contain `system_prompt`, `user_prompt`, and `ground_truth` columns. Any additional columns are passed as metadata to your agent and reward functions.
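A sketch of appending one such JSONL row with the standard library (the prompt strings and the extra `difficulty` column are illustrative, not copied from `test_data.jsonl`):

```python
import json

# One dataset row: required columns plus an optional metadata column
row = {
    "system_prompt": "You are a calculator. Put the final answer after a #### heading.",
    "user_prompt": "What is 6.1 times 7?",
    "ground_truth": "42.7",
    "difficulty": "easy",  # extra columns pass through as metadata
}

with open("test_data.jsonl", "a") as f:
    f.write(json.dumps(row) + "\n")
```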
Log in to the Osmosis platform for workspace management and training run submission:

```bash
osmosis login                    # Opens browser for authentication
osmosis logout                   # End session and revoke credentials
osmosis whoami                   # Show current user and workspaces
osmosis workspace list           # List all logged-in workspaces
osmosis workspace switch <name>  # Switch to a different workspace
```

```bash
osmosis preview --path test_data.jsonl  # Preview a dataset
```

`osmosis eval-rubric` evaluates conversations against hosted rubric configurations. This is separate from `osmosis eval`, which runs eval functions against agent datasets.
```bash
osmosis eval-rubric --rubric support_followup --data test_data.jsonl
```

- Review the workflow definition: `.github/workflows/reward_rubric.yml` ships with this repo. It installs the package and runs the rubric scorers so you can see the current scores each time the workflow executes.
- Create the expected environment: In your GitHub repository, open Settings → Environments, click New environment, and name it `osmosis-secrets` (the workflow references this environment).
- Add environment secrets: Inside the `osmosis-secrets` environment, use Add environment secret to provide the keys required for evaluation. Add both `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` to test both providers. You can also add `GOOGLE_API_KEY` or `XAI_API_KEY` if you plan to extend the examples.
- Understand the trigger: Any push or pull request that modifies files in `reward_rubric/` automatically runs the workflow so you can check the revised scores before merging the change.
- Review and re-run: After each run, open the Actions tab to inspect the job logs. Use Re-run jobs for the most recent commit, or add a `workflow_dispatch` trigger if you want to run the scorer on demand.
- MCP tools – Every function decorated with `@mcp.tool` (or `@mcp.tool()`) inside `mcp/tools/` is ingested, including type hints and docstrings.
- Reward functions – Functions decorated with `@osmosis_reward` in `reward_fn/` become numeric reward hooks for reinforcement learning pipelines.
- Reward rubrics – Functions decorated with `@osmosis_rubric` in `reward_rubric/` are registered so hosted models can score conversations using your rubric text.
Keep type hints, docstrings, and configuration files current so the Osmosis sync remains accurate.