env worker integration #1714

Merged: samsja merged 69 commits into main from env-worker on Feb 9, 2026
Conversation

mikasenghaas (Member) commented on Feb 3, 2026

This PR integrates the new vf.EnvClient and vf.EnvServer into the PRIME-RL orchestrator as a near drop-in replacement for the previous env worker. Some notes on the general design:

  • The orchestrator still loads and owns a copy of each (train/eval) environment. However, this is mainly for data loading purposes (populating the buffer) and for defining the scheduling logic. The orchestrator itself does not run any rollouts
  • Each loaded environment is put into "server mode", meaning that it owns a vf.EnvClient which is used internally to route requests to the connected environment server. Thus, if env.env_client is present, env.run_rollout and env.run_group will execute on the server
  • By default, environments are configured without an address. This leads the orchestrator to auto-spawn one env server per env as a subprocess sidecar and automatically connect a corresponding client. If an address is given, the orchestrator expects to find an environment server for that environment at the address; it will only create the env client and will not spawn a sidecar env server. This pattern can be used to run env servers in separate containers
  • Because we cannot serialize a client object, all calls to the environment (in server mode) now pass a vf.ClientConfig instead of an actual client. The env server creates clients automatically and caches them based on the config. We use the client_idx field in vf.ClientConfig to "round-robin" clients: instead of round-robining the clients themselves, we now round-robin the client configs. Throughout the codebase, essentially only the setup and types changed
  • Environment servers log to files by default, with log.vf_level verbosity. The logs currently live in the orchestrator run directory under /train/<env-name> for training envs and /eval/<env-name> for eval envs
  • I also added a new entrypoint, uv run env-server, which is a lightweight entrypoint to start an environment server based on our EnvConfig. This can be used to first start an env server (e.g. in a container) which the orchestrator later connects to

Example

Starting an RL run with sidecar env servers works as before:

uv run rl @ examples/reverse_text/rl.toml

Optionally, you can first start a separate env server and then let the orchestrator connect to it:

uv run env-server --env.id reverse-text --env.address tcp://127.0.0.1:5000

# in separate terminal
uv run rl @ examples/reverse_text/rl.toml --orchestrator.env '[{"id": "reverse-text", "address": "tcp://127.0.0.1:5000"}]'

Manually tested features

  • reverse-text single-turn example with validation and online evals (uv run rl @ examples/multi_reverse_text/rl.toml)
  • alphabet-sort multi-turn LoRA example (uv run rl @ examples/alphabet_sort/rl.toml)
  • color-codeword multi-turn VLM example (uv run rl @ configs/multimodal/rl_color_codeword.toml)
  • elastic reverse-text with DNS discovery and elastic inference (uv run rl @ configs/elastic/rl.toml)

Notes on migration

  • The AsyncLimiter moved from the EnvWorker to the Scheduler, which now controls the number of group rollout requests per minute (this should exactly match the previous behavior)
  • The ServerDiscovery that previously lived on the env worker was deprecated and moved onto the
  • Eval environments for online evals use the same pattern of remote server execution, unlike the previous pattern where a single subprocess ran all of the evals. However, we still pause the orchestrator and drain pending requests on the scheduler before starting online evals
  • Somewhat unrelated to this PR, but I reuse the InferencePool abstraction for the OPD teacher clients as well, which gets rid of some redundant code. There is more, but I will leave that for later
  • We deprecate eval and synthesize (and all associated tests/ utils/ ...)
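The per-minute limiting now done by the Scheduler behaves roughly like this hand-rolled sketch. The real code uses an AsyncLimiter; this stand-in class only illustrates the semantics (at most N acquisitions per rolling 60-second window).

```python
import asyncio
import time


class PerMinuteLimiter:
    """Rough stand-in for the Scheduler's AsyncLimiter: allows at most
    `max_per_minute` acquisitions per rolling 60-second window."""

    def __init__(self, max_per_minute: int):
        self.max_per_minute = max_per_minute
        self._stamps: list[float] = []

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window.
            self._stamps = [t for t in self._stamps if now - t < 60.0]
            if len(self._stamps) < self.max_per_minute:
                self._stamps.append(now)
                return
            # Sleep until the oldest stamp leaves the window.
            await asyncio.sleep(60.0 - (now - self._stamps[0]))


async def demo() -> int:
    limiter = PerMinuteLimiter(max_per_minute=5)
    for _ in range(3):  # 3 < 5, so no waiting occurs
        await limiter.acquire()
    return len(limiter._stamps)


print(asyncio.run(demo()))  # → 3
```

Since the limiter now sits on the Scheduler rather than inside each env worker, the rate applies to group rollout requests globally, matching the previous behavior.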

Questions / Discussion

  • @samsja Is it fine to remove eval? What exactly does the watcher do that we cannot do in online eval?
  • @JannikSt Does this work for you for running env workers in containers? I guess you would have to parse the orchestrator config for the training and eval envs, start each env in a container, and then "modify" the orchestrator config to include the addresses -- is this feasible?
  • @JannikSt On the env worker we had auto-restarts -- what do we need these for? Would restarts not be handled by k8s? Let me know how we should best handle this, but the vf.EnvClient probably needs some more resilience
  • We currently rely on this branch from verifiers, which allows each env in vf.EnvGroup (which we use for our training envs) to own a custom env client that is used iff the child env is in "server mode". Long-term, we might want to deprecate the use of the env group because it doesn't actually do much for us apart from concatenating the dataset and routing to envs by task
  • On the env server we currently use the vf logger (standard lib). For hosted RL it would probably be nice to have consistent logging and use the PRIME-RL logger. It's not trivial to do this nicely, so I will leave it for later
  • It would be cool to automatically surface all env logs in the tmux session by default. One option would be to tail -F outputs/<run>/train/* for train envs
  • We should probably turn off any logging from verifiers on the orchestrator by default. It rarely gives useful/new information, but I will make this part of another PR

Missing Features

  • Auto-restarts upon crash: Right now, if an env server dies, the env client crashes, which will crash the orchestrator
  • Wait for env server: Right now we only wait 30s for an env server to become available
  • Logging: We are not logging the event loop lag from the env server as of now. Doing so would probably require us to override vf.EnvServer and inject W&B monitoring
  • Concurrency:
  • Multiple servers per env: Not planned for now.

Note

High Risk
Refactors core rollout generation to run via external vf.EnvServer processes and changes inference client plumbing/types, which can impact training stability, performance, and failure modes. Also removes standalone eval/synthesize tooling and CI integration tests, which may reduce coverage and change user workflows.

Overview
The orchestrator now runs rollouts through verifiers EnvServer sidecars (or remote servers via EnvConfig.address) by wiring each env to a vf.EnvClient, replacing the previous per-env worker subprocess implementation and moving rate limiting into the Scheduler.

Online evals are reworked to evaluate via the same env-server mechanism (no eval subprocess), with new eval_utils/vf_utils helpers, and EnvConfig gains address plus orchestrator-managed extra_env_kwargs. The PR also removes the standalone eval and synthesize entrypoints/configs and drops the GitHub Actions integration-test job, while adding a new env-server CLI entrypoint and updating docs/config examples accordingly.

Written by Cursor Bugbot for commit 02a29d3.

@mikasenghaas mikasenghaas requested a review from samsja February 6, 2026 13:10
vf_logger.handlers.clear()
vf_logger.addHandler(InterceptHandler(prefix=prefix))
vf_logger.setLevel(level.upper())
vf_logger.propagate = False

Unused function intercept_vf_logging defined but never called

Low Severity

The function intercept_vf_logging is defined in vf_utils.py but is never called anywhere in the codebase. The grep search shows the only match is its definition. The orchestrator uses vf.setup_logging(level="CRITICAL") instead. This appears to be dead code that was added but never wired up.


willccbb and others added 2 commits February 8, 2026 17:58
The health check was using `model_name` which gets updated to the LoRA
adapter name after training starts. New inference servers only have the
base model, so they would fail health checks and never get added to the
pool.

Store the original model name as `base_model_name` and use it for health
checks, allowing new servers to be discovered and have the LoRA adapter
loaded on them.

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


self._clients = clients
self._admin_clients = admin_clients
self._skip_model_check = skip_model_check
self._idx_to_client = {client.client_idx: client for client in clients}

Unused _idx_to_client dict never accessed

Low Severity

The _idx_to_client dict is created in StaticInferencePool.__init__ but is never read or accessed anywhere in the class or codebase. This is dead code that adds unnecessary memory allocation and computation.


@samsja samsja merged commit 31b48b8 into main Feb 9, 2026
8 checks passed