GRPO Comprehensive Review — Tracking Issue
This issue tracks all findings from the GRPO code review across both openadapt-ml and openadapt-evals.
Critical — Blocks production training
- feat(grpo): replace custom training loop with TRL GRPOTrainer #35 (openadapt-ml)
- feat(rl-env): support parallel rollout collection via VM pool openadapt-evals#76 (openadapt-evals)
- feat(rl-env): populate observation with task instruction openadapt-evals#77 (openadapt-evals)
Important — Required for external integration / RL training use case
- feat(grpo): add pluggable action format (DSL vs JSON) #36 (openadapt-ml)
- feat(grpo): extend action types (scroll, key, noop) #37 (openadapt-ml)
- feat(grpo): add pluggable reward functions / verifier registry #38 (openadapt-ml)
- test(grpo): add training loop tests with mocked environment #39 (openadapt-ml)
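As a rough illustration of what the "pluggable reward functions / verifier registry" item (#38) could look like, here is a minimal sketch of a name-to-function registry. Everything in it is hypothetical: the names `register_reward` and `get_reward`, the rollout dictionary keys `success` and `num_steps`, and the penalty weight are assumptions for illustration, not the API of openadapt-ml or TRL.

```python
from typing import Callable, Dict

# Hypothetical signature: a reward function maps a rollout summary to a scalar.
RewardFn = Callable[[dict], float]

_REWARD_REGISTRY: Dict[str, RewardFn] = {}


def register_reward(name: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that registers a reward function under a unique name."""
    def decorator(fn: RewardFn) -> RewardFn:
        if name in _REWARD_REGISTRY:
            raise ValueError(f"reward {name!r} is already registered")
        _REWARD_REGISTRY[name] = fn
        return fn
    return decorator


def get_reward(name: str) -> RewardFn:
    """Look up a reward function by name, listing the options on failure."""
    try:
        return _REWARD_REGISTRY[name]
    except KeyError:
        available = sorted(_REWARD_REGISTRY)
        raise KeyError(f"unknown reward {name!r}; available: {available}") from None


@register_reward("task_success")
def task_success(rollout: dict) -> float:
    # Binary reward: 1.0 if the (assumed) 'success' flag is set, else 0.0.
    return 1.0 if rollout.get("success") else 0.0


@register_reward("success_minus_step_penalty")
def success_minus_step_penalty(rollout: dict) -> float:
    # Same binary reward, minus an assumed 0.01 penalty per step taken.
    base = 1.0 if rollout.get("success") else 0.0
    return base - 0.01 * rollout.get("num_steps", 0)
```

With a registry of this kind, a training config could select a reward by string name (for example `get_reward("task_success")`) instead of hard-coding the function in the training loop.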
Medium — Reliability and performance
- feat(grpo): add checkpoint resume #40 (openadapt-ml)
- perf(grpo): use PEFT named adapters for reference policy #41 (openadapt-ml)
- feat(rl-env): add health check and error recovery for rollout collection openadapt-evals#78 (openadapt-evals)
Batch 2 — Prompt, data structure, VLM, and correctness fixes
- fix(grpo): align prompt format across SFT, GRPO, and CoT warmup #43 (openadapt-ml, bug)
- refactor(grpo): replace _grpo_raw_text monkey-patch with proper data structure #44 (openadapt-ml, refactor)
- fix(grpo): VLM processor metadata misalignment after token concatenation #45 (openadapt-ml, bug)
- fix(grpo): inner tokenizer extraction bypasses processor preprocessing #46 (openadapt-ml, bug)
- perf(grpo): use JSONL append-only format for training log #47 (openadapt-ml, performance)
- feat(grpo): add gradient accumulation across multiple groups #48 (openadapt-ml, enhancement)
- fix(grpo): coordinate precision loss in fraction-pixel roundtrip #49 (openadapt-ml, bug)
- fix(grpo): minor issues (regex whitespace, CoT adapter reuse, shutil import) #50 (openadapt-ml, bug)
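To make the "coordinate precision loss in fraction-pixel roundtrip" item (#49) concrete, the sketch below shows one way such a loss can arise. It assumes the loss comes from rounding the normalized fraction to a fixed number of decimal places before converting back to pixels; the issue title does not state the actual cause, and the function names, the 1920-pixel width, and the decimal counts are all hypothetical.

```python
def pixel_to_fraction(x_px: int, width: int, decimals: int) -> float:
    """Normalize a pixel coordinate to [0, 1], rounded to `decimals` places."""
    return round(x_px / width, decimals)


def fraction_to_pixel(x_frac: float, width: int) -> int:
    """Map a normalized coordinate back to the nearest pixel."""
    return round(x_frac * width)


def roundtrip(x_px: int, width: int, decimals: int) -> int:
    """Pixel -> rounded fraction -> pixel, as an assumed action pipeline might do."""
    return fraction_to_pixel(pixel_to_fraction(x_px, width, decimals), width)


# Assumed screen width of 1920 px. With 2 decimals the fraction 0.5213...
# is stored as 0.52, which maps back to pixel 998 rather than 1001.
coarse = roundtrip(1001, 1920, decimals=2)   # 998
# With 4 decimals the stored fraction 0.5214 maps back to pixel 1001.
fine = roundtrip(1001, 1920, decimals=4)     # 1001
```

Under this assumption the error grows with screen width, since one unit in the last stored decimal place corresponds to more pixels on a wider screen. Keeping more decimal places, or keeping coordinates in pixels end to end, avoids the drift in this sketch.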