Problem
_training_step processes a single group of rollouts per optimizer step; there is no mechanism to accumulate gradients across multiple groups before stepping. With binary rewards and a small group size (N=8), many groups end up with all-zero or all-one rewards, whose group-relative advantages vanish and yield no gradient signal. Accumulating gradients across 2-4 groups before stepping would improve stability and reduce wasted compute.
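A minimal sketch of the proposed accumulation, assuming _training_step computes a scalar policy-gradient-style loss per group (the model, loss, and data here are stand-ins, not the project's actual code):

```python
import torch

ACCUM_GROUPS = 2  # accumulate 2-4 groups per optimizer step, per the proposal

model = torch.nn.Linear(4, 1)            # stand-in for the policy
opt = torch.optim.SGD(model.parameters(), lr=0.1)

step_count = 0
for i in range(8):                        # 8 fake rollout groups
    x = torch.randn(8, 4)                 # one group of N=8 rollouts (fake features)
    rewards = torch.randint(0, 2, (8,)).float()   # binary rewards
    adv = rewards - rewards.mean()        # group-relative advantage; all-0 or all-1 -> zeros
    logits = model(x).squeeze(-1)
    loss = -(adv * logits).mean() / ACCUM_GROUPS  # scale so accumulated grads average
    loss.backward()                       # gradients accumulate across groups
    if (i + 1) % ACCUM_GROUPS == 0:       # step only every ACCUM_GROUPS groups
        opt.step()
        opt.zero_grad()
        step_count += 1
print(step_count)  # -> 4
```

Dividing the loss by ACCUM_GROUPS keeps the effective gradient an average rather than a sum, so the learning rate does not need retuning when ACCUM_GROUPS changes.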
Also related: the per-step backward pass prevents batching multiple steps into one forward pass for GPU efficiency. Consider batching steps that share the same image so the image only needs to be processed once per batched forward.
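A sketch of the grouping step, assuming each pending step carries an image identifier alongside its rollout data (the field names here are hypothetical, not the project's actual schema):

```python
from collections import defaultdict

# Hypothetical pending steps, each tagged with the image it uses.
steps = [
    {"image": "img_a", "rollout": 0},
    {"image": "img_b", "rollout": 1},
    {"image": "img_a", "rollout": 2},
    {"image": "img_a", "rollout": 3},
]

# Group steps by image so each batch can share one image encoding
# in a single forward pass.
batches = defaultdict(list)
for s in steps:
    batches[s["image"]].append(s)

sizes = {img: len(group) for img, group in batches.items()}
print(sizes)  # -> {'img_a': 3, 'img_b': 1}
```

Each per-image batch could then be stacked into one forward pass, with backward deferred until the accumulation window closes.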