ONERB-5: prep first SFT Modal run by olety · Pull Request #5 · oneiron-dev/oneiroNER

olety · 2026-07-02T10:27:46Z

No description provided.

coderabbitai · 2026-07-02T10:27:57Z

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 47 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 01a167cf-8fa7-4f66-ba57-d01bb3369dff

📥 Commits

Reviewing files that changed from the base of the PR and between c1a0bec and b237ad3.

📒 Files selected for processing (3)

docs/plan/sft-run-1.md
scripts/modal_train.py
tests/test_train_smoke.py

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/onerb-5

_{Comment @coderabbitai help to get the list of available commands.}

qodo-code-review · 2026-07-02T10:29:10Z

PR Summary by Qodo

ONERB-5: Add Modal SFT run-1 recipe, entrypoint, and smoke unit test

✨ Enhancement 📝 Documentation 🧪 Tests 🕐 40+ Minutes

AI Description

• Add Modal entrypoint to run SFT smoke step and full training on GPU.
• Cache private HF dataset splits and checkpoints in the ner-data Modal Volume.
• Document run commands/costs and add a dataset/label-alignment smoke unit test.

Diagram

graph TD
  A["CLI"] --> B["modal run"] --> C["Modal App"] --> D["scripts/modal_train.py"]
  D --> E{{"HF dataset"}}
  D --> G["model.train / smoke"] --> F[(ner-data)]
  G --> H{{"W&B"}}
  D --> F

  subgraph Legend
    direction LR
    _svc["Service/Script"] ~~~ _vol[(Volume)] ~~~ _ext{{"External"}}
  end

High-Level Assessment

The following are alternative approaches to this PR:

1. Use `model.train --max-steps 1` for smoke (no custom torch loop)

➕ Single training entrypoint for both smoke and full runs
➕ Exercise the exact Trainer/mix-sampling stack used in full training
➖ More moving parts for a first “does it run at all” GPU smoke
➖ Harder to guarantee an immediate checkpoint write after exactly one step

2. Pre-stage dataset JSONLs in the Modal Volume (no HF download at runtime)

➕ Faster and more deterministic job startup
➕ Avoids transient HF/network failures during training runs
➖ Requires a separate ingestion job/process and operational coordination
➖ Harder to ensure the volume contents match a specific dataset revision

Recommendation: Current approach is a good first-run setup: the dedicated one-step smoke validates tokenization/labeling/model forward/backward on real private rows with minimal overhead, and the full mode switches to a durable snapshot-based materialization for long training. Consider the model.train --max-steps 1 smoke alternative later if you want the smoke to cover the full Trainer stack, but keeping the lightweight smoke loop is a pragmatic reliability choice for ONERB-5.

Files changed (3) +619 / -0

Enhancement (1) +401 / -0

modal_train.pyIntroduce Modal runner for smoke step and full SFT training +401/-0

Introduce Modal runner for smoke step and full SFT training

• Adds a Modal app that mounts the 'ner-data' volume, pulls a private HF dataset (streaming for smoke; snapshot for full), and runs either a one-step torch smoke or full 'python -m model.train' as a subprocess. Writes checkpoints/metadata to the volume, logs key metrics to W&B, and emits structured 'ONERB5_*' log lines for evidence collection.

scripts/modal_train.py

Tests (1) +114 / -0

test_train_smoke.pyAdd unit smoke test for dataset batching and BIO label alignment +114/-0

Add unit smoke test for dataset batching and BIO label alignment

• Creates a small JSONL fixture and a minimal character tokenizer to validate 'NerDataset' label placement (B/I tags, ignore index on special tokens) and 'ner_collate_fn' batch shapes/meta. Serves as a fast correctness guard for the training input pipeline.

tests/test_train_smoke.py

Documentation (1) +104 / -0

sft-run-1.mdAdd run-1 SFT execution recipe for Modal +104/-0

Add run-1 SFT execution recipe for Modal

• Documents preconditions, smoke/full commands (H100 with A100-80GB fallback), expected log evidence, checkpoint layout, and W&B project settings. Includes dataset size/step/time/cost estimates and a follow-up eval hook expectation.

docs/plan/sft-run-1.md

qodo-code-review · 2026-07-02T10:31:21Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)

1. ~~Full run timeout too short~~ ✓ Resolved 🐞 Bug ☼ Reliability

Description

run_h100/run_a100 set timeout=2 * HOURS but the documented full run is expected to take
~5.94–7.26 hours on H100, so --mode full will be terminated by Modal before completion. This makes
the primary “full run” workflow unreliable/outage-prone.

Code

scripts/modal_train.py[R342-348]

+@app.function(
+    image=image,
+    gpu="H100",
+    secrets=[modal.Secret.from_name(HF_SECRET_NAME), modal.Secret.from_name(WANDB_SECRET_NAME)],
+    volumes={str(DATA_DIR): data_volume},
+    timeout=2 * HOURS,
+)

Evidence

The plan doc explicitly estimates the full run duration at ~6–7 hours, while the Modal function
timeout is set to 2 hours, ensuring full runs will be killed early.

docs/plan/sft-run-1.md[69-73]
scripts/modal_train.py[342-348]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Modal functions used for both smoke and full training are configured with a 2-hour timeout, but the planned full SFT run is expected to take ~6–7 hours (and potentially longer on A100). This will cause Modal to kill full runs before they finish.

## Issue Context
- The same Modal function handles both `mode=smoke` and `mode=full`.
- The plan doc in this PR estimates 3 epochs at 5.94–7.26 hours on H100.

## Fix Focus Areas
- scripts/modal_train.py[342-368]

### Suggested implementation direction
- Either:
 - Split into separate Modal functions for smoke vs full (e.g., `run_h100_smoke` with a short timeout and `run_h100_full` with a long timeout), and route based on `config.mode`, **or**
 - Increase the existing timeout to safely cover the full run (e.g., 10–12 hours for H100, potentially higher for A100 fallback).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. ~~Buffers full stdout in memory~~ ✓ Resolved 🐞 Bug ➹ Performance

Description

_run_full_training appends every subprocess stdout line into a list and only uses the last 40
lines, causing avoidable linear memory growth with log volume during multi-hour training. This can
degrade performance and can crash long/verbose runs due to excessive memory usage.

Code

scripts/modal_train.py[R317-339]

+    proc = subprocess.Popen(
+        cmd,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        env=env,
+        bufsize=1,
+    )
+    lines: list[str] = []
+    assert proc.stdout is not None
+    for line in proc.stdout:
+        print(line, end="", flush=True)
+        lines.append(line)
+    proc.wait()
+    data_volume.commit()
+    return {
+        "mode": "full",
+        "run_name": run_name,
+        "returncode": proc.returncode,
+        "output_dir": str(output_dir),
+        "data": data_info,
+        "tail": "".join(lines[-40:]),
+    }

Evidence

The code currently accumulates all stdout lines in memory (lines.append(line)) during full
training; combined with the plan’s multi-hour runtime, this creates unnecessary and potentially
large memory growth.

scripts/modal_train.py[317-339]
docs/plan/sft-run-1.md[69-73]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_run_full_training` stores all stdout lines from the training subprocess in a Python list even though it only returns a short tail. For long runs this creates linear memory growth with log output.

## Issue Context
- Full runs are expected to last hours per the run plan.
- The code returns `"tail": "".join(lines[-40:])` but still retains the entire `lines` list.

## Fix Focus Areas
- scripts/modal_train.py[317-339]

### Suggested implementation direction
- Replace `lines: list[str]` with `collections.deque(maxlen=N)` (e.g., N=2000) so only a bounded tail is retained.
- Or write logs to a file in `/data/` (Volume) and only keep a small in-memory tail for the return payload.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copilot

Pull request overview

Prepares the codebase for the first SFT run on Modal by adding a Modal training entrypoint, a concrete run plan, and a small smoke test to validate NerDataset label alignment and batching behavior.

Changes:

Add scripts/modal_train.py Modal entrypoint supporting a one-step smoke run and a full training run with dataset caching to a Modal Volume.
Add docs/plan/sft-run-1.md documenting commands, expected logs, dataset scale, and checkpoint layout for “SFT Run 1”.
Add tests/test_train_smoke.py smoke test covering dataset bucket indexing, label alignment, and ner_collate_fn batch shapes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
tests/test_train_smoke.py	Adds a small fixture-backed smoke test for `NerDataset`/`ner_collate_fn` label alignment and batch shapes.
scripts/modal_train.py	Introduces Modal job entrypoint for smoke/full training, HF dataset materialization into a Modal Volume, and checkpoint/metadata writing.
docs/plan/sft-run-1.md	Documents the operational recipe for the first SFT run (commands, expected evidence, cost/throughput, checkpoints).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+@app.function(
+    image=image,
+    gpu="H100",
+    secrets=[modal.Secret.from_name(HF_SECRET_NAME), modal.Secret.from_name(WANDB_SECRET_NAME)],
+    volumes={str(DATA_DIR): data_volume},
+    timeout=2 * HOURS,
+)


+@app.function(
+    image=image,
+    gpu="A100-80GB",
+    secrets=[modal.Secret.from_name(HF_SECRET_NAME), modal.Secret.from_name(WANDB_SECRET_NAME)],
+    volumes={str(DATA_DIR): data_volume},
+    timeout=2 * HOURS,
+)


+    if SMOKE_TRAIN_PATH.exists() and SMOKE_VAL_PATH.exists():
+        train_rows = sum(1 for _ in SMOKE_TRAIN_PATH.open("rb"))
+        val_rows = sum(1 for _ in SMOKE_VAL_PATH.open("rb"))
+        return SMOKE_TRAIN_PATH, SMOKE_VAL_PATH, {
+            "source": "modal_volume_cache",
+            "train_rows": train_rows,
+            "val_rows": val_rows,
+        }
+
+    from datasets import load_dataset
+
+    rows_needed = max(config.smoke_records, config.batch_size, 8)


+    if TRAIN_PATH.exists() and VAL_PATH.exists():
+        return TRAIN_PATH, VAL_PATH, {
+            "source": "modal_volume_cache",
+            "train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
+            "val_rows": sum(1 for _ in VAL_PATH.open("rb")),
+        }


+    data_volume.commit()
+    return TRAIN_PATH, VAL_PATH, {
+        "source": config.dataset_id,
+        "train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
+        "val_rows": sum(1 for _ in VAL_PATH.open("rb")),
+    }


Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

+    if SMOKE_TRAIN_PATH.exists() and SMOKE_VAL_PATH.exists():
+        train_rows = sum(1 for _ in SMOKE_TRAIN_PATH.open("rb"))
+        val_rows = sum(1 for _ in SMOKE_VAL_PATH.open("rb"))
+        return SMOKE_TRAIN_PATH, SMOKE_VAL_PATH, {
+            "source": "modal_volume_cache",
+            "train_rows": train_rows,
+            "val_rows": val_rows,
+        }
+
+    from datasets import load_dataset
+
+    rows_needed = max(config.smoke_records, config.batch_size, 8)


+    if TRAIN_PATH.exists() and VAL_PATH.exists():
+        return TRAIN_PATH, VAL_PATH, {
+            "source": "modal_volume_cache",
+            "train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
+            "val_rows": sum(1 for _ in VAL_PATH.open("rb")),
+        }


+    return TRAIN_PATH, VAL_PATH, {
+        "source": config.dataset_id,
+        "train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
+        "val_rows": sum(1 for _ in VAL_PATH.open("rb")),
+    }


+            "run_name": run_name,
+            "step": 1,
+            "checkpoint_path": str(step_dir),
+            "wandb_project": f"{config.wandb_entity}/{config.wandb_project}",


olety · 2026-07-02T14:54:42Z

ONERB-5 Modal smoke re-run after secrets became available did not reach the smoke step.

Secrets visible by name under profile oneiron-dev: huggingface, wandb.
Clean checkout: origin/main at e0b55ea.

Exact command attempted:

modal run scripts/modal_train.py --mode smoke --gpu H100 --max-steps 1 --batch-size 32 --run-name onerb-5-smoke

Run URL: https://modal.com/apps/oneiron-dev/main/ap-xQHJkO2LwetCfzdfrvj8oA

Result:

ModuleNotFoundError: No module named model

No ONERB5_SMOKE_STEP line was emitted. A module-mode routing attempt with the same args also failed with the same import error: https://modal.com/apps/oneiron-dev/main/ap-LBSFRffv9qrdvo3RUTxHAC

Flag updated: /Users/olety/Desktop/code/fable-queue/oneironer-wave/onerb-5.flag.md.

olety · 2026-07-02T15:46:45Z

ONERB-5 Modal smoke evidence (post-merge follow-up): the import blocker was fixed on main in 46d3c89 (PYTHONPATH=/app on the Modal image). 1-step H100 smoke now passes end-to-end.

Run URL: https://modal.com/apps/oneiron-dev/main/ap-h07kyJ1GAAZ3dAa7efND5r

ONERB5_SMOKE_STEP {"batch_shape": [32, 512], "checkpoint_path": "/data/checkpoints/sft-run-1/onerb-5-smoke/step-1", "loss": 4.1000895500183105, "run_name": "onerb-5-smoke", "source": "olety/oneiron-ner-labeled", "train_rows": 64, "val_rows": 8}

W&B run onerb-5-smoke synced to entity oneiron-dev (global_step / peak_vram_mb / train_loss logged). Checkpoint on Modal Volume ner-data. Flag retired.

Copilot AI review requested due to automatic review settings July 2, 2026 10:27

olety marked this pull request as ready for review July 2, 2026 10:27

Copilot started reviewing on behalf of olety July 2, 2026 10:28 View session

qodo-code-review Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread scripts/modal_train.py

Comment thread scripts/modal_train.py

Copilot AI reviewed Jul 2, 2026

View reviewed changes

olety added 4 commits July 2, 2026 19:34

Add first SFT Modal run prep

1e4700e

Fix Modal full run timeout

b85b102

Bound Modal full run log tail

836b0ce

Update SFT eval hook for merged ONERB-1

b237ad3

Copilot AI review requested due to automatic review settings July 2, 2026 10:35

olety force-pushed the codex/onerb-5 branch from 973340d to b237ad3 Compare July 2, 2026 10:35

Copilot started reviewing on behalf of olety July 2, 2026 10:36 View session

olety merged commit c34ab87 into main Jul 2, 2026
2 checks passed

olety deleted the codex/onerb-5 branch July 2, 2026 10:37

Copilot AI reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ONERB-5: prep first SFT Modal run#5

ONERB-5: prep first SFT Modal run#5
olety merged 4 commits into
mainfrom
codex/onerb-5

olety commented Jul 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Review limit reached

Uh oh!

qodo-code-review Bot commented Jul 2, 2026

Uh oh!

qodo-code-review Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

olety commented Jul 2, 2026

Uh oh!

olety commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

olety commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

qodo-code-review Bot commented Jul 2, 2026

PR Summary by Qodo

Uh oh!

qodo-code-review Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

olety commented Jul 2, 2026

Uh oh!

olety commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

olety commented Jul 2, 2026 •

edited

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

qodo-code-review Bot commented Jul 2, 2026 •

edited

Loading