Skip to content

ONERB-5: prep first SFT Modal run#5

Merged
olety merged 4 commits into
mainfrom
codex/onerb-5
Jul 2, 2026
Merged

ONERB-5: prep first SFT Modal run#5
olety merged 4 commits into
mainfrom
codex/onerb-5

Conversation

@olety

@olety olety commented Jul 2, 2026

Copy link
Copy Markdown
Member

No description provided.

Copilot AI review requested due to automatic review settings July 2, 2026 10:27
@olety olety marked this pull request as ready for review July 2, 2026 10:27
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 47 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 01a167cf-8fa7-4f66-ba57-d01bb3369dff

📥 Commits

Reviewing files that changed from the base of the PR and between c1a0bec and b237ad3.

📒 Files selected for processing (3)
  • docs/plan/sft-run-1.md
  • scripts/modal_train.py
  • tests/test_train_smoke.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/onerb-5

Comment @coderabbitai help to get the list of available commands.

@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

ONERB-5: Add Modal SFT run-1 recipe, entrypoint, and smoke unit test

✨ Enhancement 📝 Documentation 🧪 Tests 🕐 40+ Minutes

Grey Divider

AI Description

• Add Modal entrypoint to run SFT smoke step and full training on GPU.
• Cache private HF dataset splits and checkpoints in the ner-data Modal Volume.
• Document run commands/costs and add a dataset/label-alignment smoke unit test.
Diagram

graph TD
  A["CLI"] --> B["modal run"] --> C["Modal App"] --> D["scripts/modal_train.py"]
  D --> E{{"HF dataset"}}
  D --> G["model.train / smoke"] --> F[(ner-data)]
  G --> H{{"W&B"}}
  D --> F

  subgraph Legend
    direction LR
    _svc["Service/Script"] ~~~ _vol[(Volume)] ~~~ _ext{{"External"}}
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Use `model.train --max-steps 1` for smoke (no custom torch loop)
  • ➕ Single training entrypoint for both smoke and full runs
  • ➕ Exercise the exact Trainer/mix-sampling stack used in full training
  • ➖ More moving parts for a first “does it run at all” GPU smoke
  • ➖ Harder to guarantee an immediate checkpoint write after exactly one step
2. Pre-stage dataset JSONLs in the Modal Volume (no HF download at runtime)
  • ➕ Faster and more deterministic job startup
  • ➕ Avoids transient HF/network failures during training runs
  • ➖ Requires a separate ingestion job/process and operational coordination
  • ➖ Harder to ensure the volume contents match a specific dataset revision

Recommendation: Current approach is a good first-run setup: the dedicated one-step smoke validates tokenization/labeling/model forward/backward on real private rows with minimal overhead, and the full mode switches to a durable snapshot-based materialization for long training. Consider the model.train --max-steps 1 smoke alternative later if you want the smoke to cover the full Trainer stack, but keeping the lightweight smoke loop is a pragmatic reliability choice for ONERB-5.

Files changed (3) +619 / -0

Enhancement (1) +401 / -0
modal_train.pyIntroduce Modal runner for smoke step and full SFT training +401/-0

Introduce Modal runner for smoke step and full SFT training

• Adds a Modal app that mounts the 'ner-data' volume, pulls a private HF dataset (streaming for smoke; snapshot for full), and runs either a one-step torch smoke or full 'python -m model.train' as a subprocess. Writes checkpoints/metadata to the volume, logs key metrics to W&B, and emits structured 'ONERB5_*' log lines for evidence collection.

scripts/modal_train.py

Tests (1) +114 / -0
test_train_smoke.pyAdd unit smoke test for dataset batching and BIO label alignment +114/-0

Add unit smoke test for dataset batching and BIO label alignment

• Creates a small JSONL fixture and a minimal character tokenizer to validate 'NerDataset' label placement (B/I tags, ignore index on special tokens) and 'ner_collate_fn' batch shapes/meta. Serves as a fast correctness guard for the training input pipeline.

tests/test_train_smoke.py

Documentation (1) +104 / -0
sft-run-1.mdAdd run-1 SFT execution recipe for Modal +104/-0

Add run-1 SFT execution recipe for Modal

• Documents preconditions, smoke/full commands (H100 with A100-80GB fallback), expected log evidence, checkpoint layout, and W&B project settings. Includes dataset size/step/time/cost estimates and a follow-up eval hook expectation.

docs/plan/sft-run-1.md

@qodo-code-review

qodo-code-review Bot commented Jul 2, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)

Grey Divider


Action required

1. Full run timeout too short ✓ Resolved 🐞 Bug ☼ Reliability
Description
run_h100/run_a100 set timeout=2 * HOURS but the documented full run is expected to take
~5.94–7.26 hours on H100, so --mode full will be terminated by Modal before completion. This makes
the primary “full run” workflow unreliable/outage-prone.
Code

scripts/modal_train.py[R342-348]

+@app.function(
+    image=image,
+    gpu="H100",
+    secrets=[modal.Secret.from_name(HF_SECRET_NAME), modal.Secret.from_name(WANDB_SECRET_NAME)],
+    volumes={str(DATA_DIR): data_volume},
+    timeout=2 * HOURS,
+)
Evidence
The plan doc explicitly estimates the full run duration at ~6–7 hours, while the Modal function
timeout is set to 2 hours, ensuring full runs will be killed early.

docs/plan/sft-run-1.md[69-73]
scripts/modal_train.py[342-348]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Modal functions used for both smoke and full training are configured with a 2-hour timeout, but the planned full SFT run is expected to take ~6–7 hours (and potentially longer on A100). This will cause Modal to kill full runs before they finish.

## Issue Context
- The same Modal function handles both `mode=smoke` and `mode=full`.
- The plan doc in this PR estimates 3 epochs at 5.94–7.26 hours on H100.

## Fix Focus Areas
- scripts/modal_train.py[342-368]

### Suggested implementation direction
- Either:
 - Split into separate Modal functions for smoke vs full (e.g., `run_h100_smoke` with a short timeout and `run_h100_full` with a long timeout), and route based on `config.mode`, **or**
 - Increase the existing timeout to safely cover the full run (e.g., 10–12 hours for H100, potentially higher for A100 fallback).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Buffers full stdout in memory ✓ Resolved 🐞 Bug ➹ Performance
Description
_run_full_training appends every subprocess stdout line into a list and only uses the last 40
lines, causing avoidable linear memory growth with log volume during multi-hour training. This can
degrade performance and can crash long/verbose runs due to excessive memory usage.
Code

scripts/modal_train.py[R317-339]

+    proc = subprocess.Popen(
+        cmd,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        env=env,
+        bufsize=1,
+    )
+    lines: list[str] = []
+    assert proc.stdout is not None
+    for line in proc.stdout:
+        print(line, end="", flush=True)
+        lines.append(line)
+    proc.wait()
+    data_volume.commit()
+    return {
+        "mode": "full",
+        "run_name": run_name,
+        "returncode": proc.returncode,
+        "output_dir": str(output_dir),
+        "data": data_info,
+        "tail": "".join(lines[-40:]),
+    }
Evidence
The code currently accumulates all stdout lines in memory (lines.append(line)) during full
training; combined with the plan’s multi-hour runtime, this creates unnecessary and potentially
large memory growth.

scripts/modal_train.py[317-339]
docs/plan/sft-run-1.md[69-73]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_run_full_training` stores all stdout lines from the training subprocess in a Python list even though it only returns a short tail. For long runs this creates linear memory growth with log output.

## Issue Context
- Full runs are expected to last hours per the run plan.
- The code returns `"tail": "".join(lines[-40:])` but still retains the entire `lines` list.

## Fix Focus Areas
- scripts/modal_train.py[317-339]

### Suggested implementation direction
- Replace `lines: list[str]` with `collections.deque(maxlen=N)` (e.g., N=2000) so only a bounded tail is retained.
- Or write logs to a file in `/data/` (Volume) and only keep a small in-memory tail for the return payload.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread scripts/modal_train.py
Comment thread scripts/modal_train.py

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prepares the codebase for the first SFT run on Modal by adding a Modal training entrypoint, a concrete run plan, and a small smoke test to validate NerDataset label alignment and batching behavior.

Changes:

  • Add scripts/modal_train.py Modal entrypoint supporting a one-step smoke run and a full training run with dataset caching to a Modal Volume.
  • Add docs/plan/sft-run-1.md documenting commands, expected logs, dataset scale, and checkpoint layout for “SFT Run 1”.
  • Add tests/test_train_smoke.py smoke test covering dataset bucket indexing, label alignment, and ner_collate_fn batch shapes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
tests/test_train_smoke.py Adds a small fixture-backed smoke test for NerDataset/ner_collate_fn label alignment and batch shapes.
scripts/modal_train.py Introduces Modal job entrypoint for smoke/full training, HF dataset materialization into a Modal Volume, and checkpoint/metadata writing.
docs/plan/sft-run-1.md Documents the operational recipe for the first SFT run (commands, expected evidence, cost/throughput, checkpoints).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/modal_train.py
Comment on lines +342 to +348
@app.function(
image=image,
gpu="H100",
secrets=[modal.Secret.from_name(HF_SECRET_NAME), modal.Secret.from_name(WANDB_SECRET_NAME)],
volumes={str(DATA_DIR): data_volume},
timeout=2 * HOURS,
)
Comment thread scripts/modal_train.py
Comment on lines +356 to +362
@app.function(
image=image,
gpu="A100-80GB",
secrets=[modal.Secret.from_name(HF_SECRET_NAME), modal.Secret.from_name(WANDB_SECRET_NAME)],
volumes={str(DATA_DIR): data_volume},
timeout=2 * HOURS,
)
Comment thread scripts/modal_train.py
Comment on lines +84 to +95
if SMOKE_TRAIN_PATH.exists() and SMOKE_VAL_PATH.exists():
train_rows = sum(1 for _ in SMOKE_TRAIN_PATH.open("rb"))
val_rows = sum(1 for _ in SMOKE_VAL_PATH.open("rb"))
return SMOKE_TRAIN_PATH, SMOKE_VAL_PATH, {
"source": "modal_volume_cache",
"train_rows": train_rows,
"val_rows": val_rows,
}

from datasets import load_dataset

rows_needed = max(config.smoke_records, config.batch_size, 8)
Comment thread scripts/modal_train.py
Comment on lines +125 to +130
if TRAIN_PATH.exists() and VAL_PATH.exists():
return TRAIN_PATH, VAL_PATH, {
"source": "modal_volume_cache",
"train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
"val_rows": sum(1 for _ in VAL_PATH.open("rb")),
}
Comment thread scripts/modal_train.py
Comment on lines +153 to +158
data_volume.commit()
return TRAIN_PATH, VAL_PATH, {
"source": config.dataset_id,
"train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
"val_rows": sum(1 for _ in VAL_PATH.open("rb")),
}
Copilot AI review requested due to automatic review settings July 2, 2026 10:35
@olety olety merged commit c34ab87 into main Jul 2, 2026
2 checks passed
@olety olety deleted the codex/onerb-5 branch July 2, 2026 10:37

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread scripts/modal_train.py
Comment on lines +87 to +98
if SMOKE_TRAIN_PATH.exists() and SMOKE_VAL_PATH.exists():
train_rows = sum(1 for _ in SMOKE_TRAIN_PATH.open("rb"))
val_rows = sum(1 for _ in SMOKE_VAL_PATH.open("rb"))
return SMOKE_TRAIN_PATH, SMOKE_VAL_PATH, {
"source": "modal_volume_cache",
"train_rows": train_rows,
"val_rows": val_rows,
}

from datasets import load_dataset

rows_needed = max(config.smoke_records, config.batch_size, 8)
Comment thread scripts/modal_train.py
Comment on lines +128 to +133
if TRAIN_PATH.exists() and VAL_PATH.exists():
return TRAIN_PATH, VAL_PATH, {
"source": "modal_volume_cache",
"train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
"val_rows": sum(1 for _ in VAL_PATH.open("rb")),
}
Comment thread scripts/modal_train.py
Comment on lines +157 to +161
return TRAIN_PATH, VAL_PATH, {
"source": config.dataset_id,
"train_rows": sum(1 for _ in TRAIN_PATH.open("rb")),
"val_rows": sum(1 for _ in VAL_PATH.open("rb")),
}
Comment thread scripts/modal_train.py
"run_name": run_name,
"step": 1,
"checkpoint_path": str(step_dir),
"wandb_project": f"{config.wandb_entity}/{config.wandb_project}",
@olety

olety commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

ONERB-5 Modal smoke re-run after secrets became available did not reach the smoke step.

Secrets visible by name under profile oneiron-dev: huggingface, wandb.
Clean checkout: origin/main at e0b55ea.

Exact command attempted:

modal run scripts/modal_train.py --mode smoke --gpu H100 --max-steps 1 --batch-size 32 --run-name onerb-5-smoke

Run URL: https://modal.com/apps/oneiron-dev/main/ap-xQHJkO2LwetCfzdfrvj8oA

Result:

ModuleNotFoundError: No module named model

No ONERB5_SMOKE_STEP line was emitted. A module-mode routing attempt with the same args also failed with the same import error: https://modal.com/apps/oneiron-dev/main/ap-LBSFRffv9qrdvo3RUTxHAC

Flag updated: /Users/olety/Desktop/code/fable-queue/oneironer-wave/onerb-5.flag.md.

@olety

olety commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

ONERB-5 Modal smoke evidence (post-merge follow-up): the import blocker was fixed on main in 46d3c89 (PYTHONPATH=/app on the Modal image). 1-step H100 smoke now passes end-to-end.

Run URL: https://modal.com/apps/oneiron-dev/main/ap-h07kyJ1GAAZ3dAa7efND5r

ONERB5_SMOKE_STEP {"batch_shape": [32, 512], "checkpoint_path": "/data/checkpoints/sft-run-1/onerb-5-smoke/step-1", "loss": 4.1000895500183105, "run_name": "onerb-5-smoke", "source": "olety/oneiron-ner-labeled", "train_rows": 64, "val_rows": 8}

W&B run onerb-5-smoke synced to entity oneiron-dev (global_step / peak_vram_mb / train_loss logged). Checkpoint on Modal Volume ner-data. Flag retired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants