-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[OPIK-6599] [SDK] fix: opik migrate fails loud on skipped items #6782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,7 +13,7 @@ | |
| import sys | ||
| import time | ||
| from pathlib import Path | ||
| from typing import Any, Iterator, Optional | ||
| from typing import Any, Dict, Iterator, Optional | ||
|
|
||
| import click | ||
| from rich.console import Console | ||
|
|
@@ -48,6 +48,11 @@ | |
| ) | ||
|
|
||
| console = Console() | ||
| # Dedicated stderr console for the loud-fail path so the SKIP_SUMMARY | ||
| # line lands on stderr without flipping the default console (OPIK-6599). | ||
| # Tests assert against this stream; CI gates can grep stderr without | ||
| # parsing the audit JSON. | ||
| _stderr_console = Console(stderr=True) | ||
|
|
||
| MIGRATE_CONTEXT_SETTINGS = {"help_option_names": ["-h", "--help"]} | ||
|
|
||
|
|
@@ -185,6 +190,91 @@ def _finalize_and_fail( | |
| sys.exit(1) | ||
|
|
||
|
|
||
| def _finalize_with_skips_or_ok( | ||
| audit: AuditLog, | ||
| audit_path: Path, | ||
| name: str, | ||
| target_label: str, | ||
| target_project: str, | ||
| elapsed_seconds: float, | ||
| ) -> None: | ||
| """Finalize the audit log, then either fail loud on skips or print the | ||
| happy-path message. | ||
|
|
||
| Per OPIK-6599: when the cascade emits any ``skip`` audit record, the | ||
| migrate is "succeeded but lossy" — the destination state has partial | ||
| data and is **not** rolled back. We finalize the audit to ``failed``, | ||
| print a SKIP_SUMMARY line to **stderr** so CI pipelines can grep | ||
| without parsing the JSON, and exit non-zero. Operators rely on the | ||
| audit log to know what made it across. | ||
| """ | ||
| skip_records = [ | ||
| action for action in audit.actions if action.get("status") == "skipped" | ||
| ] | ||
| if not skip_records: | ||
| audit.finalize("ok") | ||
| audit.write(audit_path) | ||
| elapsed = _format_elapsed(elapsed_seconds) | ||
| console.print( | ||
| f"[green]Migrated '{name}' into project '{target_project}' as " | ||
| f"'{target_label}'.[/green] Took {elapsed}. Audit log: {audit_path}" | ||
| ) | ||
| return | ||
|
Comment on lines
+211
to
+222
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. SKIP_SUMMARY reads Finding type: Want Baz to fix this for you? Activate Fixer Other fix methodsPrompt for AI Agents
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Verified this against the audit-record shape and the bot is mistaken here. The existing
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the correction — you're right, I misread the audit record shape here. I'll save this to memory once the PR is merged. |
||
|
|
||
| # Aggregate counts by reason for the SKIP_SUMMARY line. The cascade | ||
| # summary record carries the totals too, but reading from skip records | ||
| # directly keeps the message decoupled from the summary record's shape. | ||
| totals: Dict[str, int] = { | ||
| "experiments_skipped": 0, | ||
| "items_skipped_missing_trace": 0, | ||
| "items_skipped_missing_item": 0, | ||
| } | ||
| reason_to_total_key = { | ||
| "experiment_recreate_returned_false": "experiments_skipped", | ||
| "items_missing_trace_remap": "items_skipped_missing_trace", | ||
| "items_missing_dataset_item_remap": "items_skipped_missing_item", | ||
| } | ||
|
Comment on lines
+233
to
+236
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The skip-aggregation loop reads Finding type: Want Baz to fix this for you? Activate Fixer Other fix methodsPrompt for AI Agents
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same as my reply on the sibling thread above — false positive.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the correction — that makes sense. I’ll save this to memory once the PR is merged. |
||
| for record in skip_records: | ||
| total_key = reason_to_total_key.get(record.get("reason", "")) | ||
| if total_key is None: | ||
| continue | ||
| totals[total_key] += int(record.get("count", 0)) | ||
|
|
||
| total_skipped = sum(totals.values()) | ||
| audit.finalize("failed") | ||
| audit.write(audit_path) | ||
|
|
||
| elapsed = _format_elapsed(elapsed_seconds) | ||
| _stderr_console.print( | ||
| f"[red]opik migrate: {total_skipped} item{'s' if total_skipped != 1 else ''} " | ||
| f"skipped — destination state was NOT rolled back; see audit log: " | ||
| f"{audit_path}[/red]" | ||
| ) | ||
| _stderr_console.print( | ||
| f"SKIP_SUMMARY: " | ||
| f"experiments_skipped={totals['experiments_skipped']} " | ||
| f"items_skipped_missing_trace={totals['items_skipped_missing_trace']} " | ||
| f"items_skipped_missing_item={totals['items_skipped_missing_item']}" | ||
| ) | ||
| # High-level rollback hint. We deliberately don't ship a step-by-step | ||
| # CLI playbook here -- the audit log is the source of truth for what | ||
| # was actually created. Each ``ok`` action in ``audit.actions`` carries | ||
| # the destination entity id; an operator can grep the audit JSON to | ||
| # see exactly what landed in the destination project before deciding | ||
| # what to delete. Auto-rollback (a one-flag clean reverse) is tracked | ||
| # as a follow-up; this PR is the loud-fail mechanic only. | ||
| _stderr_console.print( | ||
| f"[yellow]To roll back manually: in project " | ||
| f"'{target_project}', delete the destination dataset " | ||
| f"'{target_label}' along with any experiments, optimizations, " | ||
| f"traces, and spans that were cascaded into it (the audit log " | ||
|
Comment on lines
+259
to
+270
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This rollback hint is printed via Finding type: Want Baz to fix this for you? Activate Fixer Other fix methodsPrompt for AI Agents
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Skipping — the rollback hint deliberately lives on stderr with the rest of the failure block:
These four lines are one cohesive failure summary. Splitting just line 3 off to stdout would scatter the failure narrative across two streams. CI that captures both streams via 🤖 Reply posted via /address-github-pr-comments
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, that makes sense — keeping the rollback hint on stderr preserves the single failure-summary block here. I’ll save this to memory once the PR is merged. |
||
| f"lists each created entity id); then rename the source " | ||
| f"'{name}_v1' back to '{name}'.[/yellow]" | ||
| ) | ||
| _stderr_console.print(f"[dim](after {elapsed})[/dim]") | ||
| sys.exit(1) | ||
|
|
||
|
|
||
| @migrate_group.command(name="dataset") | ||
| @click.argument("name", type=str) | ||
| @click.option( | ||
|
|
@@ -289,12 +379,13 @@ def migrate_dataset_command( | |
| elapsed_seconds=time.monotonic() - started_at, | ||
| ) | ||
|
|
||
| audit.finalize("ok") | ||
| audit.write(audit_path) | ||
| elapsed = _format_elapsed(time.monotonic() - started_at) | ||
| console.print( | ||
| f"[green]Migrated '{name}' into project '{to_project}' as '{plan.target_name}'.[/green] " | ||
| f"Took {elapsed}. Audit log: {audit_path}" | ||
| _finalize_with_skips_or_ok( | ||
| audit, | ||
| audit_path, | ||
| name=name, | ||
| target_label=plan.target_name, | ||
| target_project=to_project, | ||
| elapsed_seconds=time.monotonic() - started_at, | ||
| ) | ||
|
|
||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing_trace_ids/missing_item_idsgrow unbounded before being trimmed in_record_skip— should we cap them at_SKIP_SAMPLE_LIMITduring collection? Also, should we add a test for the missing-trace negative path asserting askipaudit withreason == "items_missing_trace_remap"and the expected count/sample?Finding types:
Reliable test patternsBatch and stream| Severity: 🟢 LowWant Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit e624694 addressed this comment by capping the missing-trace/item sample lists as they are collected and passing the total counts to the new
_record_skiphelper so the audit record keeps accurate counts with a bounded sample. It also emits explicitskipaudit records for the missing-trace and missing-item reasons, including the newly added_SKIP_SAMPLE_LIMIT. The accompanying tests now cover the missing-trace negative path, asserting the audit reason, count, and sampled source IDs are as expected.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on the unbounded growth. Fixed in e624694 —
missing_trace_sample/missing_item_sampleare now capped at_SKIP_SAMPLE_LIMITduring collection (the appends are guarded bylen(...) < _SKIP_SAMPLE_LIMIT).countcontinues to come from the always-fully-incrementedresult.items_skipped_*counters, so the audit record still carries the true total even when the sample is capped.Also added the missing-trace negative-path test you asked for:
TestSkipAuditRecords.test_missing_trace_remap__emits_skip_record. It monkey-patches_copy_traces_and_spansso thetrace_id_remapis deliberately incomplete after the trace-copy phase (mirroring a production BE quirk where a trace gets dropped mid-cascade), then asserts the cascade emits exactly oneskiprecord withreason == "items_missing_trace_remap", the right count, and the orphan trace_id insample_source_ids.