Add MLflow tracking hook #3433

Merged
jjallaire merged 1 commit into UKGovernmentBEIS:main from debu-sinha:feature/mlflow-tracking-hook on Mar 7, 2026
debu-sinha (Contributor) commented Mar 7, 2026

Resolves #3417

Adds an MLflow tracking hook that logs Inspect evaluations as MLflow experiments. Follows the same examples/hooks/ pattern as the existing W&B/Weave hook.

Run hierarchy:

  • Parent MLflow run per eval() call
  • Nested child run per task
  • Step metrics per sample (scores, timing)
  • Real-time event metrics via on_sample_event (model calls, tool usage)
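
The hierarchy above could be sketched roughly as follows. This is not the code from the PR; it is a minimal illustration that assumes the standard MLflow fluent API (`start_run` with `nested=True`, `log_metric` with `step`). The function name `log_eval_hierarchy` and the shape of its input are hypothetical. The `ml` argument is expected to be the `mlflow` module, or any object exposing the same two methods, so the sketch can be read and exercised without a tracking server.

```python
from typing import Any, Mapping, Sequence


def log_eval_hierarchy(ml: Any, tasks: Mapping[str, Sequence[float]]) -> None:
    """Log one parent run with a nested child run per task.

    `ml` is assumed to be the `mlflow` module (hypothetical wiring; a real
    hook would keep run state across separate hook callbacks instead).
    `tasks` maps a task name to its per-sample scores.
    """
    with ml.start_run(run_name="inspect_eval"):  # parent run per eval() call
        for task_name, scores in tasks.items():
            # nested=True attaches the child run to the active parent run
            with ml.start_run(run_name=task_name, nested=True):
                for step, score in enumerate(scores):
                    # one metric point per sample, indexed by step
                    ml.log_metric("score", score, step=step)
```

With MLflow installed, a call such as `log_eval_hierarchy(mlflow, {"math_arithmetic": [1.0, 0.0]})` would be expected to produce one parent run containing one child run with two metric points.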

Usage

Set MLFLOW_TRACKING_URI and import the hook:

import os
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"

from examples.hooks.mlflow_tracking import MlflowTrackingHooks  # noqa: F401

from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

task = Task(
    dataset=[
        Sample(input="What is 2+2? Reply with just the number.", target="4"),
        Sample(input="What is 7*8? Reply with just the number.", target="56"),
    ],
    solver=generate(),
    scorer=match(),
    name="math_arithmetic",
)

logs = eval(task, model="openai/gpt-4o-mini")

The hook activates automatically when MLFLOW_TRACKING_URI is set. The optional MLFLOW_EXPERIMENT_NAME variable sets the MLflow experiment name and defaults to "inspect_ai".
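
The opt-in check described above might look something like the sketch below. The function names are hypothetical and not taken from the PR, and treating an empty or whitespace-only URI as "not set" is an assumption. Both helpers accept an explicit mapping so they can be tried without touching the real environment.

```python
import os
from typing import Mapping, Optional

DEFAULT_EXPERIMENT = "inspect_ai"  # default named in the PR description


def tracking_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when MLFLOW_TRACKING_URI is set and non-empty.

    Assumption: a blank value counts as unset.
    """
    env = os.environ if env is None else env
    return bool(env.get("MLFLOW_TRACKING_URI", "").strip())


def experiment_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return MLFLOW_EXPERIMENT_NAME, falling back to the default."""
    env = os.environ if env is None else env
    return env.get("MLFLOW_EXPERIMENT_NAME") or DEFAULT_EXPERIMENT
```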

What gets logged

Inspect event      MLflow action
---------------    -------------
on_run_start       Create parent run with eval metadata tags
on_task_start      Create nested child run, log task params (model, dataset, solver, scorer)
on_sample_end      Log per-sample scores as step metrics
on_sample_event    Log real-time model call tokens/timing and tool call details as step metrics
on_model_usage     Accumulate token usage across model calls
on_task_end        Log aggregate scores, total token usage, event counts; close child run
on_run_end         Close parent run with FINISHED/FAILED status
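
Logging per-sample scores as MLflow metrics requires numeric values, while Inspect score values may be letters such as "C" or "I", booleans, or numbers. The test list below names a score-to-numeric conversion, but the PR description does not show it, so the helper here is a hypothetical sketch. The letter mapping is an assumption modeled on Inspect's usual correct / incorrect / partial / no-answer conventions.

```python
from typing import Optional, Union

# Assumed mapping for Inspect's letter grades (not taken from the PR).
LETTER_VALUES = {"C": 1.0, "I": 0.0, "P": 0.5, "N": 0.0}


def score_to_numeric(value: Union[str, bool, int, float, None]) -> Optional[float]:
    """Convert a score value to a float, or None if it cannot be logged."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip()
        if text.upper() in LETTER_VALUES:
            return LETTER_VALUES[text.upper()]
        try:
            return float(text)  # numeric strings such as "0.75"
        except ValueError:
            return None
    return None
```

Returning `None` for unconvertible values lets the caller skip the metric instead of failing the whole evaluation, which seems a reasonable choice for a tracking side channel.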

The on_sample_event integration tracks ModelEvent (input/output tokens, call duration) and ToolEvent (function name, error flag, duration) as they happen during evaluation, giving step-by-step visibility into model and tool behavior in the MLflow UI.
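
As an illustration of how such events might be flattened into step metrics, consider the sketch below. The `ModelCall` and `ToolCall` dataclasses are simplified stand-ins, not Inspect's actual `ModelEvent` and `ToolEvent` classes, and the field names and metric keys are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class ModelCall:  # hypothetical stand-in for a model event
    input_tokens: int
    output_tokens: int
    duration_s: float


@dataclass
class ToolCall:  # hypothetical stand-in for a tool event
    function: str
    failed: bool
    duration_s: float


def event_metrics(event: Union[ModelCall, ToolCall]) -> Dict[str, float]:
    """Flatten an event into metric-name -> value pairs for one step."""
    if isinstance(event, ModelCall):
        return {
            "model/input_tokens": float(event.input_tokens),
            "model/output_tokens": float(event.output_tokens),
            "model/duration_s": event.duration_s,
        }
    # MLflow metrics are numeric, so the error flag becomes 0.0 or 1.0 and
    # the function name goes into the metric key rather than the value.
    return {
        f"tool/{event.function}/error": 1.0 if event.failed else 0.0,
        f"tool/{event.function}/duration_s": event.duration_s,
    }
```

Each resulting dictionary could then be passed to `mlflow.log_metrics(metrics, step=n)` with an incrementing step counter, which is what produces per-call charts in the MLflow UI.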

Screenshots

Tested locally against a real MLflow server with openai/gpt-4o-mini (2 tasks, 8 samples).

Task run overview showing metrics (scores, token usage, event counts) and tags (task name, model, dataset size):

[Screenshot: 02-math-run-overview]

Model metrics tab showing real-time step charts from on_sample_event (input/output tokens per model call, call duration):

[Screenshot: 04-math-model-metrics]

Parent run with nested child runs per task:

[Screenshot: 08-parent-run]

Testing

12 unit tests covering the full lifecycle, event handling, score conversion, and edge cases:

PASSED test_enabled_requires_tracking_uri
PASSED test_run_lifecycle
PASSED test_run_end_with_exception
PASSED test_task_lifecycle
PASSED test_sample_scores_logged_as_step_metrics
PASSED test_model_usage_accumulation
PASSED test_sample_without_active_task_is_ignored
PASSED test_sample_event_model_call
PASSED test_sample_event_tool_call
PASSED test_sample_event_without_active_task_is_ignored
PASSED test_event_counts_logged_on_task_end
PASSED test_score_to_numeric_conversion

All 27 existing hooks tests continue to pass.

Adds an MLflow tracking hook that logs Inspect evaluation runs as
MLflow experiments. Creates a parent run per eval with nested child
runs per task, logging parameters, scores, token usage, and timing.

Integrates on_sample_event for real-time model call and tool usage
tracking as step metrics during evaluation.

Follows the same examples/hooks pattern as the existing W&B/Weave hook.
Opt-in via MLFLOW_TRACKING_URI environment variable.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

Development

Successfully merging this pull request may close these issues.

MLflow integration hook for experiment tracking

2 participants