Add MLflow tracking hook by debu-sinha · Pull Request #3433 · UKGovernmentBEIS/inspect_ai

debu-sinha · 2026-03-07T07:22:29Z

Resolves #3417

Adds an MLflow tracking hook that logs Inspect evaluations as MLflow experiments. Follows the same examples/hooks/ pattern as the existing W&B/Weave hook.

Run hierarchy:

- Parent MLflow run per eval() call
- Nested child run per task
- Step metrics per sample (scores, timing)
- Real-time event metrics via on_sample_event (model calls, tool usage)
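
For reviewers less familiar with MLflow's nested runs, the hierarchy above could be sketched roughly like this. This is not the hook's actual code: `log_eval` and `RecordingTracker` are hypothetical names, and the metric key `sample_score` is an assumption. The two calls used, `start_run(run_name=..., nested=True)` and `log_metric(key, value, step=...)`, mirror the module-level functions of the same names in `mlflow`, so the stand-in lets the sketch run without a tracking server.

```python
from contextlib import contextmanager


class RecordingTracker:
    """Hypothetical in-memory stand-in exposing the two mlflow calls used below."""

    def __init__(self):
        self.calls = []
        self._depth = 0

    @contextmanager
    def start_run(self, run_name=None, nested=False):
        # Record the nesting depth so the parent/child structure is visible.
        self.calls.append(("start_run", run_name, nested, self._depth))
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1
            self.calls.append(("end_run", run_name))

    def log_metric(self, key, value, step=None):
        self.calls.append(("log_metric", key, value, step))


def log_eval(tracker, eval_name, tasks):
    """Log one parent run per eval and one nested child run per task.

    `tasks` maps a task name to a list of per-sample scores (floats).
    """
    with tracker.start_run(run_name=eval_name):
        for task_name, sample_scores in tasks.items():
            with tracker.start_run(run_name=task_name, nested=True):
                # One step per sample, so scores can be charted over the task.
                for step, score in enumerate(sample_scores):
                    tracker.log_metric("sample_score", score, step=step)


tracker = RecordingTracker()
log_eval(tracker, "eval", {"math_arithmetic": [1.0, 0.0]})
```

Passing the real `mlflow` module as `tracker` should produce the same parent/child layout against a live server, assuming `MLFLOW_TRACKING_URI` points at one.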

Usage

Set MLFLOW_TRACKING_URI and import the hook:

import os
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"

from examples.hooks.mlflow_tracking import MlflowTrackingHooks  # noqa: F401

from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

task = Task(
    dataset=[
        Sample(input="What is 2+2? Reply with just the number.", target="4"),
        Sample(input="What is 7*8? Reply with just the number.", target="56"),
    ],
    solver=generate(),
    scorer=match(),
    name="math_arithmetic",
)

logs = eval(task, model="openai/gpt-4o-mini")

The hook activates automatically when MLFLOW_TRACKING_URI is set. The optional MLFLOW_EXPERIMENT_NAME variable sets the experiment name and defaults to "inspect_ai".
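
The opt-in check described here might look something like the sketch below. The helper names `mlflow_enabled` and `experiment_name` are hypothetical, and treating an empty or whitespace-only URI as "not set" is an assumption rather than confirmed behavior of the hook.

```python
import os

DEFAULT_EXPERIMENT = "inspect_ai"


def mlflow_enabled(env=os.environ):
    """Return True only when a non-empty tracking URI is configured.

    Assumption: an empty or whitespace-only value counts as unset.
    """
    return bool(env.get("MLFLOW_TRACKING_URI", "").strip())


def experiment_name(env=os.environ):
    """Return the configured experiment name, falling back to the default."""
    return env.get("MLFLOW_EXPERIMENT_NAME") or DEFAULT_EXPERIMENT
```

Taking the environment as a parameter keeps the check easy to unit test without mutating `os.environ`.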

What gets logged

| Inspect event | MLflow action |
| --- | --- |
| `on_run_start` | Create parent run with eval metadata tags |
| `on_task_start` | Create nested child run, log task params (model, dataset, solver, scorer) |
| `on_sample_end` | Log per-sample scores as step metrics |
| `on_sample_event` | Log real-time model call tokens/timing and tool call details as step metrics |
| `on_model_usage` | Accumulate token usage across model calls |
| `on_task_end` | Log aggregate scores, total token usage, event counts; close child run |
| `on_run_end` | Close parent run with FINISHED/FAILED status |
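
The accumulate-then-log pattern behind `on_model_usage` and `on_task_end` could be sketched as follows. `Usage` and `UsageAccumulator` are hypothetical stand-ins, and the metric names are assumptions; the hook's real data structures may differ.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for the token usage of one model call."""

    input_tokens: int = 0
    output_tokens: int = 0


class UsageAccumulator:
    """Sum token usage across model calls until the task ends."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, usage: Usage) -> None:
        # Called once per model call (the on_model_usage step in the table).
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    def as_metrics(self) -> dict:
        # Totals that would be logged once at task end.
        return {
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "total_tokens": self.input_tokens + self.output_tokens,
        }


acc = UsageAccumulator()
acc.add(Usage(input_tokens=10, output_tokens=2))
acc.add(Usage(input_tokens=5, output_tokens=1))
totals = acc.as_metrics()
```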

The on_sample_event integration tracks ModelEvent (input/output tokens, call duration) and ToolEvent (function name, error flag, duration) as they happen during evaluation, giving step-by-step visibility into model and tool behavior in the MLflow UI.
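
As an illustration of how events might be flattened into step metrics, here is a minimal sketch. `ModelCall` and `ToolCall` are simplified, hypothetical stand-ins for Inspect's ModelEvent and ToolEvent, and the metric names are assumptions. Since MLflow metrics are numeric, the sketch encodes the error flag as 0.0/1.0 and leaves the tool's function name out of the metric values.

```python
from dataclasses import dataclass


@dataclass
class ModelCall:
    """Hypothetical stand-in for ModelEvent."""

    input_tokens: int
    output_tokens: int
    duration_s: float


@dataclass
class ToolCall:
    """Hypothetical stand-in for ToolEvent."""

    function: str
    error: bool
    duration_s: float


def event_metrics(event) -> dict:
    """Convert one event into numeric metrics; unknown events yield {}."""
    if isinstance(event, ModelCall):
        return {
            "model_input_tokens": float(event.input_tokens),
            "model_output_tokens": float(event.output_tokens),
            "model_call_duration": event.duration_s,
        }
    if isinstance(event, ToolCall):
        return {
            # Metrics must be numeric, so the error flag becomes 0.0 or 1.0.
            "tool_call_error": 1.0 if event.error else 0.0,
            "tool_call_duration": event.duration_s,
        }
    return {}
```

Each returned dict would then be logged with an incrementing step so the MLflow UI can chart calls in order.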

Screenshots

Tested locally against a real MLflow server with openai/gpt-4o-mini (2 tasks, 8 samples).

[Screenshot: task run overview showing metrics (scores, token usage, event counts) and tags (task name, model, dataset size)]

[Screenshot: model metrics tab showing real-time step charts from on_sample_event (input/output tokens per model call, call duration)]

[Screenshot: parent run with nested child runs per task]

Testing

12 unit tests covering the full lifecycle, event handling, score conversion, and edge cases:

PASSED test_enabled_requires_tracking_uri
PASSED test_run_lifecycle
PASSED test_run_end_with_exception
PASSED test_task_lifecycle
PASSED test_sample_scores_logged_as_step_metrics
PASSED test_model_usage_accumulation
PASSED test_sample_without_active_task_is_ignored
PASSED test_sample_event_model_call
PASSED test_sample_event_tool_call
PASSED test_sample_event_without_active_task_is_ignored
PASSED test_event_counts_logged_on_task_end
PASSED test_score_to_numeric_conversion

All 27 existing hooks tests continue to pass.
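
For context on what test_score_to_numeric_conversion exercises: MLflow metrics must be numeric, while Inspect scorers such as match() produce letter grades. A conversion along these lines is one plausible approach. The function below is a hypothetical sketch, not the hook's implementation, and returning None for unconvertible values (so they can be skipped) is an assumption.

```python
from typing import Optional

# Letter grades used by Inspect scorers: C(orrect), I(ncorrect), P(artial), N(o answer).
_LETTER_VALUES = {"C": 1.0, "I": 0.0, "P": 0.5, "N": 0.0}


def score_to_numeric(value) -> Optional[float]:
    """Map a score value to a float, or None if it cannot be logged as a metric."""
    # bool is a subclass of int, so it has to be checked first.
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        if value in _LETTER_VALUES:
            return _LETTER_VALUES[value]
        try:
            return float(value)
        except ValueError:
            return None
    # Lists, dicts, and other structured values are skipped in this sketch.
    return None
```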

Adds an MLflow tracking hook that logs Inspect evaluation runs as MLflow experiments. Creates a parent run per eval with nested child runs per task, logging parameters, scores, token usage, and timing. Integrates on_sample_event for real-time model call and tool usage tracking as step metrics during evaluation. Follows the same examples/hooks pattern as the existing W&B/Weave hook. Opt-in via MLFLOW_TRACKING_URI environment variable.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

jjallaire merged commit adbb77e into UKGovernmentBEIS:main Mar 7, 2026
7 of 14 checks passed

debu-sinha mentioned this pull request Mar 23, 2026

Add blog: Tracking and Debugging AI Safety Evaluations with Inspect AI and MLflow mlflow/mlflow-website#533

Merged

jjallaire merged 1 commit into UKGovernmentBEIS:main from debu-sinha:feature/mlflow-tracking-hook
