chore(llmobs): support running experiment evals #13994

Yun-Kim · 2025-07-14T23:47:48Z

MLOB-3269

Adds support for running experiment evaluator functions and for merging task and evaluator results into a single result object.

Writing result objects to Datadog will come in a future PR.
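For readers skimming the description, here is a rough sketch of what "run evaluators, then merge task and evaluator results" could look like. It is not the code in `ddtrace/llmobs/_experiment.py`: the function names (`run_evaluators`, `merge_results`), the evaluator signature, and the record and result dictionary shapes are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical shapes, assumed for this sketch only.
Record = Dict[str, Any]  # e.g. {"input_data": ..., "expected_output": ...}
Evaluator = Callable[[Any, Any, Any], Any]  # (input_data, output_data, expected_output)


def run_evaluators(
    records: List[Record], task_outputs: List[Any], evaluators: List[Evaluator]
) -> List[Dict[str, Any]]:
    """Call every evaluator on every (record, task output) pair."""
    results = []
    for idx, (record, output) in enumerate(zip(records, task_outputs)):
        evaluations: Dict[str, Dict[str, Any]] = {}
        for evaluator in evaluators:
            try:
                value = evaluator(record["input_data"], output, record.get("expected_output"))
                evaluations[evaluator.__name__] = {"value": value, "error": None}
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                evaluations[evaluator.__name__] = {"value": None, "error": repr(exc)}
        results.append({"idx": idx, "evaluations": evaluations})
    return results


def merge_results(
    records: List[Record], task_outputs: List[Any], eval_results: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Combine each record, its task output, and its evaluations into one row."""
    merged = []
    for record, output, evals in zip(records, task_outputs, eval_results):
        merged.append(
            {
                "idx": evals["idx"],
                "input": record["input_data"],
                "expected_output": record.get("expected_output"),
                "output": output,
                "evaluations": evals["evaluations"],
            }
        )
    return merged


def exact_match(input_data: Any, output_data: Any, expected_output: Any) -> bool:
    """Example evaluator: does the task output equal the expected output?"""
    return output_data == expected_output


def always_fails(input_data: Any, output_data: Any, expected_output: Any) -> float:
    """Example evaluator that raises, to show error capture."""
    return 1 / 0


records = [
    {"input_data": {"question": "2+2"}, "expected_output": "4"},
    {"input_data": {"question": "3+3"}, "expected_output": "6"},
]
task_outputs = ["4", "7"]  # pretend these came from the task run

eval_results = run_evaluators(records, task_outputs, [exact_match, always_fails])
merged = merge_results(records, task_outputs, eval_results)
```

Keeping evaluator errors per row, rather than raising, is a design choice of this sketch; the PR may handle failures differently.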

Checklist

PR author has checked that all the criteria below are met
The PR description includes an overview of the change
The PR description articulates the motivation for the change
The change includes tests OR the PR description describes a testing strategy
The PR description notes risks associated with the change, if any
Newly-added code is easy to change
The change follows the library release note guidelines
The change includes or references documentation updates if necessary
Backport labels are set (if applicable)

Reviewer Checklist

Reviewer has checked that all the criteria below are met
Title is accurate
All changes are related to the pull request's stated goal
Avoids breaking API changes
Testing strategy adequately addresses listed risks
Newly-added code is easy to change
Release note makes sense to a user of the library
If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
Backport labels are set in a manner that is consistent with the release branch maintenance policy

github-actions · 2025-07-14T23:48:20Z

CODEOWNERS have been resolved as:

ddtrace/llmobs/_constants.py                                            @DataDog/ml-observability
ddtrace/llmobs/_experiment.py                                           @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_writer.py                                               @DataDog/ml-observability

github-actions · 2025-07-15T00:09:57Z

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 284 ± 4 ms.

The average import time from base is: 288 ± 4 ms.

The import time difference between this PR and base is: -3.3 ± 0.2 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 2.114 ms (0.74%)

ddtrace.bootstrap.sitecustomize 1.429 ms (0.50%)

ddtrace.bootstrap.preload 1.429 ms (0.50%)

ddtrace.internal.remoteconfig.client 0.682 ms (0.24%)

ddtrace 0.685 ms (0.24%)

ddtrace.internal._unpatched 0.032 ms (0.01%)

json 0.032 ms (0.01%)

json.decoder 0.032 ms (0.01%)

re 0.032 ms (0.01%)

enum 0.032 ms (0.01%)

types 0.032 ms (0.01%)

pr-commenter · 2025-07-15T00:34:49Z

Benchmarks

Benchmark execution time: 2025-07-15 02:19:22

Comparing candidate commit 25cf67f in PR branch yunkim/dne-run-evals with baseline commit c70bb0d in branch yunkim/dne-run-tasks.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 548 metrics, 2 unstable metrics.

Yun-Kim added 2 commits July 14, 2025 19:29

WIP run tasks, experiment tracing

a8db6bd

Make DatasetRecords input_data type more strict, add tests

6937e9c

Yun-Kim added the changelog/no-changelog label (A changelog entry is not required for this PR.) Jul 14, 2025

Refactor, typing

c70bb0d

Yun-Kim changed the base branch from main to yunkim/dne-run-tasks July 15, 2025 00:57

Implement evals and merging results

269dc00

Yun-Kim force-pushed the yunkim/dne-run-evals branch from cbf3ce6 to 269dc00 Compare July 15, 2025 01:00

Typing

25cf67f

Base automatically changed from yunkim/dne-run-tasks to main July 15, 2025 04:33
