feat(llmobs): add datasets and experiments features #13314
base: main
Conversation
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary: The average import time from this PR is 279 ± 4 ms. The average import time from base is 280 ± 2 ms. The import time difference between this PR and base is -0.9 ± 0.1 ms.

Import time breakdown: the following import paths have shrunk:
Benchmarks

Benchmark execution time: 2025-07-01 23:14:14. Comparing candidate commit 1d5f1b7 in PR branch.

Found 1 performance improvement and 0 performance regressions! Performance is the same for 568 metrics, 3 unstable metrics.

scenario: span-add-metrics
if site is None:
    site = os.getenv("DD_SITE")
if site is None:
    raise ConfigurationError(
we say in the docstring that this defaults to "datadoghq.com"
but raise here. I think we should default it and not raise here.
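A minimal sketch of that suggestion, assuming the `"datadoghq.com"` default described in the docstring:

```python
# Sketch only: fall back to the documented default instead of raising.
if site is None:
    site = os.getenv("DD_SITE") or "datadoghq.com"
```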
from .utils._exceptions import DatasetFileError

if TYPE_CHECKING:
    import pandas as pd
I think this could interfere with a user's typechecker if they don't have pandas installed when type checking their own app. We should still guard this with a try
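A hedged sketch of that guard; the `try` only matters for users who type-check their own app without pandas installed:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    try:
        import pandas as pd
    except ImportError:
        pd = None  # pandas is an optional dependency here
```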
        self.add(record)
        return self

    def add(self, record: Dict[str, Union[str, Dict[str, Any]]]) -> None:
defining a record type could drastically improve usability here.
from typing import Dict, TypedDict  # TypedDict is in typing as of Python 3.8

class Record(TypedDict):
    input: str
    expected_output: Dict[str, JsonType]  # JsonType: the library's JSON value alias

for example.
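With a type like that in place, the `add` signature above could become, for instance:

```python
def add(self, record: Record) -> None:
    ...
```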
# Experiments related
EXPECTED_OUTPUT = "_ml_obs.meta.input.expected_output"
EXPERIMENT_INPUT = "_ml_obs.meta.input"
confused as to why we need new fields for input and output here but maybe it will become obvious from further reading
still not obvious after further reading, @gary-huang can we just reuse meta.input.value for the span I/O? Or is there any specialized I/O formatting/structure for experiments spans?
@@ -241,6 +253,12 @@ def _llmobs_tags(span: Span, ml_app: str, session_id: Optional[str] = None) -> L
        "language": "python",
        "error": span.error,
    }

    # Add experiment_id from baggage if present
    experiment_id = span.context.get_baggage_item(EXPERIMENT_ID_BAGGAGE_KEY)
we should just look generally in the context for the experiment ID to follow the same paradigm we do for parent id and mlobs trace id
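A rough sketch of what that could look like; `_get_experiment_id` is a hypothetical helper name, mirroring how parent ID is resolved from the context rather than read from baggage here:

```python
# Hypothetical: resolve the experiment ID from the span context the same
# way parent_id / llmobs trace_id are resolved, instead of reading baggage.
experiment_id = _get_experiment_id(span)
if experiment_id:
    tags["experiment_id"] = experiment_id
```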
let's separate out this UI code and printing logic into an additional PR. There's too much going on here.
docker/.python-version
this should be reverted
we should inline these file contents into the tests
def reset_config_state():
    """Resets global configuration state before each test."""
    # Reset global state variables to defaults
    config._IS_INITIALIZED = False
these values aren't kept in sync with the defaults, definitely want to move away from the globals
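One way to avoid that drift, sketched with pytest's `monkeypatch` (an assumed fixture shape, not the actual test code):

```python
import pytest

@pytest.fixture(autouse=True)
def reset_config_state(monkeypatch):
    # monkeypatch restores the original attribute after each test,
    # so the fixture cannot fall out of sync with the real defaults.
    monkeypatch.setattr(config, "_IS_INITIALIZED", False)
```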
@pytest.fixture(scope="module", autouse=True)
def init_llmobs(experiments_vcr):
    # Use the provided keys directly. VCR filtering handles redaction.
    api_key = DD_API_KEY
definitely don't want to use globals in the tests
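For instance, the key could come from the test environment rather than a module-level constant (a sketch, not the actual fixture):

```python
import os

import pytest

@pytest.fixture(scope="module", autouse=True)
def init_llmobs(experiments_vcr):
    # VCR cassette filtering still redacts the key from recordings.
    api_key = os.environ.get("DD_API_KEY", "<dummy-api-key>")
    ...
```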
this setting seems to have been introduced to make testing easier:
if asbool(os.getenv("DD_EXPERIMENTS_RUNNER_ENABLED")):
    data["_dd.scope"] = "experiments"
Why are we treating experiments spans differently?
it's under a different scope on the track
def process_row(idx_row):
    idx, row = idx_row
    start_time = time.time()
    with LLMObs._experiment(name=self.task.__name__, experiment_id=self._datadog_experiment_id) as span:
Moving this to cover the entire run_task method
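Roughly, that restructuring would look like this (a sketch; the iteration details are assumptions):

```python
def run_task(self):
    # One experiment span covering the whole run, rather than
    # opening a span per row inside process_row.
    with LLMObs._experiment(name=self.task.__name__, experiment_id=self._datadog_experiment_id):
        for idx, row in enumerate(self.dataset):  # hypothetical iteration
            self.process_row((idx, row))
```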
LLMObs.flush()
time.sleep(API_PROCESSING_TIME_SLEEP)
Why sleep and flush here?
flushing to get the remaining items in the writers submitted; the sleep is used to account for backend replication lag (we can print a link to the experiment/dataset immediately, but if the user clicks it right away it may not be ready in the backend yet).
We should do away with the sleep and do something better.
Flush is to ensure spans get submitted if in serverless env (task ends). Fine with force flushing for now, but sleeping is weird
Should replace sleep with more verbose message (your trace/experiment will be available momentarily at ...)
weird question, but this is just for submitting spans, no? Why would we need to print a link to the experiment here? Shouldn't we be printing the link once the evals are finished/submitted?
I randomly stumbled here, but if y'all identify weird bits of logic to accommodate back-end behaviors, let me know and we can improve it. I agree that putting a sleep anywhere is very weird and if it's just to generate a link, we should remove it. These small nice to haves can come later.
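Summing up the thread, one hedged alternative keeps the flush and drops the sleep (`experiment_url` is a hypothetical variable):

```python
LLMObs.flush()  # still force-flush so spans are submitted (e.g. in serverless)
# Replace the replication-lag sleep with a softer message instead:
print(f"Your experiment will be available momentarily at {experiment_url}")
```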
Adds skeleton code for Experiments, experiment tasks, and experiment evaluator classes/decorators. Implementation of experiment `run()` has been left out for a follow-up PR.

Basic structure of Experiments (a usage sketch follows at the end of this description):
- Call LLMObs.experiment_task as a decorator to wrap a task function (must have `input` as an arg)
- Call LLMObs.experiment_evaluator as a decorator to wrap an evaluator function (must have `input`/`output`/`expected_output` as args)
- Create a Dataset
- Create an Experiment(name: str, task, dataset, evaluators, description, config)
- Call experiment.run(...)

Some concerns:
- Should experiment task/evaluator decorators support async/generator methods? Currently (and based on #13314) it only supports sync methods.
- The ExperimentTask wrapper class requires `input` as an arg name, which shadows the Python builtin.

## Checklist
- [x] PR author has checked that all the criteria below are met
  - The PR description includes an overview of the change
  - The PR description articulates the motivation for the change
  - The change includes tests OR the PR description describes a testing strategy
  - The PR description notes risks associated with the change, if any
  - Newly-added code is easy to change
  - The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
  - The change includes or references documentation updates if necessary
  - Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
  - Title is accurate
  - All changes are related to the pull request's stated goal
  - Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
  - Testing strategy adequately addresses listed risks
  - Newly-added code is easy to change
  - Release note makes sense to a user of the library
  - If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  - Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
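For illustration, a hedged usage sketch of the flow described above; every name and signature here is an assumption based on this description, not the final API:

```python
from ddtrace.llmobs import LLMObs

@LLMObs.experiment_task
def task(input):  # must take `input` as an arg
    return run_model(input)  # hypothetical model call

@LLMObs.experiment_evaluator
def exact_match(input, output, expected_output):  # required arg names
    return output == expected_output

dataset = Dataset(...)  # hypothetical constructor
experiment = Experiment(
    name="my-experiment",
    task=task,
    dataset=dataset,
    evaluators=[exact_match],
    description="example run",
    config={},
)
experiment.run()
```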
We need to bring back docker/.python-version
3.12
3.8
3.9
3.10
why are we deleting this? this is going to break things.