feat(llmobs): add datasets and experiments features #13314


Open: jjxct wants to merge 12 commits into main from llm-experiments-rebase

Conversation

@jjxct (Contributor) commented May 1, 2025

[WIP] Add proper description

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@jjxct requested a review from a team as a code owner on May 1, 2025 18:21

github-actions bot commented May 1, 2025

CODEOWNERS have been resolved as:

ddtrace/llmobs/experimentation/__init__.py                              @DataDog/ml-observability
ddtrace/llmobs/experimentation/_config.py                               @DataDog/ml-observability
ddtrace/llmobs/experimentation/_dataset.py                              @DataDog/ml-observability
ddtrace/llmobs/experimentation/_decorators.py                           @DataDog/ml-observability
ddtrace/llmobs/experimentation/_experiment.py                           @DataDog/ml-observability
ddtrace/llmobs/experimentation/utils/_exceptions.py                     @DataDog/ml-observability
ddtrace/llmobs/experimentation/utils/_http.py                           @DataDog/ml-observability
ddtrace/llmobs/experimentation/utils/_ui.py                             @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_init_implicit_pull.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_init_implicit_pull_existing.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_init_implicit_pull_nonexistent.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_meals_workouts_setup.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_empty.yaml         @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_for_repr_test.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_for_repr_unsynced_test.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_for_sync_tests.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_latest.yaml        @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_latest_meals_workouts.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_meals_workouts.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_nonexistent.yaml   @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_nonexistent_version.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_nonexistent_version_meals_workouts.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_specific_version.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_pull_specific_version_meals_workouts.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_collision_new_version.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_collision_no_flags.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_collision_overwrite.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_collision_setup.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_large_chunking.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_new.yaml           @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_synced_adds.yaml   @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_synced_deletes.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_synced_mixed.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_synced_new_version.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_synced_no_change.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_synced_overwrite.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_dataset_push_synced_updates.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_push_summary_metric_boolean.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_push_summary_metric_categorical.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_push_summary_metric_numeric.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_repr_full_run.yaml   @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_results_repr_summaries.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_evals_error_no_raise.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_evals_override.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_evals_success.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_full_eval_error.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_full_success.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_full_task_error.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_summary_metrics.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_summary_metrics_push_fail.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_task_error_no_raise.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_task_error_raise.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_task_sample.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_task_setup.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_run_task_with_config.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_setup_pull_dataset.yaml  @DataDog/ml-observability
tests/llmobs/experiments_cassettes/test_experiment_setup_push_dataset.yaml  @DataDog/ml-observability
tests/llmobs/experiments_files/delimiter.tsv                            @DataDog/ml-observability
tests/llmobs/experiments_files/empty.csv                                @DataDog/ml-observability
tests/llmobs/experiments_files/header_only.csv                          @DataDog/ml-observability
tests/llmobs/experiments_files/malformed.csv                            @DataDog/ml-observability
tests/llmobs/experiments_files/multi.csv                                @DataDog/ml-observability
tests/llmobs/experiments_files/simple.csv                               @DataDog/ml-observability
tests/llmobs/test_experimentation_config.py                             @DataDog/ml-observability
tests/llmobs/test_experimentation_dataset.py                            @DataDog/ml-observability
tests/llmobs/test_experimentation_decorators.py                         @DataDog/ml-observability
tests/llmobs/test_experimentation_experiment.py                         @DataDog/ml-observability
ddtrace/contrib/internal/openai/_endpoint_hooks.py                      @DataDog/ml-observability
ddtrace/llmobs/_constants.py                                            @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_utils.py                                                @DataDog/ml-observability
ddtrace/llmobs/_writer.py                                               @DataDog/ml-observability
tests/llmobs/test_utils.py                                              @DataDog/ml-observability

github-actions bot commented May 1, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 235 ± 3 ms.

The average import time from base is: 235 ± 3 ms.

The import time difference between this PR and base is: -0.5 ± 0.1 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.776 ms (0.76%)
ddtrace.bootstrap.sitecustomize 1.103 ms (0.47%)
ddtrace.bootstrap.preload 1.103 ms (0.47%)
ddtrace.internal.remoteconfig.client 0.555 ms (0.24%)
ddtrace 0.674 ms (0.29%)

pr-commenter bot commented May 2, 2025

Benchmarks

Benchmark execution time: 2025-05-19 20:06:37

Comparing candidate commit a648fe5 in PR branch llm-experiments-rebase with baseline commit bc56c53 in branch main.

Found 1 performance improvement and 0 performance regressions. Performance is the same for 507 metrics; 4 metrics are unstable.

scenario:telemetryaddmetric-flush-1000-metrics

  • 🟩 execution_time [-210.109µs; -174.787µs] or [-8.716%; -7.251%]

@jjxct requested a review from a team as a code owner on May 19, 2025 19:16
@jjxct requested a review from erikayasuda on May 19, 2025 19:16
if site is None:
    site = os.getenv("DD_SITE")
if site is None:
    raise ConfigurationError(

Member commented:

we say in the docstring that this defaults to "datadoghq.com" but raise here. I think we should default it and not raise here.
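
A minimal sketch of the defaulting behavior being suggested; resolve_site and DEFAULT_SITE are illustrative names, not from the PR:

import os
from typing import Optional

DEFAULT_SITE = "datadoghq.com"  # the default the docstring promises

def resolve_site(site: Optional[str] = None) -> str:
    # Fall back to DD_SITE, then to the documented default, instead of raising.
    return site or os.getenv("DD_SITE", DEFAULT_SITE)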

from .utils._exceptions import DatasetFileError

if TYPE_CHECKING:
    import pandas as pd

Member commented:

I think this could interfere with a user's typechecker if they don't have pandas installed when type checking their own app. We should still guard this with a try
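
One way to read that suggestion, as a sketch:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    try:
        import pandas as pd  # noqa: F401
    except ImportError:
        # Keeps a user's type checker from erroring when pandas isn't installed.
        pd = None  # type: ignore[assignment]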

        self.add(record)
        return self

    def add(self, record: Dict[str, Union[str, Dict[str, Any]]]) -> None:

Member commented:

defining a record type could drastically improve usability here.

from typing import Dict, TypedDict

class Record(TypedDict):
    input: str
    expected_output: Dict[str, "JsonType"]  # JsonType: the library's JSON value alias

for example
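
The add signature could then advertise the expected shape directly, e.g. (hypothetical):

def add(self, record: Record) -> None:
    ...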


# Experiments related
EXPECTED_OUTPUT = "_ml_obs.meta.input.expected_output"
EXPERIMENT_INPUT = "_ml_obs.meta.input"

Member commented:

confused as to why we need new fields for input and output here but maybe it will become obvious from further reading

@@ -241,6 +253,12 @@ def _llmobs_tags(span: Span, ml_app: str, session_id: Optional[str] = None) -> L
        "language": "python",
        "error": span.error,
    }

    # Add experiment_id from baggage if present
    experiment_id = span.context.get_baggage_item(EXPERIMENT_ID_BAGGAGE_KEY)

Member commented:

we should just look generally in the context for the experiment ID to follow the same paradigm we do for parent id and mlobs trace id
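
A hedged sketch of that paradigm, with a hypothetical helper living alongside the existing parent-ID/trace-ID resolvers:

def _get_experiment_id(span):
    # Hypothetical: centralize the context lookup in one resolver so
    # _llmobs_tags doesn't read baggage inline; EXPERIMENT_ID_BAGGAGE_KEY
    # is the constant from the diff above.
    return span.context.get_baggage_item(EXPERIMENT_ID_BAGGAGE_KEY)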

Member commented:

let's separate out this ui code and printing logic to an additional PR. There's too much going on here

Member commented:

this should be reverted

Member commented:

we should inline these file contents to the tests
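
A sketch of what inlining could look like with pytest's tmp_path; the test name, CSV contents, and from_csv constructor are all illustrative:

def test_dataset_from_csv(tmp_path):
    # Write the fixture contents inline instead of committing simple.csv.
    csv_path = tmp_path / "simple.csv"
    csv_path.write_text("input,expected_output\nwhat is 2+2?,4\n")
    dataset = Dataset.from_csv(str(csv_path))  # hypothetical constructor
    assert len(dataset) == 1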

def reset_config_state():
    """Resets global configuration state before each test."""
    # Reset global state variables to defaults
    config._IS_INITIALIZED = False

Member commented:

these values aren't kept in sync with the defaults, definitely want to move away from the globals
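
One way to keep the reset in sync, sketched with a stand-in for the config module (attribute names are illustrative):

from types import SimpleNamespace

config = SimpleNamespace(_IS_INITIALIZED=False, _SITE=None)  # stand-in for the real module

# Single source of truth for defaults, so the reset fixture cannot drift.
_CONFIG_DEFAULTS = {"_IS_INITIALIZED": False, "_SITE": None}

def reset_config_state():
    for name, value in _CONFIG_DEFAULTS.items():
        setattr(config, name, value)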

@pytest.fixture(scope="module", autouse=True)
def init_llmobs(experiments_vcr):
    # Use the provided keys directly. VCR filtering handles redaction.
    api_key = DD_API_KEY

Member commented:

definitely don't want to use globals in the tests
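
A sketch of one alternative using pytest's monkeypatch instead of a module-level key; note the fixture scope drops from module to function here, which is a trade-off:

import os

import pytest

@pytest.fixture
def init_llmobs(monkeypatch):
    # Set the key per-test via the environment rather than a DD_API_KEY global;
    # VCR filtering still redacts it from recorded cassettes.
    monkeypatch.setenv("DD_API_KEY", "test-api-key")
    yield os.environ["DD_API_KEY"]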

Labels: none yet
Projects: none yet
2 participants