feat(llmobs): add datasets and experiments features #13314
base: main
Conversation
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary: The average import time from this PR is 279 ± 4 ms. The average import time from base is 280 ± 2 ms. The import time difference between this PR and base is -0.9 ± 0.1 ms.

Import time breakdown: the following import paths have shrunk:
Benchmarks

Benchmark execution time: 2025-07-01 23:14:14. Comparing candidate commit 1d5f1b7 in PR branch.

Found 1 performance improvement and 0 performance regressions! Performance is the same for 568 metrics, 3 unstable metrics.

scenario: span-add-metrics
if site is None:
    site = os.getenv("DD_SITE")
if site is None:
    raise ConfigurationError(
we say in the docstring that this defaults to "datadoghq.com"
but raise here. I think we should default it and not raise here.
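A minimal sketch of that suggestion, assuming the `"datadoghq.com"` default described in the docstring:

```python
# Sketch only: fall back to the documented default instead of raising.
if site is None:
    site = os.getenv("DD_SITE") or "datadoghq.com"
```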
from .utils._exceptions import DatasetFileError

if TYPE_CHECKING:
    import pandas as pd
I think this could interfere with a user's typechecker if they don't have pandas installed when type checking their own app. We should still guard this with a try
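A hedged sketch of that guard; the `try` only matters for users who type-check their own app without pandas installed:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    try:
        import pandas as pd
    except ImportError:
        pd = None  # pandas is an optional dependency here
```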
        self.add(record)
        return self

    def add(self, record: Dict[str, Union[str, Dict[str, Any]]]) -> None:
defining a record type could drastically improve usability here.
from typing import Dict, TypedDict  # TypedDict is in typing as of Python 3.8

class Record(TypedDict):
    input: str
    expected_output: Dict[str, JsonType]  # JsonType: the library's JSON value alias

for example.
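With a type like that in place, the `add` signature above could become, for instance:

```python
def add(self, record: Record) -> None:
    ...
```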
# Experiments related
EXPECTED_OUTPUT = "_ml_obs.meta.input.expected_output"
EXPERIMENT_INPUT = "_ml_obs.meta.input"
confused as to why we need new fields for input and output here but maybe it will become obvious from further reading
still not obvious after further reading, @gary-huang can we just reuse meta.input.value for the span I/O? Or is there any specialized I/O formatting/structure for experiments spans?
@@ -241,6 +253,12 @@ def _llmobs_tags(span: Span, ml_app: str, session_id: Optional[str] = None) -> L
        "language": "python",
        "error": span.error,
    }

    # Add experiment_id from baggage if present
    experiment_id = span.context.get_baggage_item(EXPERIMENT_ID_BAGGAGE_KEY)
we should just look generally in the context for the experiment ID to follow the same paradigm we do for parent id and mlobs trace id
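A rough sketch of what that could look like; `_get_experiment_id` is a hypothetical helper name, mirroring how parent ID is resolved from the context rather than read from baggage here:

```python
# Hypothetical: resolve the experiment ID from the span context the same
# way parent_id / llmobs trace_id are resolved, instead of reading baggage.
experiment_id = _get_experiment_id(span)
if experiment_id:
    tags["experiment_id"] = experiment_id
```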
let's separate out this UI code and printing logic into an additional PR. There's too much going on here.
docker/.python-version
this should be reverted
we should inline these file contents into the tests
def reset_config_state():
    """Resets global configuration state before each test."""
    # Reset global state variables to defaults
    config._IS_INITIALIZED = False
these values aren't kept in sync with the defaults, definitely want to move away from the globals
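One way to avoid that drift, sketched with pytest's `monkeypatch` (an assumed fixture shape, not the actual test code):

```python
import pytest

@pytest.fixture(autouse=True)
def reset_config_state(monkeypatch):
    # monkeypatch restores the original attribute after each test,
    # so the fixture cannot fall out of sync with the real defaults.
    monkeypatch.setattr(config, "_IS_INITIALIZED", False)
```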
@pytest.fixture(scope="module", autouse=True)
def init_llmobs(experiments_vcr):
    # Use the provided keys directly. VCR filtering handles redaction.
    api_key = DD_API_KEY
definitely don't want to use globals in the tests
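For instance, the key could come from the test environment rather than a module-level constant (a sketch, not the actual fixture):

```python
import os

import pytest

@pytest.fixture(scope="module", autouse=True)
def init_llmobs(experiments_vcr):
    # VCR cassette filtering still redacts the key from recordings.
    api_key = os.environ.get("DD_API_KEY", "<dummy-api-key>")
    ...
```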
this setting seems to have been introduced to make testing easier:
if asbool(os.getenv("DD_EXPERIMENTS_RUNNER_ENABLED")):
    data["_dd.scope"] = "experiments"
Why are we treating experiments spans differently?
it's under a different scope on the track
def process_row(idx_row):
    idx, row = idx_row
    start_time = time.time()
    with LLMObs._experiment(name=self.task.__name__, experiment_id=self._datadog_experiment_id) as span:
Moving this to cover the entire run_task method
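Roughly, that restructuring would look like this (a sketch; the iteration details are assumptions):

```python
def run_task(self):
    # One experiment span covering the whole run, rather than
    # opening a span per row inside process_row.
    with LLMObs._experiment(name=self.task.__name__, experiment_id=self._datadog_experiment_id):
        for idx, row in enumerate(self.dataset):  # hypothetical iteration
            self.process_row((idx, row))
```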
LLMObs.flush()
time.sleep(API_PROCESSING_TIME_SLEEP)
Why sleep and flush here?
flushing to get the remaining items in the writers submitted; the sleep is used to account for backend replication lag (we can print a link to the experiment/dataset immediately, but if the user clicks it right away it may not be ready in the backend yet).
We should do away with the sleep and do something better.
Flush is to ensure spans get submitted if in serverless env (task ends). Fine with force flushing for now, but sleeping is weird
Should replace sleep with more verbose message (your trace/experiment will be available momentarily at ...)
weird question, but this is just for submitting spans, no? Why would we need to print a link to the experiment here? Shouldn't we be printing the link once the evals are finished/submitted?
I randomly stumbled here, but if y'all identify weird bits of logic to accommodate back-end behaviors, let me know and we can improve it. I agree that putting a sleep anywhere is very weird and if it's just to generate a link, we should remove it. These small nice to haves can come later.
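Summing up the thread, one hedged alternative keeps the flush and drops the sleep (`experiment_url` is a hypothetical variable):

```python
LLMObs.flush()  # still force-flush so spans are submitted (e.g. in serverless)
# Replace the replication-lag sleep with a softer message instead:
print(f"Your experiment will be available momentarily at {experiment_url}")
```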
Adds skeleton code for Experiments, experiment tasks, and experiment evaluator classes/decorators. Implementation of experiment `run()` has been left out for a follow-up PR.

Basic structure of Experiments (a usage sketch follows at the end of this description):
- Call LLMObs.experiment_task as a decorator to wrap a task function (must have `input` as an arg)
- Call LLMObs.experiment_evaluator as a decorator to wrap an evaluator function (must have `input`/`output`/`expected_output` as args)
- Create a Dataset
- Create an Experiment(name: str, task, dataset, evaluators, description, config)
- Call experiment.run(...)

Some concerns:
- Should experiment task/evaluator decorators support async/generator methods? Currently (and based on #13314) it only supports sync methods.
- The ExperimentTask wrapper class requires `input` as an arg name, which shadows the Python builtin.

## Checklist
- [x] PR author has checked that all the criteria below are met
  - The PR description includes an overview of the change
  - The PR description articulates the motivation for the change
  - The change includes tests OR the PR description describes a testing strategy
  - The PR description notes risks associated with the change, if any
  - Newly-added code is easy to change
  - The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
  - The change includes or references documentation updates if necessary
  - Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
  - Title is accurate
  - All changes are related to the pull request's stated goal
  - Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
  - Testing strategy adequately addresses listed risks
  - Newly-added code is easy to change
  - Release note makes sense to a user of the library
  - If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  - Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
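For illustration, a hedged usage sketch of the flow described above; every name and signature here is an assumption based on this description, not the final API:

```python
from ddtrace.llmobs import LLMObs

@LLMObs.experiment_task
def task(input):  # must take `input` as an arg
    return run_model(input)  # hypothetical model call

@LLMObs.experiment_evaluator
def exact_match(input, output, expected_output):  # required arg names
    return output == expected_output

dataset = Dataset(...)  # hypothetical constructor
experiment = Experiment(
    name="my-experiment",
    task=task,
    dataset=dataset,
    evaluators=[exact_match],
    description="example run",
    config={},
)
experiment.run()
```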
We need to bring back docker/.python-version
3.12
3.8
3.9
3.10
why are we deleting this? this is going to break things.