chore(llmobs): dac strip io from OpenAI #13791

Open

jsimpher wants to merge 25 commits into main from jsimpher/dac-strip-io-from-openai

Conversation

Contributor

@jsimpher jsimpher commented Jun 26, 2025

Remove potentially sensitive I/O data from APM spans. This way, prompt and completion data will only appear on the LLMObs spans, which are/will be subject to data access controls.

Mostly, this just removes I/O tag sets. A few things (mostly metrics) have LLMObs tags that depend on span tags, so there is a bit more refactoring there.
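As a rough illustration of the pattern (a minimal sketch, not the actual diff; the tag and context-item names below are placeholders), the change swaps APM span tags for LLMObs span context items:

# Sketch only: "openai.request.messages" and "input_messages" are placeholder names.
def tag_io_before(span, messages, completions):
    # Previously: prompt/completion content was written onto the APM span as tags.
    span.set_tag_str("openai.request.messages", str(messages))
    span.set_tag_str("openai.response.completions", str(completions))

def tag_io_after(span, messages, completions):
    # Now: the APM span carries no I/O content; it is only stored as
    # LLMObs span context items, which feed the LLMObs span.
    span._set_ctx_item("input_messages", messages)
    span._set_ctx_item("output_messages", completions)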

Let me know if I removed anything that should really stay, or if I missed something that should be restricted.

This one does a lot that the others don't. I've left in things like audio transcripts and image/file retrieval data that we don't duplicate on the LLMObs spans.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Contributor

github-actions bot commented Jun 26, 2025

CODEOWNERS have been resolved as:

releasenotes/notes/remove-io-data-from-apm-span-openai-integration-81f3ae914a5d2faf.yaml  @DataDog/apm-python
ddtrace/contrib/internal/openai/_endpoint_hooks.py                      @DataDog/ml-observability
ddtrace/contrib/internal/openai/patch.py                                @DataDog/ml-observability
ddtrace/contrib/internal/openai/utils.py                                @DataDog/ml-observability
ddtrace/llmobs/_integrations/openai.py                                  @DataDog/ml-observability
ddtrace/llmobs/_integrations/utils.py                                   @DataDog/ml-observability
ddtrace/llmobs/_utils.py                                                @DataDog/ml-observability
tests/contrib/openai/test_openai_llmobs.py                              @DataDog/ml-observability
tests/contrib/openai/test_openai_v1.py                                  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_acompletion.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_azure_openai_chat_completion.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_azure_openai_completion.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_azure_openai_embedding.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_chat_completion.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_chat_completion_function_calling.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_chat_completion_image_input.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_chat_completion_stream.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_completion.json   @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_completion_stream.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_create_moderation.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_embedding.json    @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_embedding_array_of_token_arrays.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_embedding_string_array.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_embedding_token_array.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_file_create.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_file_delete.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_file_download.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_file_list.json    @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_file_retrieve.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_image_b64_json_response.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_image_create.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_misuse.json       @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_model_delete.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_model_list.json   @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_model_retrieve.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_response.json     @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_response_error.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_response_stream.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_response_tools.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_response_tools_stream.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai.test_span_finish_on_stream_error.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_completion_stream_est_tokens.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_empty_streamed_chat_completion_resp_returns.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_empty_streamed_completion_resp_returns.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_empty_streamed_response_resp_returns.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_async.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_service_name[None-None].json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_service_name[None-v0].json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_service_name[None-v1].json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_service_name[mysvc-None].json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_service_name[mysvc-v0].json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_service_name[mysvc-v1].json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai.test_openai_v1.test_integration_sync.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai_agents.test_openai_agents.test_openai_agents.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai_agents.test_openai_agents.test_openai_agents_streaming.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai_agents.test_openai_agents.test_openai_agents_sync.json  @DataDog/ml-observability
tests/snapshots/tests.contrib.openai_agents.test_openai_agents.test_openai_agents_with_tool_error.json  @DataDog/ml-observability

Contributor

github-actions bot commented Jun 26, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 291 ± 7 ms.

The average import time from base is: 290 ± 7 ms.

The import time difference between this PR and base is: 0.7 ± 0.3 ms.

The difference is not statistically significant (z = 2.24).

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.246 ms (0.43%)
ddtrace 0.648 ms (0.22%)
ddtrace.bootstrap.sitecustomize 0.598 ms (0.21%)
ddtrace.bootstrap.preload 0.598 ms (0.21%)
ddtrace.internal.remoteconfig.client 0.598 ms (0.21%)


pr-commenter bot commented Jun 26, 2025

Benchmarks

Benchmark execution time: 2025-07-15 16:32:40

Comparing candidate commit ae0269f in PR branch jsimpher/dac-strip-io-from-openai with baseline commit c81e594 in branch main.

Found 0 performance improvements and 1 performance regression! Performance is the same for 391 metrics, 2 unstable metrics.

scenario:iastaspectsospath-ospathjoin_aspect

  • 🟥 execution_time [+924.513ns; +1001.213ns] or [+14.994%; +16.238%]

@@ -164,7 +178,7 @@ def _llmobs_set_meta_tags_from_embedding(span: Span, kwargs: Dict[str, Any], res
span._set_ctx_item(OUTPUT_VALUE, "[{} embedding(s) returned]".format(len(resp.data)))

@staticmethod
- def _extract_llmobs_metrics_tags(span: Span, resp: Any, span_kind: str) -> Dict[str, Any]:
+ def _extract_llmobs_metrics_tags(span: Span, resp: Any, span_kind: str) -> Optional[Dict[str, Any]]:
Contributor

🟠 Code Quality Violation

do not use Any, use a concrete type

Use the Any type very carefully. Most of the time, the Any type is used because we do not know exactly what type is being used. If you want to specify that a value can be of any type, use object instead of Any.
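For instance (a generic illustration of the rule, not code from this PR):

from typing import Any

def describe(value: object) -> str:
    # object accepts any value, but the type checker rejects arbitrary attribute access
    return repr(value)

def describe_any(value: Any) -> str:
    # Any silently allows calls that may not exist, hiding bugs from the type checker
    return value.to_pretty_string()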


@jsimpher jsimpher marked this pull request as ready for review June 27, 2025 16:42
@jsimpher jsimpher requested review from a team as code owners June 27, 2025 16:42
@jsimpher jsimpher requested review from P403n1x87 and quinna-h June 27, 2025 16:42
Contributor

@ncybul ncybul left a comment

Looking good, left some comments / questions! Lmk when you need another review!

@@ -133,7 +147,7 @@ def _llmobs_set_tags(
elif operation == "response":
openai_set_meta_tags_from_response(span, kwargs, response)
update_proxy_workflow_input_output_value(span, span_kind)
- metrics = self._extract_llmobs_metrics_tags(span, response, span_kind)
+ metrics = self._extract_llmobs_metrics_tags(span, response, span_kind) or span._get_ctx_item(METRICS)
Contributor

Could you confirm my understanding -- if the response is streamed, we expect the metrics to be on the span context and if the response is not streamed, then we need to extract the token usage from the response itself?

Contributor Author

Yeah, _extract_llmobs_metrics_tags used to act on response usage when available (non-streamed) and pull from the APM span when not available (streamed), but of course now it isn't on the APM span tags anymore.
So the idea here is to keep the exact same flow, but replace the write/read on APM span tags with a write/read on APM span context items. Of course, it looks a bit redundant here.
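A minimal sketch of that flow (resolve_token_metrics is a hypothetical name, the usage attribute and metric key names are illustrative, and METRICS is assumed to be the LLMObs context key constant from the diff above):

def resolve_token_metrics(span, resp):
    usage = getattr(resp, "usage", None)
    if usage is not None:
        # Non-streamed: token counts come straight off the response object.
        return {
            "input_tokens": getattr(usage, "prompt_tokens", 0),
            "output_tokens": getattr(usage, "completion_tokens", 0),
            "total_tokens": getattr(usage, "total_tokens", 0),
        }
    # Streamed: counts were computed while consuming the stream and stashed on
    # the span context (they previously lived on APM span tags).
    return span._get_ctx_item(METRICS)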

Contributor

Is it possible to just centralize token count extraction to one place instead of doing it in both places and praying one of them has it?

Contributor

@ncybul ncybul left a comment

Thanks for addressing my previous comments! In general, things look good, I am just wondering whether we should be even more heavy-handed removing tags from the APM spans that probably never needed to be there anyway. That might be outside the scope of this PR, but would be a good opportunity to clean things up even more. Lmk if any of that is unclear!



- def _set_token_metrics_from_streamed_response(span, response, prompts, messages, kwargs):
+ def _set_token_metrics_from_streamed_response(span, integration, response, prompts, messages, kwargs):
Contributor

Is this method even necessary anymore, since we're not tagging token metrics on APM spans? Shouldn't the LLMObs token metric extraction/tagging happen in integration._extract_llmobs_metrics_tags(), which gets called inside _llmobs_set_tags()?

It's been a while since I've touched this, so I might be wrong in my assumption. Please correct me if that's not the case.

Contributor Author

@jsimpher jsimpher Jul 11, 2025

There is a path where it is used; basically it's called in _process_finished_stream.
Technically we can move all of that (the whole chain of token estimation that kicks off in that case) over to LLMObs, since it only depends on resp and kwargs. I'll do that now.

@@ -96,14 +95,29 @@ def _is_provider(span, provider):
return False
return provider.lower() in base_url.lower()

- def record_usage(self, span: Span, usage: Dict[str, Any]) -> None:
+ def llmobs_record_usage(self, span: Span, usage: Dict[str, Any]) -> None:
Contributor

This is only used for setting token metrics on the LLMObs span, but doesn't this already happen in _extract_llmobs_metrics_tags() in _llmobs_set_tags()? I.e. is this method redundant now?

jsimpher and others added 3 commits July 11, 2025 09:06
…tion-81f3ae914a5d2faf.yaml

Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
@jsimpher jsimpher requested review from Yun-Kim and ncybul July 14, 2025 13:49
@@ -87,7 +87,7 @@ def _set_base_span_tags(self, span: Span, **kwargs) -> None:
client = "AzureOpenAI"
Contributor

Oh crap, I just realized openai.api_base and openai.base_url are required tags for us to detect which provider is in use in OpenAIIntegration._is_provider(). Can we try keeping the tags removed but keeping the _is_provider() method functional? (i.e. let's pass the client object directly to _is_provider)

Contributor

🤦 just saw the below changes. NVM!

@@ -87,7 +87,7 @@ def _set_base_span_tags(self, span: Span, **kwargs) -> None:
client = "AzureOpenAI"
elif self._is_provider(span, "deepseek"):
client = "Deepseek"
span.set_tag_str("openai.request.client", client)
span.set_tag_str("openai.request.provider", client)
Contributor

Any particular reason we're changing the name from client to provider?

Contributor Author

Requested by @ncybul here.

Comment on lines 192 to 196
elif kwargs.get("stream") and resp is not None:
return get_token_metrics_from_streamed_response(
span, resp, kwargs.get("prompt", None), kwargs.get("messages", None), kwargs
)
return None
Contributor

I like that we've moved out extracting token metrics to the integration level, but I can't help but notice get_token_metrics_from_streamed_response has almost 1:1 duplicate code with https://github.com/DataDog/dd-trace-py/pull/13791/files#diff-5b467cc82f5b96a7124a28da35a07477e10b42bd51b02903f8985854bc14b824R166-R183. Can we modify this to avoid having redundant code? i.e. if we notice that the token usage is not provided (which means it was a streamed response with user explicitly setting return_token_info=False), can we just directly run token estimation helpers?
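Something like the following would collapse the two paths (a sketch of the suggestion only, not the final implementation; _compute_prompt_tokens and _compute_completion_tokens are the estimation helpers quoted further down, assumed importable, and the metric key names are illustrative):

def extract_or_estimate_token_metrics(resp, kwargs, prompts, messages):
    usage = getattr(resp, "usage", None)
    if usage is not None:
        # Provider reported token counts directly on the response.
        prompt_tokens = getattr(usage, "prompt_tokens", 0)
        completion_tokens = getattr(usage, "completion_tokens", 0)
    else:
        # Streamed response without reported usage: fall back to estimation.
        model_name = kwargs.get("model", "")
        _, prompt_tokens = _compute_prompt_tokens(model_name, prompts, messages)
        _, completion_tokens = _compute_completion_tokens(resp, model_name)
    return {
        "input_tokens": prompt_tokens,
        "output_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }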

Comment on lines 1256 to 1267
usage = None
if response and isinstance(response, list) and _get_attr(response[0], "usage", None):
    usage = response[0].get("usage", {})
elif response and getattr(response, "usage", None):
    usage = response.usage

if usage:
    if hasattr(usage, "input_tokens") or hasattr(usage, "prompt_tokens"):
        prompt_tokens = getattr(usage, "input_tokens", 0) or getattr(usage, "prompt_tokens", 0)
    if hasattr(usage, "output_tokens") or hasattr(usage, "completion_tokens"):
        completion_tokens = getattr(usage, "output_tokens", 0) or getattr(usage, "completion_tokens", 0)
    total_tokens = getattr(usage, "total_tokens", 0)
Contributor

Looks suspiciously like _extract_llmobs_metrics_tags

Comment on lines 1269 to 1272
model_name = span.get_tag("openai.response.model") or kwargs.get("model", "")
_, prompt_tokens = _compute_prompt_tokens(model_name, prompts, messages)
_, completion_tokens = _compute_completion_tokens(response, model_name)
total_tokens = prompt_tokens + completion_tokens
Contributor

This is realistically the only part we need to keep that isn't duplicated with _extract_llmobs_metrics_tags, given there are no actual token usage metrics provided by OpenAI.

Contributor

@Yun-Kim Yun-Kim left a comment

One last question

return metrics
return get_llmobs_metrics_tags("openai", span)
elif kwargs.get("stream") and resp is not None:
Contributor

Just looking for confirmation - is this condition being hit for both streamed responses and regular responses that have token metrics? If not, doesn't this mean that all streamed responses will go through this branch (even if token metrics are present?)

Contributor Author

Surely the if that this elif follows would take the ones with token metrics? Or am I misunderstanding?

Contributor Author

@jsimpher jsimpher Jul 15, 2025

Oh, I think I misread you. I believe the if condition should fire for both streamed and non-streamed responses, and that the elif would only trigger for streamed responses without token_usage. I could be mistaken, but the check mirrors the one removed from _set_metric_stream here.
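In other words, the intended branch ordering is roughly (sketch only; has_token_usage and metrics_from_usage are placeholder helpers, while get_token_metrics_from_streamed_response is the helper quoted earlier):

if has_token_usage(resp):
    # Streamed or not, provider-reported token counts take precedence.
    metrics = metrics_from_usage(resp)
elif kwargs.get("stream") and resp is not None:
    # Streamed response with no reported usage: estimate token counts.
    metrics = get_token_metrics_from_streamed_response(
        span, resp, kwargs.get("prompt", None), kwargs.get("messages", None), kwargs
    )
else:
    metrics = None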

Contributor

Yeah, thanks for the sanity check. I guess if the tests for streamed token metrics are passing, then this makes sense 👍

…tion-81f3ae914a5d2faf.yaml

Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>