chore(llmobs): dac strip io from OpenAI #13791
base: main
Conversation
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary: The average import time from this PR is 291 ± 7 ms. The average import time from base is 290 ± 7 ms. The import time difference between this PR and base is 0.7 ± 0.3 ms. The difference is not statistically significant (z = 2.24).

Import time breakdown: the following import paths have shrunk:

Benchmarks

Benchmark execution time: 2025-07-15 16:32:40. Comparing candidate commit ae0269f in PR branch. Found 0 performance improvements and 1 performance regression. Performance is the same for 391 metrics, 2 unstable metrics.

scenario: iastaspectsospath-ospathjoin_aspect
@@ -164,7 +178,7 @@ def _llmobs_set_meta_tags_from_embedding(span: Span, kwargs: Dict[str, Any], res
         span._set_ctx_item(OUTPUT_VALUE, "[{} embedding(s) returned]".format(len(resp.data)))

     @staticmethod
-    def _extract_llmobs_metrics_tags(span: Span, resp: Any, span_kind: str) -> Dict[str, Any]:
+    def _extract_llmobs_metrics_tags(span: Span, resp: Any, span_kind: str) -> Optional[Dict[str, Any]]:
🟠 Code Quality Violation
do not use Any, use a concrete type (...read more)
Use the Any type very carefully. Most of the time, the Any type is used because we do not know exactly what type is being used. If you want to specify that a value can be of any type, use object instead of Any.
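For illustration only, a minimal sketch of the difference the rule points at; the function and attribute names here are hypothetical and not from this PR:

from typing import Any, Optional

def token_total_any(resp: Any) -> int:
    # With Any, the type checker accepts any attribute access, so a typo or a
    # missing "usage" field only surfaces at runtime.
    return resp.usage.total_tokens

def token_total_checked(resp: object) -> Optional[int]:
    # With object (or a concrete type), attribute access has to be made explicit,
    # e.g. via getattr with a default, before the value can be used.
    usage = getattr(resp, "usage", None)
    total = getattr(usage, "total_tokens", None) if usage is not None else None
    return total if isinstance(total, int) else None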
Looking good, left some comments / questions! Lmk when you need another review!
@@ -133,7 +147,7 @@ def _llmobs_set_tags(
         elif operation == "response":
             openai_set_meta_tags_from_response(span, kwargs, response)
         update_proxy_workflow_input_output_value(span, span_kind)
-        metrics = self._extract_llmobs_metrics_tags(span, response, span_kind)
+        metrics = self._extract_llmobs_metrics_tags(span, response, span_kind) or span._get_ctx_item(METRICS)
Could you confirm my understanding -- if the response is streamed, we expect the metrics to be on the span context and if the response is not streamed, then we need to extract the token usage from the response itself?
Yeah, ._extract_llmobs_metrics_tags used to act on the response usage when available (non-streamed) and pull from the APM span when not available (streamed), but of course now it isn't on the APM span tags anymore.
So the idea here is to keep the exact same flow, but replace the write/read on APM span tags with a write/read on APM span context tags. Of course, it looks a bit redundant here.
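A rough sketch of that flow, reusing the _set_ctx_item/_get_ctx_item calls and the METRICS key visible in the diff above; the handler names and the METRICS value here are simplified stand-ins, not the actual integration code:

METRICS = "llmobs.metrics"  # stand-in for the real context-item key constant

def on_stream_finished(span, streamed_usage):
    # Streamed path: usage is accumulated as chunks arrive and stashed on the
    # span context instead of on APM span tags.
    span._set_ctx_item(METRICS, streamed_usage)

def llmobs_set_tags(integration, span, kwargs, response, span_kind):
    # Non-streamed path: read usage off the response; if that yields nothing
    # (the streamed case), fall back to what the stream handler stored above.
    metrics = integration._extract_llmobs_metrics_tags(span, response, span_kind) or span._get_ctx_item(METRICS)
    span._set_ctx_item(METRICS, metrics or {})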
Is it possible to just centralize token count extraction to one place instead of doing it in both places and praying one of them has it?
Thanks for addressing my previous comments! In general, things look good; I am just wondering whether we should be even more heavy-handed in removing tags from the APM spans that probably never needed to be there anyway. That might be outside the scope of this PR, but it would be a good opportunity to clean things up even more. Lmk if any of that is unclear!
-def _set_token_metrics_from_streamed_response(span, response, prompts, messages, kwargs):
+def _set_token_metrics_from_streamed_response(span, integration, response, prompts, messages, kwargs):
Is this method even necessary anymore since we're not tagging token metrics on APM spans? Shouldn't the LLMObs token metric extraction/tagging happen in integration._extract_llmobs_metrics_tags(), which gets called inside llmobs_set_tags()?
It's been a while since I've touched this, so I might be wrong in my assumption. Please correct me if that's not the case.
There is a path where it is used; it's basically called in _process_finished_stream.
Technically we can move all of that (this whole chain of token estimation that kicks off in that case) over to LLMObs, since it only depends on resp and kwargs. I'll do that now.
@@ -96,14 +95,29 @@ def _is_provider(span, provider):
             return False
         return provider.lower() in base_url.lower()

-    def record_usage(self, span: Span, usage: Dict[str, Any]) -> None:
+    def llmobs_record_usage(self, span: Span, usage: Dict[str, Any]) -> None:
This is only used for setting token metrics on the LLMObs span, but doesn't this already happen in _extract_llmobs_metrics_tags() in _llmobs_set_tags()? I.e. is this method redundant now?
releasenotes/notes/remove-io-data-from-apm-span-openai-integration-81f3ae914a5d2faf.yaml
…tion-81f3ae914a5d2faf.yaml Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
@@ -87,7 +87,7 @@ def _set_base_span_tags(self, span: Span, **kwargs) -> None:
             client = "AzureOpenAI"
Oh crap, I just realized openai.api_base and openai.base_url are required tags for us to detect the provider in OpenAIIntegration._is_provider(). Can we try keeping the tags removed but keep the _is_provider() method functional? (i.e. let's pass in the client object directly to _is_provider)
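For illustration, a minimal sketch of that suggestion (which, per the follow-up below, turned out to be unnecessary): derive the base URL from the client object instead of from tags previously written to the APM span. The attribute lookups here are assumptions, not the PR's actual implementation:

def _is_provider(client, provider):
    # Read the base URL off the OpenAI client itself rather than off span tags;
    # newer clients expose base_url, the legacy module-level config used api_base.
    base_url = getattr(client, "base_url", None) or getattr(client, "api_base", None)
    if not base_url:
        return False
    return provider.lower() in str(base_url).lower()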
🤦 just saw the below changes. NVM!
@@ -87,7 +87,7 @@ def _set_base_span_tags(self, span: Span, **kwargs) -> None:
             client = "AzureOpenAI"
         elif self._is_provider(span, "deepseek"):
             client = "Deepseek"
-        span.set_tag_str("openai.request.client", client)
+        span.set_tag_str("openai.request.provider", client)
any particular reason we're changing the name from client to provider?
elif kwargs.get("stream") and resp is not None: | ||
return get_token_metrics_from_streamed_response( | ||
span, resp, kwargs.get("prompt", None), kwargs.get("messages", None), kwargs | ||
) | ||
return None |
I like that we've moved extracting token metrics out to the integration level, but I can't help but notice get_token_metrics_from_streamed_response has almost 1:1 duplicate code with https://github.com/DataDog/dd-trace-py/pull/13791/files#diff-5b467cc82f5b96a7124a28da35a07477e10b42bd51b02903f8985854bc14b824R166-R183. Can we modify this to avoid having redundant code? I.e. if we notice that the token usage is not provided (which means it was a streamed response with the user explicitly setting return_token_info=False), can we just directly run the token estimation helpers?
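A hedged sketch of the consolidation being proposed: one helper that prefers usage reported on the response and only falls back to estimation when a streamed response carried none. The wrapper name and the metric key strings are hypothetical; _compute_prompt_tokens and _compute_completion_tokens are the estimation helpers that appear in the snippets below, so they are assumed to be importable here:

def extract_or_estimate_token_metrics(span, resp, kwargs):
    # Prefer usage reported by OpenAI, whether the response was streamed or not.
    usage = getattr(resp, "usage", None)
    if usage:
        prompt_tokens = getattr(usage, "input_tokens", 0) or getattr(usage, "prompt_tokens", 0)
        completion_tokens = getattr(usage, "output_tokens", 0) or getattr(usage, "completion_tokens", 0)
        return {
            "input_tokens": prompt_tokens,
            "output_tokens": completion_tokens,
            "total_tokens": getattr(usage, "total_tokens", prompt_tokens + completion_tokens),
        }
    # Streamed response that reported no usage: fall back to tokenizer-based estimation.
    model_name = span.get_tag("openai.response.model") or kwargs.get("model", "")
    _, prompt_tokens = _compute_prompt_tokens(model_name, kwargs.get("prompt"), kwargs.get("messages"))
    _, completion_tokens = _compute_completion_tokens(resp, model_name)
    return {
        "input_tokens": prompt_tokens,
        "output_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }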
usage = None
if response and isinstance(response, list) and _get_attr(response[0], "usage", None):
    usage = response[0].get("usage", {})
elif response and getattr(response, "usage", None):
    usage = response.usage

if usage:
    if hasattr(usage, "input_tokens") or hasattr(usage, "prompt_tokens"):
        prompt_tokens = getattr(usage, "input_tokens", 0) or getattr(usage, "prompt_tokens", 0)
    if hasattr(usage, "output_tokens") or hasattr(usage, "completion_tokens"):
        completion_tokens = getattr(usage, "output_tokens", 0) or getattr(usage, "completion_tokens", 0)
    total_tokens = getattr(usage, "total_tokens", 0)
Looks suspiciously like _extract_llmobs_metrics_tags
model_name = span.get_tag("openai.response.model") or kwargs.get("model", "")
_, prompt_tokens = _compute_prompt_tokens(model_name, prompts, messages)
_, completion_tokens = _compute_completion_tokens(response, model_name)
total_tokens = prompt_tokens + completion_tokens
This is realistically the only part we need to keep that isn't duplicated with _extract_llmobs_metrics_tags, given there are no actual token usage metrics provided by OpenAI in that case.
One last question
return metrics
return get_llmobs_metrics_tags("openai", span)
elif kwargs.get("stream") and resp is not None:
Just looking for confirmation - is this condition being hit for both streamed responses and regular responses that have token metrics? If not, doesn't this mean that all streamed responses will go through this branch (even if token metrics are present?)
Surely the if that this elif follows would take the ones with token metrics? Or am I misunderstanding?
Oh, I think I misread you. I am of the belief that the if condition should fire for both streamed and non-streamed, and that the elif would only trigger for streamed without token_usage. I could be mistaken, but the check mirrors the one removed from _set_metric_stream here.
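For the sanity check, a sketch of the branch order being described; _usage_to_metrics is a hypothetical stand-in for however usage is read off the response, while the elif and the estimation call mirror the added lines shown earlier:

def _sketch_extract_metrics(span, resp, kwargs):
    metrics = _usage_to_metrics(resp)  # hypothetical: returns None when no usage was reported
    if metrics:
        # Fires for streamed and non-streamed responses alike, as long as
        # OpenAI reported usage on the response object.
        return metrics
    elif kwargs.get("stream") and resp is not None:
        # Only reached for streamed responses that carried no usage:
        # fall back to tokenizer-based estimation.
        return get_token_metrics_from_streamed_response(
            span, resp, kwargs.get("prompt", None), kwargs.get("messages", None), kwargs
        )
    return None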
Yeah thanks for the sanity check. I guess if tests for streamed token metrics are passing then this makes sense 👍
releasenotes/notes/remove-io-data-from-apm-span-openai-integration-81f3ae914a5d2faf.yaml
…tion-81f3ae914a5d2faf.yaml Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Remove potentially sensitive I/O data from APM spans. This way, prompt and completion data will only appear on the LLMObs spans, which are/will be subject to data access controls.
Mostly, this just removes I/O tag sets. A few things (mostly metrics) have LLMObs tags that depend on span tags, so there is a bit more refactoring there.
Let me know if I removed anything that should really stay, or if I missed something that should be restricted.
This one does a lot that the others don't. I've left things like audio transcripts and image/file retrieval that we don't duplicate.
Checklist
Reviewer Checklist