Skip to content

feat(llmobs): add bedrock agents tracing #13439

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

Yun-Kim
Copy link
Contributor

@Yun-Kim Yun-Kim commented May 16, 2025

MLOB-2660
This PR adds tracing for AWS Bedrock agents runtime.

There are 2 components to this PR:

  1. APM+LLMObs: Trace bedrock agents runtime invokeAgent() call, i.e. the end-to-end agent request.
  2. LLMObs-only: if AWS bedrock traces are returned by the response (i.e. the user adds enableTrace=True in the request params), then we translate the returned bedrock traces into LLMObs span events.

Translating AWS Bedrock Traces to Datadog LLMObs Spans

The main idea here is that we are receiving AWS bedrock "trace" events and converting/stitching them into Datadog LLMObs spans. We do three levels of tracing:

  1. E2E Agent Request: AWS Bedrock traces don't have a top-level agent "trace" concept, so we are capturing this directly by tracing the invokeAgent call.
  2. Trace steps: AWS Bedrock traces events are not hierarchical in nature and represent steps in the workflow, and contain a combination of LLM/tool/reasoning steps (instead of Datadog LLMObs and other common LLMObs frameworks which separate out each step into its own span). We are explicitly creating subtrace spans to represent each trace step by stitching together the internal trace step events' timestamps, and annotating with the input of the first inner step, and the output of the last inner step.
  3. Inner trace step: Based on the bedrock trace event, we stitch events together into LLM/Tool/Task spans.

Trace steps and inner trace Steps

As of the time of writing, Bedrock Agent Traces can consist of the following trace types, each with their own internal structure and unique types:

  • Failure
  • Guardrail
  • Orchestration
  • PreProcessing
  • PostProcessing
  • Routing
  • CustomOrchestration
  • Observation
    Inner trace step events also include but are not limited to modelInvocationInput, modelInvocationOutput, actionGroupInvocationInput, actionGroupInvocationOutput. These events need to be stitched together to create LLMObs span events. I've also made rationale and guardrail trace step events standalone spans. Also, since we're translating events to LLMObs span dictionaries, things like exceptions specified in the trace step events need to be raised explicitly in our span events in order to retain the traceback information to propagate to the root agent span.

For more information on the Bedrock Agent Trace structure/formats, see their docs.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Copy link
Contributor

github-actions bot commented May 16, 2025

CODEOWNERS have been resolved as:

ddtrace/contrib/internal/botocore/services/bedrock_agents.py            @DataDog/ml-observability
ddtrace/llmobs/_integrations/bedrock_agents.py                          @DataDog/ml-observability
ddtrace/llmobs/_integrations/bedrock_utils.py                           @DataDog/ml-observability
releasenotes/notes/feat-bedrock-agents-efb725a0860eae87.yaml            @DataDog/apm-python
tests/contrib/botocore/bedrock_cassettes/agent_invoke.yaml              @DataDog/ml-observability
tests/contrib/botocore/bedrock_cassettes/agent_invoke_trace_disabled.yaml  @DataDog/ml-observability
tests/contrib/botocore/conftest.py                                      @DataDog/apm-core-python @DataDog/apm-idm-python
tests/contrib/botocore/test_bedrock_agents.py                           @DataDog/ml-observability
tests/contrib/botocore/test_bedrock_agents_llmobs.py                    @DataDog/ml-observability
tests/snapshots/tests.contrib.botocore.test_bedrock_agents.test_agent_invoke.json  @DataDog/apm-python
.github/CODEOWNERS                                                      @DataDog/python-guild @DataDog/apm-core-python
ddtrace/contrib/integration_registry/registry.yaml                      @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/botocore/patch.py                              @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/botocore/services/bedrock.py                   @DataDog/ml-observability
ddtrace/llmobs/_integrations/bedrock.py                                 @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
riotfile.py                                                             @DataDog/apm-python
supported_versions_output.json                                          @DataDog/apm-core-python
supported_versions_table.csv                                            @DataDog/apm-core-python
tests/contrib/botocore/bedrock_utils.py                                 @DataDog/ml-observability
tests/contrib/botocore/test_bedrock.py                                  @DataDog/ml-observability
tests/contrib/botocore/test_bedrock_llmobs.py                           @DataDog/ml-observability
.riot/requirements/14fceda.txt                                          @DataDog/apm-python
.riot/requirements/1558546.txt                                          @DataDog/apm-python
.riot/requirements/160bd16.txt                                          @DataDog/apm-python
.riot/requirements/17fe359.txt                                          @DataDog/apm-python
.riot/requirements/1ada48c.txt                                          @DataDog/apm-python

Copy link
Contributor

github-actions bot commented May 16, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 279 ± 3 ms.

The average import time from base is: 281 ± 4 ms.

The import time difference between this PR and base is: -1.6 ± 0.2 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.945 ms (0.70%)
ddtrace.bootstrap.sitecustomize 1.272 ms (0.46%)
ddtrace.bootstrap.preload 1.272 ms (0.46%)
ddtrace.internal.remoteconfig.client 0.628 ms (0.23%)
ddtrace 0.674 ms (0.24%)
ddtrace.internal._unpatched 0.030 ms (0.01%)
json 0.030 ms (0.01%)
json.decoder 0.030 ms (0.01%)
re 0.030 ms (0.01%)
enum 0.030 ms (0.01%)
types 0.030 ms (0.01%)

@pr-commenter
Copy link

pr-commenter bot commented May 16, 2025

Benchmarks

Benchmark execution time: 2025-06-16 21:37:25

Comparing candidate commit f6a4841 in PR branch yunkim/llmobs-bedrock-agent with baseline commit 5692392 in branch main.

Found 0 performance improvements and 3 performance regressions! Performance is the same for 558 metrics, 3 unstable metrics.

scenario:iastaspectsospath-ospathsplitdrive_aspect

  • 🟥 execution_time [+375.834ns; +431.276ns] or [+10.992%; +12.614%]

scenario:iastaspectsospath-ospathsplitext_aspect

  • 🟥 execution_time [+658.342ns; +720.485ns] or [+15.334%; +16.781%]

scenario:iastaspectssplit-splitlines_aspect

  • 🟥 execution_time [+129.528ns; +159.169ns] or [+8.855%; +10.882%]

Copy link
Contributor

@sabrenner sabrenner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a first pass, mostly nits, i think actionable for when you're back. it largely makes sense to me. i'll do a second pass at the tests and an additional pass on the bedrock agents file (that's the one i spent the most time looking at 😅)

@Yun-Kim Yun-Kim marked this pull request as ready for review June 12, 2025 16:52
@Yun-Kim Yun-Kim requested review from a team as code owners June 12, 2025 16:52
Copy link
Contributor

@sabrenner sabrenner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple questions/nits but am good to approve after they're addressed! nice work with this in such short time 😄

@Yun-Kim Yun-Kim requested a review from a team as a code owner June 16, 2025 17:00
Copy link
Contributor

@sabrenner sabrenner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Comment on lines +101 to +102
span.finish()
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
span.finish()
raise
raise
finally:
span.finish()

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only want to finish the span if there's an error, since we're always expecting a streamed response otherwise (which we'll finish in the TracedStreamResponse wrapper class)

Comment on lines 94 to 102
try:
result = original_func(*args, **kwargs)
result = handle_bedrock_agent_response(result, integration, span, args, kwargs)
return result
except Exception:
integration.llmobs_set_tags(span, args, kwargs, result, operation="agent")
span.set_exc_info(*sys.exc_info())
span.finish()
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
result = original_func(*args, **kwargs)
result = handle_bedrock_agent_response(result, integration, span, args, kwargs)
return result
except Exception:
integration.llmobs_set_tags(span, args, kwargs, result, operation="agent")
span.set_exc_info(*sys.exc_info())
span.finish()
raise
try:
result = original_func(*args, **kwargs)
result = handle_bedrock_agent_response(result, integration, span, args, kwargs)
return result
finally:
integration.llmobs_set_tags(span, args, kwargs, result, operation="agent")

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately here we only want to call the finish/llmobs_set_tags methods in case of error, otherwise we handle them once the wrapped StreamResponse object consumes all chunks. This is just an error case handling here, I'll add a comment to clarify this

@@ -1377,7 +1377,7 @@ def select_pys(min_version: str = MIN_PYTHON_VERSION, max_version: str = MAX_PYT
),
Venv(
pys=select_pys(min_version="3.9"),
pkgs={"vcrpy": "==7.0.0", "botocore": ">=1.34.131", "boto3": ">=1.34.131"},
pkgs={"vcrpy": "==7.0.0", "botocore": "==1.38.26", "boto3": "==1.38.26"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we might have to change the min compatible version here: https://github.com/DataDog/dd-trace-py/pull/13613/files#diff-1f8812e8f134ceee7717f2f35d62d2ad3c455933bac5899e932e96c39e5cdf80 - I think @wconti27 was doing something with enforcing minimum version for SSI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we? I thought the lines above there's a base riot Venv that tests an even older version of botocore 1.34.49

Copy link
Contributor

@quinna-h quinna-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but will defer to the MLObs team for correctness of the implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants