[serverless] Extract EventBridge from SNS/SQS Payloads and Create EventBridge Spans #29551
…NS or SQS. This adds distributed tracing support for `Lambda -> EventBridge -> SQS -> Lambda` and `Lambda -> EventBridge -> SNS -> Lambda`
Test changes on VM: Use this command from test-infra-definitions to manually test this PR's changes on a VM: `inv create-vm --pipeline-id=45469028 --os-family=ubuntu`. Note: This applies to commit 293167c.
…rshall directly into `events.EventBridgeEvent` and use `eventBridgeCarrier()` instead of `handleEventBridgeThroughSNS()`
```go
// Check for EventBridge event wrapped by the SNS message
var eventBridgeEvent events.EventBridgeEvent
if err := json.Unmarshal([]byte(event.Records[0].SNS.Message), &eventBridgeEvent); err == nil {
```
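The excerpt above only shows the unmarshal guard. A self-contained sketch of how the SNS-wrapped extraction could work is below; the type and field names here are illustrative stand-ins, not the agent's actual definitions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical stand-ins for the agent's event types.
type eventBridgeDetail struct {
	// The tracer injects trace context under detail["_datadog"].
	TraceContext map[string]string `json:"_datadog"`
}

type eventBridgeEvent struct {
	DetailType string            `json:"detail-type"`
	Source     string            `json:"source"`
	Detail     eventBridgeDetail `json:"detail"`
}

// extractFromSNSMessage tries to interpret an SNS message body as an
// EventBridge event and returns its embedded trace context, if any.
func extractFromSNSMessage(snsMessage string) (map[string]string, bool) {
	var ev eventBridgeEvent
	if err := json.Unmarshal([]byte(snsMessage), &ev); err != nil {
		return nil, false
	}
	if len(ev.Detail.TraceContext) == 0 {
		return nil, false
	}
	return ev.Detail.TraceContext, true
}

func main() {
	msg := `{"detail-type":"order.created","source":"my-app","detail":{"_datadog":{"x-datadog-trace-id":"1234"}}}`
	ctx, ok := extractFromSNSMessage(msg)
	fmt.Println(ok, ctx["x-datadog-trace-id"])
}
```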
As an aside, I wish we didn't have to unmarshal these same event messages twice, once here and once during trace context extraction. Something to think about for a future enhancement.
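One possible direction for that future enhancement is to decode the payload once into raw JSON fields and let both passes (event-type detection and trace-context extraction) share the result. A minimal sketch, assuming nothing about the agent's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parsedEvent caches the result of a single json.Unmarshal so later
// passes never re-parse the whole payload. Hypothetical helper, not
// the agent's real code.
type parsedEvent struct {
	decoded map[string]json.RawMessage
}

func parseOnce(payload []byte) (*parsedEvent, error) {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(payload, &m); err != nil {
		return nil, err
	}
	return &parsedEvent{decoded: m}, nil
}

// field lazily decodes a single top-level field from the cached map.
func (p *parsedEvent) field(name string, out interface{}) bool {
	rawField, ok := p.decoded[name]
	if !ok {
		return false
	}
	return json.Unmarshal(rawField, out) == nil
}

func main() {
	p, err := parseOnce([]byte(`{"detail-type":"x","detail":{"_datadog":{"k":"v"}}}`))
	if err != nil {
		panic(err)
	}
	var detailType string
	p.field("detail-type", &detailType)
	fmt.Println(detailType)
}
```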
regression detector report failed to post again, here it is:

Regression Detector Results

Run ID: a66f3f55-1e31-47ea-b902-a8ca141361e4
Metrics dashboard
Target profiles
Baseline: e31532d

Performance changes are noted in the perf column of each table:

No significant changes in experiment optimization goals

Confidence level: 90.00%. There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | otel_to_otel_logs | ingress throughput | +2.22 | [+1.40, +3.04] | 1 | Logs |
➖ | pycheck_lots_of_tags | % cpu utilization | +0.36 | [-2.10, +2.82] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | +0.12 | [+0.07, +0.17] | 1 | Logs |
➖ | idle | memory utilization | +0.04 | [-0.00, +0.09] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.10, +0.08] | 1 | Logs |
➖ | idle_all_features | memory utilization | -0.26 | [-0.35, -0.17] | 1 | Logs |
➖ | file_tree | memory utilization | -0.48 | [-0.56, -0.39] | 1 | Logs |
➖ | basic_py_check | % cpu utilization | -0.64 | [-3.32, +2.04] | 1 | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.65 | [-1.38, +0.08] | 1 | Logs |
Bounds Checks
perf | experiment | bounds_check_name | replicates_passed |
---|---|---|---|
✅ | idle | memory_usage | 10/10 |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- only if all of the following criteria are true:

- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that, if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
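The three criteria above can be sketched as a small predicate. The thresholds are taken from the report text; the function itself is illustrative, not the detector's real code:

```go
package main

import "fmt"

// isRegression applies the three criteria described above to one experiment.
// deltaMeanPct is the estimated Δ mean %, and [ciLow, ciHigh] is its 90%
// confidence interval.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	const effectSizeTolerance = 5.0 // |Δ mean %| must reach this
	bigEnough := deltaMeanPct >= effectSizeTolerance || deltaMeanPct <= -effectSizeTolerance
	ciExcludesZero := ciLow > 0 || ciHigh < 0
	return bigEnough && ciExcludesZero && !erratic
}

func main() {
	// otel_to_otel_logs from the table above: +2.22 [+1.40, +3.04].
	// Its CI excludes zero, but the effect is below the 5% tolerance.
	fmt.Println(isRegression(2.22, 1.40, 3.04, false))
}
```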
/merge
Serverless Benchmark Results

tl;dr: Use these benchmarks as an insight tool during development.

What is this benchmarking?

The benchmark is run using a large variety of Lambda request payloads. In the charts below, there is one row for each event payload type.

How do I interpret these charts?

The charts below come from benchstat. The benchstat docs explain how to interpret these charts.

I need more help

First off, do not worry if the benchmarks are failing. They are not tests. The intention is for them to be a tool for you to use during development. If you would like a hand interpreting the results, come chat with us.

Benchmark stats
…ext (#6096)

## Summary of changes

This creates a new instrumentation for EventBridge and intercepts `PutEvents` and `PutEventsAsync` to inject trace context. This allows the agent to combine spans from a distributed (serverless) architecture into a single trace. This PR only injects trace context. I'm working on [PR 1](DataDog/datadog-agent#29414) and [PR 2](DataDog/datadog-agent#29551) to update the Lambda extension to use this trace context to create EventBridge spans. I am also working on a similar PR in [dd-trace-java](DataDog/dd-trace-java#7613) and dd-trace-go.

## Reason for change

SNS and SQS are already supported, and the tracer currently injects trace context into their message attributes fields. However, EventBridge wasn't supported, and this PR aims to fix that.

## Implementation details

I followed the [documentation](https://github.com/DataDog/dd-trace-dotnet/blob/master/docs/development/AutomaticInstrumentation.md) to create an instrumentation. Much of the logic was mirrored from the [existing implementation](https://github.com/DataDog/dd-trace-dotnet/tree/master/tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/AWS/SNS) for SNS, since EventBridge and SNS are extremely similar. Overall, AWS's EventBridge API is missing some features, so we have to work around them:

- SNS and SQS call their custom input field `messageAttributes`, while EventBridge calls it `detail`.
- Unlike SNS and SQS, the `detail` field is given as a raw string, so we have to modify the detail string manually using `StringBuilder`.
- The agent has no reliable way of getting the start time of the EventBridge span, so the tracer has to put the current time into `detail[_datadog]` under the header `x-datadog-start-time`.
- The EventBridge API has no way of getting the EventBridge bus name, so the tracer has to put the bus name (which is used to create the span resource name) into `detail[_datadog]` under the header `x-datadog-resource-name`.

## Test coverage

I added system tests for SNS/SQS: DataDog/system-tests#3204

I added [unit tests](d05eb4c) and [integration tests](5ccd8b7).

Unit tests can be run with:

```
cd tracer
dotnet test ./test/Datadog.Trace.ClrProfiler.Managed.Tests
```

Integration tests can be run with these commands:

```
cd tracer
# Start localstack in docker
docker run --rm -it -p 4566:4566 -p 4571:4571 -e SERVICES=events localstack/localstack
# Run integration tests
./build.sh BuildAndRunOSxIntegrationTests -buildConfiguration Debug -framework net6.0 -Filter AwsEventBridgeTests -SampleName Samples.AWS.EventBridge
```

I also did manual testing: <img width="505" alt="Screenshot 2024-09-30 at 11 00 47 AM" src="https://github.com/user-attachments/assets/bdf5d516-8b46-4138-ac25-c45d1822dc56">

## Other details

There are lots of diffs and files changed. I recommend reviewing the PR commit by commit. All the autogenerated files were added in a single commit, which should make the review process less overwhelming.

Co-authored-by: Steven Bouwkamp <steven.bouwkamp@datadoghq.com>
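The injection side described above (adding a `_datadog` member to the raw `detail` string) can be sketched in Go, along the lines of the dd-trace-go work mentioned. The function name and inputs are illustrative assumptions, not the tracer's real API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// injectTraceContext adds a `_datadog` member to an EventBridge detail
// string. The header names mirror the ones described above; everything
// else is a sketch, not dd-trace-go's actual implementation.
func injectTraceContext(detail, traceID, parentID, busName string) (string, error) {
	if detail == "" {
		detail = "{}"
	}
	var m map[string]interface{}
	if err := json.Unmarshal([]byte(detail), &m); err != nil {
		return "", err
	}
	m["_datadog"] = map[string]string{
		"x-datadog-trace-id":  traceID,
		"x-datadog-parent-id": parentID,
		// The agent cannot recover these two on its own, so the tracer
		// smuggles them inside the payload (see the notes above).
		"x-datadog-start-time":    fmt.Sprintf("%d", time.Now().UnixMilli()),
		"x-datadog-resource-name": busName,
	}
	out, err := json.Marshal(m)
	return string(out), err
}

func main() {
	out, err := injectTraceContext(`{"orderId":"123"}`, "1", "2", "my-bus")
	fmt.Println(err == nil, out != "")
}
```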
What does this PR do?

This adds distributed tracing support for `Lambda -> EventBridge -> SQS -> Lambda` and `Lambda -> EventBridge -> SNS -> Lambda`.

Motivation

`Lambda -> SNS -> SQS -> Lambda` is already supported. EventBridge-based architectures (e.g. `Lambda -> EventBridge -> SQS -> Lambda`) are very common, so we want to add support.

Describe how to test/QA your changes

Run local tests with `invoke test --targets=./pkg/serverless` from the root of the project.

It's more complicated to test manually. Contact me if you want a live demo of these changes. Here's how I tested:

- `datadog-ci`
- `./scripts/publish_sandbox.sh`
- `./gradlew publishToMavenLocal`
- Set `JAVA_TOOL_OPTIONS: -javaagent:"/opt/dd-java-agent-1.40.0-SNAPSHOT.jar"` on your Lambda functions to use the new development build of the Java tracer

Possible Drawbacks / Trade-offs

Additional Notes

`Lambda -> EventBridge -> Lambda`
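On the extension side, the smuggled `x-datadog-start-time` and `x-datadog-resource-name` headers described earlier are what allow an EventBridge span to be reconstructed. A hypothetical sketch; the struct, field, and operation names here are assumptions, not the agent's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// inferredSpan sketches the EventBridge span the extension could create
// from the `_datadog` headers carried in the event detail.
type inferredSpan struct {
	Operation string
	Resource  string
	Start     time.Time
}

func spanFromHeaders(headers map[string]string) (inferredSpan, error) {
	// The start time arrives as epoch milliseconds written by the tracer.
	ms, err := strconv.ParseInt(headers["x-datadog-start-time"], 10, 64)
	if err != nil {
		return inferredSpan{}, err
	}
	return inferredSpan{
		Operation: "aws.eventbridge",
		Resource:  headers["x-datadog-resource-name"],
		Start:     time.UnixMilli(ms),
	}, nil
}

func main() {
	s, err := spanFromHeaders(map[string]string{
		"x-datadog-start-time":    "1700000000000",
		"x-datadog-resource-name": "my-bus",
	})
	fmt.Println(err == nil, s.Resource)
}
```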