aws-sdk SQS propagation #5167

blumamir · 2022-01-18T15:06:04Z

Hello,
I am trying to make aws-sdk SQS context propagation work between applications written in Java (scala actually), and nodejs / ruby.
In order for it to work, both the sender and receiver should agree on how to inject and extract the context onto the messages.

Unfortunately, Java and node \ ruby behave differently which creates broken traces for end-users who are sending messages across those systems.

NodeJS \ Ruby

These implementations use the "OpenTelemetry" approach. They use the propagators registered in the otel API (w3c \ b3 \ custom \ etc) to inject and extract context from the message attributes

Cons

There are quotas on SQS messages which the context propagation consumes: (https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html) which means:

User is charged for this payload which includes the message attributes. It is not expected to be expensive as "Each 64 KB chunk of a payload is billed as 1 request", but still it's a few dozen bytes that are charged.
If the message is just a bit below the 256K hard limit, then adding the context data can potentially cross this limit and reject the message.
Message attributes are limited to 10 values total, including the user values, which means if he already used the available amount, then instrumentation has no more space to inject propagation headers as well.

Java

The jave implementation is using the X-Amzn-Trace-Id header, which does not consume quotas like the Nodejs \ Ruby implementation, but has the following cons:

Cons

If X-Ray is enabled on a service, it might inject additional spans into the trace as the request is passing via the AWS x-ray enabled services. These spans are only exported to X-Ray. It means that otel users which are not using X-Ray, will have missing spans in the trace, and thus the goal to have async messaging visibility is lost.
For some services, if X-Ray is disabled, the service will still create a new trace, flag it as a non-sampled trace and propagate this information downstream. If the application is configured with parent-based sampler, then x-ray effectivly turns off tracing for the application. See this issue in node for reference and more info.
X-Ray propagator does not support baggage, which means baggage values are not propagated via AWS services even if user registers the W3C Baggage propagator.

Action Items

Considering the above Pros and Cons, I want to suggest adding a second propagation style into the aws-sdk instrumentations, which is compatible with nodejs / ruby and allow the user to bypass the x-ray propagator so the above compatibility issues will not affect his application.

I'll be very happy to get more insights and ideas on this issue. Do you think it makes sense adding this new propagation style? Maybe AWS has plans to solve the compatibility problems in the near future?

The text was updated successfully, but these errors were encountered:

anuraaga · 2022-01-18T23:25:51Z

Hi @blumamir - currently the spec is only in the context of Lambda, but the concept is similar to any handling of an SQS message I think.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/instrumentation/aws-lambda.md#sqs-message

For some services, if X-Ray is disabled, the service will still create a new trace, flag it as a non-sampled trace and propagate this information downstream. If the application is configured with parent-based sampler, then x-ray effectivly turns off tracing for the application. See this issue in node for reference and more info.

IIRC, this only happens when API gateway is involved as it is actually the one creating a trace in that issue. When disabled, it should otherwise be an invalid context with the value "Sampled=0" and no IDs. But there are some bugs in API gateway that cause problems unfortunately :(

X-Ray propagator does not support baggage, which means baggage values are not propagated via AWS services even if user registers the W3C Baggage propagator.

For reference, there is current work on supporting baggage and w3c headers within AWS services, though it will take a while to make it to existing services like SQS.

The biggest problem with avoiding AWSTraceHeader is to lose connectivity between AWS services - for example, it is the only way to trace a request from client (AWS SDK) -> S3 -> SQS. In general, there will be less problems using AWS propagation because it's the only one really recognized by the AWS service.

To work around the API gateway bug and baggage support, adding a setting here to opt-in to OTel propagation would be OK. But if your use case doesn't actually involve these, then actually it will be better served by having the other languages have an option, which should be the default similar to the spec for Lambda (we should probably update the spec to move that section to the AWS SDK instrumentation spec to indicate it's not only for Lambda). This is the only way to ensure the stablest propagation across various AWS services, S3 event bridge appsync right now, in the future we hope to have more middle-node services covered too.

Let me know if anything's not clear.

blumamir · 2022-01-20T12:00:22Z

Updating here that I am working on adding an option to use the otel propagators via message attributes in SQS, to support context propagation between nodejs / ruby <-> java.

awelless · 2023-02-20T17:49:54Z

I have also encountered the problem with propagating baggage via sns/sqs. I also put baggage into message attributes, only manually. Having this implemented in the instrumentation side would be great.
@blumamir, did you have time to work on this?

blumamir · 2023-02-21T08:16:36Z

I have also encountered the problem with propagating baggage via sns/sqs. I also put baggage into message attributes, only manually. Having this implemented in the instrumentation side would be great. @blumamir, did you have time to work on this?

I started working on it back ago but it needs a refactor greater than I planned. I don't remember all the details, but I think the instrumentation uses some mechanism only to handle span at request time and response time, where the injection needs to happen after span creation (so we have the span id to inject) but before the request is sent on the wire (so it can include the message attributes).

awelless · 2023-02-21T09:10:32Z

I started working on it back ago but it needs a refactor greater than I planned. I don't remember all the details, but I think the instrumentation uses some mechanism only to handle span at request time and response time, where the injection needs to happen after span creation (so we have the span id to inject) but before the request is sent on the wire (so it can include the message attributes).

You're right. Span creation takes place in afterMarshalling method when request body is not easy to modify. However, I reckon that it should be possible to create span in beforeExecution- it's the first hook and the docs say it's guaranteed that this method is executing on the same thread preserving all thread locals.
I could try to refactor this and move all span creation logic into another method, so it will allow us to have an access to all tracing data during request modification.
What do you think?

tylerbenson · 2023-03-13T20:33:48Z

I think this issue can be closed as Java implements this according to the spec:
https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/instrumentation/aws-lambda/#sqs-message

The main recent addition is #7970 which removes the AWS X-Ray environment variable from the propagation chain and allows the regular propagators to be applied.

Other languages may still require additional changes to implement the correct propagation.
See also open-telemetry/opentelemetry-specification#3212.

mateuszrzeszutek · 2023-03-14T11:38:16Z

Closing. Thanks!

…7970) open-telemetry/opentelemetry-specification#3166 Per discussion in the FAAS SIG, we decided that the AWS X-Ray environment variable should be moved to a span link to avoid interfering with the configured propagators. Also related to #5167.

blumamir added the enhancement New feature or request label Jan 18, 2022

mateuszrzeszutek closed this as completed Mar 14, 2023

tylerbenson mentioned this issue Mar 14, 2023

Move X-Ray Env Variable propagation to span link instead of parent #7970

Merged

blumamir mentioned this issue Mar 20, 2023

feat: SDK Extension AWS Gem open-telemetry/opentelemetry-ruby-contrib#386

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

aws-sdk SQS propagation #5167

aws-sdk SQS propagation #5167

blumamir commented Jan 18, 2022

anuraaga commented Jan 18, 2022

blumamir commented Jan 20, 2022

awelless commented Feb 20, 2023

blumamir commented Feb 21, 2023

awelless commented Feb 21, 2023 •

edited

Loading

tylerbenson commented Mar 13, 2023

mateuszrzeszutek commented Mar 14, 2023

aws-sdk SQS propagation #5167

aws-sdk SQS propagation #5167

Comments

blumamir commented Jan 18, 2022

NodeJS \ Ruby

Cons

Java

Cons

Action Items

anuraaga commented Jan 18, 2022

blumamir commented Jan 20, 2022

awelless commented Feb 20, 2023

blumamir commented Feb 21, 2023

awelless commented Feb 21, 2023 • edited Loading

tylerbenson commented Mar 13, 2023

mateuszrzeszutek commented Mar 14, 2023

awelless commented Feb 21, 2023 •

edited

Loading