Skip to content

Add AWS X-Ray Adaptive Sampling Support #2147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

majanjua-amzn
Copy link

Description:
Adding support for X-Ray's new Adaptive Sampling feature. The feature will allow users to detect anomalies in their application across distributed services and boost the sampling rate of their root service based on their X-Ray sampling rule configuration and optionally a user-provided SDK level configuration. It will also allow users to optionally provide an error capture configuration, where - if configured - the sampler will send unsampled anomaly spans to be exported directly:

  • Configure a max boost rate and boost cooldown in their X-Ray sampling rules that defines how high the sampling rate can go for a given rule based on anomalies detected in the instrumentation and for how long
  • Configure anomaly conditions based on error code (RegEx), latency, and span operation - these will be used to check if a given span is an anomaly. By default, spans with a statusCode > 499 (i.e. 5XX) will be considered anomalies
  • Configure a anomaly/error capture rate that allows spans to be sent directly to the configured span exporter.

The changes:

  • Provide APIs called setSpanExporter and setAdaptiveSamplingConfig to set up the feature - if these are not provided any attempt to use the adaptSampling API will throw an IllegalStateException
  • Provide an API called adaptSampling that accepts a span and its associated spanData:
    • This API performs the necessary logic to determine whether the span is an anomaly based on the user-provided conditions (or default 5XX) and makes the decision whether to count it towards the boost-related statistics and/or whether to send it directly for export based on the error capture rate
  • Update the calls to GetSamplingRules and GetSamplingTargets according to the new API and the collected anomaly statistics
  • Propagate sampling information between instrumented services - specifically, the sampling rule in the root service is passed to all downstream services via the trace state, such that statistics can be recorded meaningfully for boost to be triggered for the root service in a distributed system

Testing:
Re-testing needs to be done based on latest changes in aws-otel-java-instrumentation once ready, hence the need for the draft revision of the PR

  • Manual testing: Application launched on fresh EC2 instance with 3 services
  • Performance testing: Revealed no difference between

Documentation:
TBD - hence draft PR

Outstanding items:
These changes are the first phase and iteration of AWS X-Ray's adaptive sampling feature. As we get feedback, more changes may be introduced to improve or streamline the experience.

Copy link

linux-foundation-easycla bot commented Aug 20, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

SamplingResult result =
applier.shouldSample(parentContext, traceId, name, spanKind, attributes, parentLinks);

String ruleToPropagate =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It is readable to update line 134-150:

  1. if there is rule from upstream, return upstream's rule;
  2. if adaptiveSamplingConfig, return current rule;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. Splitting the two cases would require some code duplication or turning the logic into a function which doesn't really make sense to me here.

for (AwsXrayAdaptiveSamplingConfig.AnomalyConditions condition : anomalyConditions) {
// Check if the operation matches any in the list or if operations list is null (match all)
List<String> operations = condition.getOperations();
if (!(operations == null || operations.isEmpty() || operations.contains(operation))) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is operation(s) in condition required?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

}
}
}
if (shouldBoostSampling && shouldCaptureAnomalySpan) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance, maybe if shouldBoostSampling or shouldCaptureAnomalySpan is true we can skip the remaining BoostSampling or CaptureAnomaly check.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

boolean isLocalRootSpan =
parentContext == null || !parentContext.isValid() || parentContext.isRemote();

if (shouldBoostSampling || shouldCaptureAnomalySpan || isLocalRootSpan) {
Copy link

@wangzlei wangzlei Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability, not sure whether the "if" here is useful since you have
// Anomaly Capture, Sampling Boost below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split // Anomaly Capture from // Sampling Boost section. Still need shouldBoostSampling || isLocalRootSpan together since they both trigger the same loop logic that finds the ruleToReportTo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants