-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test trace sampling #193
test trace sampling #193
Conversation
could I get some comments @garypen @cecton @o0Ignition0o @BrynCooke ? Not really on the code (I'll scrap that implementation), more on the ideas around sampling traces |
From your needs. I like point 1,2 and 4. I have some concerns about 3. Could a bad actor (client) send in sampling orders that result in a DoS? As long as we sanitize client header configuration, that may be ok, but I'd still have concerns about overriding server side configuration this way. I would prefer to not expose sampling rates as configuration options. After all, this is something we should know and don't really want to make customers decide. I like categorising between success/error and I like declaring sampling rates. |
Maybe an external configuration tool can allow sampling for a specific time according to a specific set of rules? I definitely see a usecase where support / solution teams would like to enable sampling all of the requests for a given client (by IP / userid or something), to get extra observability for a limited period of time. In order to avoid ddos we could rate limit the requests with the sampling header, or enable it for up to 5 minutes (or both?). I like us providing options to whichever application will steer us. Bonus points because our state machine (with the watcher) almost allows that already. |
Accepting sampling orders from the client is very useful for debugging: most requests can be sampled but some can be always traced if decided. Same for sampling rates: that's highly dependent on the application. I'd prefer reasonable defaults, ie not accepting client overrides on the sampling rates, and an automatic sampling strategy at first (maybe per query) |
(I arrived on this repository from the PR on There should always be the option for the application operator to:
Point 1. is a requirement to perform tail-based sampling, which usually leads to much better results in terms of relevance of the traces that actually make it to the tracing storage backend (see Refinery as an example). Point 2. is important when the router is publicly exposed. Letting clients of public APIs in charge of propagating their sampling choices can lead to denial of service scenarios (e.g. if they send a ton of requests all set to "sample this"). This was already raised, but just reinforcing it as I happened to witness this very scenario play out 😅 |
your experience is very welcomed :) |
I think I'll go with sampling rates in the configuration only for now, and we'll see in a follow up PR for client driven sampling |
Sounds like a good first step! Can we add this to the ADR, maybe in #213 ? |
a small setback for now: with tracing-opentelemetry, the sampling decision has to be made before the span ends: https://docs.rs/tracing-opentelemetry/latest/tracing_opentelemetry/trait.PreSampledTracer.html I'll look into making a filter elsewhere instead of using opentelemetry's sampling |
closing this, the work is happening in #224 |
related:
this is an exploration of trace sampling. Currently the custom sampler just prints its parameters. With the query
query ExampleQuery($topProductsFirst: Int) { me { id avis: reviews { body } } topProducts(first: $topProductsFirst) { name price inStock }}
it will print:Apparently it does not receive information that there was an error event inside a span. In its result, a sampler has a SamplingDecision:
RecordOnly is useful if we want to process the span (example: for statistics or SLO/SLA) but don't want to send it because we're sampling
The sampler can look at the parent span to see if it should be sampled. The sampling decision depends on the parent so here in the logs, we have three different traces (probably a bug, they should be merged).
The spec on sampled flag usage: https://www.w3.org/TR/trace-context/#sampled-flag
needs
The kind of sampling I'd like to integrate:
design
Context::current().span().set_attribute(key_value)
)