
Trace sampling #228

Closed
Geal opened this issue Dec 1, 2021 · 2 comments · Fixed by #224

Geal commented Dec 1, 2021

related:

Request tracing can be expensive; we should have a way to select which traces are sent to Jaeger or OpenTelemetry.

@Geal Geal added the triage label Dec 1, 2021
@Geal Geal self-assigned this Dec 1, 2021
@Geal Geal added 2021-11 and removed triage labels Dec 1, 2021
@abernix abernix added 2021-12 and removed 2021-11 labels Dec 3, 2021

Geal commented Dec 8, 2021

Following up on the tests in #224 and #237, and after discussing it: adapting the sampling rate to the result (for example, aggressive sampling for successful queries and no sampling for failed ones, so we receive all of them) has tradeoffs we might not want to make yet. It requires tail sampling, where we store all of the spans until the end of the query and only then decide whether to send the trace.

This can result in high memory usage, especially with long-running queries, and it will be difficult to get right.
So as a first phase, I'll set up tracing-opentelemetry's head-based sampling, with a rate defined in the configuration, where the sampling decision is made once at the beginning of the trace.
If needed, users will be able to change the configuration and reload it to adapt the rate for debugging.
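
For context, a minimal sketch of what that head-based setup could look like with tracing-opentelemetry and the Jaeger exporter. The service name, ratio, and builder methods here are illustrative only (they assume the opentelemetry / opentelemetry-jaeger crate APIs of that era), not the router's actual configuration:

    use opentelemetry::sdk::trace::{self, Sampler};
    use tracing_subscriber::prelude::*;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Head-based sampling: the keep/drop decision is made once, when the
        // root span is created. TraceIdRatioBased(0.1) keeps roughly 10% of traces.
        let tracer = opentelemetry_jaeger::new_pipeline()
            .with_service_name("router")
            .with_trace_config(trace::config().with_sampler(Sampler::TraceIdRatioBased(0.1)))
            .install_simple()?;

        tracing_subscriber::registry()
            .with(tracing_opentelemetry::layer().with_tracer(tracer))
            .init();

        // ... run the application; spans from unsampled traces are never exported.
        Ok(())
    }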

@Geal Geal closed this as completed in #224 Dec 13, 2021

jwilm commented Dec 14, 2021

Hi @Geal,

I saw your issue link and thought I'd drop a note here. We ended up implementing a proof-of-concept tail sampling for tracing-opentelemetry. The code is available publicly, but we still consider it experimental. That said, we have it deployed as part of a high volume application, and it has been running without issue for about 1 week. Actually, we've had better memory utilization than without it due to fewer spans being sent out of the system.

To utilize the tail sampling layer, we internally added an extension method to the tracing::Span type that looks like this:

    fn drop_this_trace(&self) {
        dispatcher::get_default(|d| {
            // Downcast the current subscriber to the Registry so we can look
            // up the current span's stored data.
            let registry = d
                .downcast_ref::<Registry>()
                .expect("Subscriber is not Registry; this is a bug.");

            let span = d.current_span();
            let spanref = registry.span(&span.id().unwrap()).unwrap();
            let extensions = spanref.extensions();
            let trace_context = extensions.get::<TraceContext>().unwrap();

            // Mark the whole trace so the tail-sampling layer drops it
            // instead of exporting it.
            let mut trace_ext = trace_context.trace.extensions_mut();
            trace_ext.insert(SampleDecision {
                record_trace: false,
            });
        });
    }

For now we are just calling this in our application code, but it should be possible to write a tracing layer, inserted before the tracing-opentelemetry layer, that looks for span attributes with a particular status code (if I'm understanding your use case correctly) and adds the SampleDecision to the trace context extensions as in the code snippet above; a rough sketch of such a layer follows.
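
A rough sketch of what such a layer could look like. The `SampleDecision` struct and the `http.status_code` field name below are placeholders standing in for whatever the proof-of-concept actually uses, and the sketch writes the decision into the span's own extensions rather than the fork's trace context:

    use tracing::field::{Field, Visit};
    use tracing::span;
    use tracing_subscriber::layer::{Context, Layer};
    use tracing_subscriber::registry::LookupSpan;

    /// Placeholder for the proof-of-concept's sampling marker.
    struct SampleDecision {
        record_trace: bool,
    }

    /// Visitor that picks out an `http.status_code` field, if the span records one.
    #[derive(Default)]
    struct StatusVisitor {
        status: Option<i64>,
    }

    impl Visit for StatusVisitor {
        fn record_i64(&mut self, field: &Field, value: i64) {
            if field.name() == "http.status_code" {
                self.status = Some(value);
            }
        }

        fn record_debug(&mut self, _field: &Field, _value: &dyn std::fmt::Debug) {}
    }

    struct ErrorSamplingLayer;

    impl<S> Layer<S> for ErrorSamplingLayer
    where
        S: tracing::Subscriber + for<'a> LookupSpan<'a>,
    {
        fn on_record(&self, id: &span::Id, values: &span::Record<'_>, ctx: Context<'_, S>) {
            let mut visitor = StatusVisitor::default();
            values.record(&mut visitor);

            if let (Some(status), Some(span)) = (visitor.status, ctx.span(id)) {
                // Keep traces for failed requests and drop the rest; a downstream
                // tail-sampling layer would read this marker to make the final call.
                span.extensions_mut().insert(SampleDecision {
                    record_trace: status >= 500,
                });
            }
        }
    }

In a real subscriber stack this layer would be registered before the tracing-opentelemetry layer, as described above.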

If you want to try it out, I'd be happy to answer any questions or take feedback you may have.

Cheers!

o0Ignition0o added a commit that referenced this issue Jan 11, 2022
# [v0.1.0-alpha.3] 2022-01-11

## 🚀🌒 Public alpha release

> An alpha or beta release is in volatile, active development. The release might not be feature-complete, and breaking API changes are possible between individual versions.

## ✨ Features

- Trace sampling [#228](#228): Tracing each request can be expensive. The router now supports sampling, which allows us to send traces for only a fraction of the received requests.

- Health check [#54](#54)

## 🐛 Fixes

- Schema parse errors [#136](#136): The router wouldn't display what went wrong when parsing an invalid Schema. It now displays exactly where the parsing error occurred, and why.

- Various tracing and telemetry fixes [#237](#237): The router wouldn't display what went wrong when parsing an invalid Schema. It now displays exactly where the parsing error occurred, and why.

- Query variables validation [#62](#62): Now that we have a schema parsing feature, we can validate the variables and their types against the schemas and queries.
tinnou pushed a commit to Netflix-Skunkworks/router that referenced this issue Oct 16, 2023
There was an issue with recent attempted publishes due to a bad secret
rotation. This PR temporarily disables cargo publishes in our CI
pipelines while we create GitHub releases. It will be re-enabled once the
packages are properly released.