Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test trace sampling #193

Closed
wants to merge 1 commit into from
Closed

test trace sampling #193

wants to merge 1 commit into from

Conversation

Geal
Copy link
Contributor

@Geal Geal commented Nov 25, 2021

related:

this is an exploration of trace sampling. Currently the custom sampler just prints its parameters. With the query query ExampleQuery($topProductsFirst: Int) { me { id avis: reviews { body } } topProducts(first: $topProductsFirst) { name price inStock }} it will print:

{"timestamp":"Nov 25 11:14:54.182","level":"INFO","fields":{"message":"Starting Apollo Router\n*******************************************************************\n⚠️  Experimental software, not YET recommended for production use ⚠️\n****************************************
***************************"},"target":"router"}                   
{"timestamp":"Nov 25 11:14:54.659","level":"INFO","fields":{"message":"Listening on http://127.0.0.1:4000 🚀"},"target":"router"}
Nov 25 11:15:17.822 ERROR graphql_request: apollo_router::warp_http_server_factory: TEST ERROR stream_request
should_sample:                          
parent context: Some(Context { entries: 0 })                                                                                            
trace id: TraceId(271433600488164807808896833698919480999)
name:graphql_request                                                                                                                    
span_kind: Internal                               
attributes: [                                                                                                                           
    KeyValue {                                                                                                                          
        key: Key(                         
            "code.filepath",                                                                                                            
        ),                                                                                                                              
        value: String(    
            "crates/apollo-router/src/warp_http_server_factory.rs",                                                                                                                                                                                                             
        ),                                         
    },                                                                                                                                  
    KeyValue {                                                      
        key: Key(                                                                                                                       
            "code.namespace",                
        ),                                                                                                                                                                                                                                                                              value: String(                                                                                                                  
            "apollo_router::warp_http_server_factory",                                                                                                                                                                                                                          
        ),
    },
    KeyValue {
        key: Key(
            "code.lineno",
        ),
        value: I64(
            275,
        ),
    },
]
links: []

should_sample:                                     
parent context: Some(Context { entries: 0 })                                                                                            
trace id: TraceId(10389225646006697978543566376744025668)
name:execution                                                     
span_kind: Internal         
attributes: [                                                                                                                           
    KeyValue {                                                                                                                          
        key: Key(                                                                                                                       
            "code.filepath",                                  
        ),                                                                                                                              
        value: String(                                                                                                                                                                                                                                                          
            "crates/apollo-router-core/src/query_planner/model.rs",
        ),                                                                                                                              
    },                                                                                                                                  
    KeyValue {
        key: Key(                           
            "code.namespace",                             
        ),          
        value: String(
            "apollo_router_core::query_planner::model",
        ),    
    },           
    KeyValue {              
        key: Key(
            "code.lineno",
        ),                                                         
        value: I64(
            249,
        ),    
    },           
]                            
links: [] 
                                                                    
Nov 25 11:15:18.149 ERROR format_response{self=Query { string: "query ExampleQuery($topProductsFirst: Int) {\n  me {\n    id\n avis: reviews { body } \n }\n  topProducts(first: $topProductsFirst) {\n    name\n    price\n    inStock\n  }\n}\n" } response=Response { label: 
None, data: Object({"me": Object({"__typename": String("User"), "id": String("1"), "avis": Array([Object({"body": String("Love it!")}), Object({"body": String("Too expensive.")})])}), "topProducts": Array([Object({"__typename": String("Product"), "upc": String("1"), "name
": String("Table"), "price": Number(899), "inStock": Bool(true)}), Object({"__typename": String("Product"), "upc": String("2"), "name": String("Couch"), "price": Number(1299), "inStock": Bool(false)}), Object({"__typename": String("Product"), "upc": String("3"), "name": S
tring("Chair"), "price": Number(54), "inStock": Bool(true)})])}), path: None, has_next: None, errors: [], extensions: {} }}: apollo_router_core::request: TEST ERROR format_response 

should_sample:                                            
parent context: Some(Context { entries: 0 })                                                                                                                                                                                                                                    
trace id: TraceId(272298814962264935456849981390735724192)                                                                                                                                                                                                                      
name:format_response                                                                                                                                                                                                                                                            
span_kind: Internal                                                                                                                                                                                                                                                             
attributes: [                                                                                                                                                                                                                                                                   
    KeyValue {                                                                                                                                                                                                                                                                  
        key: Key(                                                                                                                                                                                                                                                               
            "code.filepath",                                                                                                                                                                                                                                                    
        ),                                                                                                                                                                                                                                                                      
        value: String(                                                                                                                                                                                                                                                          
            "crates/apollo-router-core/src/request.rs",                                                                                                                                                                                                                         
        ),                                                                                                                                                                                                                                                                      
    },                                                                                                                                                                                                                                                                          
    KeyValue {                                                                                                                                                                                                                                                                  
        key: Key(                                                                                                                                                                                                                                                               
            "code.namespace",                                                                                                                                                                                                                                                   
        ),                                                                                                                                                                                                                                                                      
        value: String(                                                                                                                                                                                                                                                          
            "apollo_router_core::request",                                                                                                                                                                                                                                      
        ),                                                                                                                                                                                                                                                                      
    },                                                                                                                                                                                                                                                                          
    KeyValue {                                                                                                                                                                                                                                                                  
        key: Key(                                                                                                                                                                                                                                                               
            "code.lineno",                                                                                                                                                                                                                                                      
        ),                                                                                                                                                                                                                                                                      
        value: I64(                                                                                                                                                                                                                                                             
            51,                                                                                                                                                                                                                                                                 
        ),                                                                                                                                                                                                                                                                      
    },                                                                                                                                                                                                                                                                          
    KeyValue {                                                                                                                                                                                                                                                                  
        key: Key(                                                                                                                                                                                                                                                               
            "self",                                                                                                                                                                                                                                                             
        ),                                                                                                                                                                                                                                                                      
        value: String(                                                                                                                                                                                                                                                          
            "Query { string: \"query ExampleQuery($topProductsFirst: Int) {\\n  me {\\n    id\\n avis: reviews { body } \\n }\\n  topProducts(first: $topProductsFirst) {\\n    name\\n    price\\n    inStock\\n  }\\n}\\n\" }",                                               
        ),                                                                                                                                                                                                                                                                      
    },                                                                                                                                                                                                                                                                          
    KeyValue {                                                                                                                                                                                                                                                                  
        key: Key(                                                                                                                                                                                                                                                               
            "response",                                                                                                                                                                                                                                                         
        ),                                                                                                                                                                                                                                                                      
        value: String(                                                                                                                                                                                                                                                          
            "Response { label: None, data: Object({\"me\": Object({\"__typename\": String(\"User\"), \"id\": String(\"1\"), \"avis\": Array([Object({\"body\": String(\"Love it!\")}), Object({\"body\": String(\"Too expensive.\")})])}), \"topProducts\": Array([Object({\"__t
ypename\": String(\"Product\"), \"upc\": String(\"1\"), \"name\": String(\"Table\"), \"price\": Number(899), \"inStock\": Bool(true)}), Object({\"__typename\": String(\"Product\"), \"upc\": String(\"2\"), \"name\": String(\"Couch\"), \"price\": Number(1299), \"inStock\": 
Bool(false)}), Object({\"__typename\": String(\"Product\"), \"upc\": String(\"3\"), \"name\": String(\"Chair\"), \"price\": Number(54), \"inStock\": Bool(true)})])}), path: None, has_next: None, errors: [], extensions: {} }",
        ),
    },
    KeyValue {                                                      
        key: Key(                                                                                                                       
            "busy_ns",                                   
        ),                                                       
        value: I64(                                       
            504336,                                             
        ),                                              
    },                                                                                                                                  
    KeyValue {                                 
        key: Key(                                            
            "idle_ns",                               
        ),                                                                                                                              
        value: I64(
            29067,                                                                                                                      
        ),                                                                                                                              
    },                                                              
]                                                                                                                                       
links: []                                                                                                                               
                                                                    
Nov 25 11:15:18.149 ERROR apollo_router::apollo_router: TEST ERROR

Apparently it does not receive information that there was an error event inside a span. In its result, a sampler has a SamplingDecision:

  • Drop: delete the span from the trace
  • RecordOnly: keep the span, but it is not sent
  • RecordAndSample: keep the span, send it

RecordOnly is useful if we want to process the span (example: for statistics or SLO/SLA) but don't want to send it because we're sampling
The sampler can look at the parent span to see if it should be sampled. The sampling decision depends on the parent so here in the logs, we have three different traces (probably a bug, they should be merged).
The spec on sampled flag usage: https://www.w3.org/TR/trace-context/#sampled-flag

needs

The kind of sampling I'd like to integrate:

  • send all traces where we saw an error (or at least most of them, we should be careful not to overwhelm the collection system during an incident)
  • sample successful queries, with a configurable rate (percentage based, time based, maybe per query hash, etc)
  • honor sampling orders sent by the client in a header
  • indicate in the trace the sample rate

design

  • we make a sampler like the parent based that defers the decision to the parent if it's sampled
  • our sampler looks at the attributes to see if there's one indicating it should be sampled (can be done with Context::current().span().set_attribute(key_value))
  • at the end of the response processing, if there was an error, we set the attribute for it

@Geal
Copy link
Contributor Author

Geal commented Nov 29, 2021

could I get some comments @garypen @cecton @o0Ignition0o @BrynCooke ? Not really on the code (I'll scrap that implementation), more on the ideas around sampling traces

@garypen
Copy link
Contributor

garypen commented Nov 29, 2021

could I get some comments @garypen @cecton @o0Ignition0o @BrynCooke ? Not really on the code (I'll scrap that implementation), more on the ideas around sampling traces

From your needs. I like point 1,2 and 4. I have some concerns about 3. Could a bad actor (client) send in sampling orders that result in a DoS? As long as we sanitize client header configuration, that may be ok, but I'd still have concerns about overriding server side configuration this way.

I would prefer to not expose sampling rates as configuration options. After all, this is something we should know and don't really want to make customers decide. I like categorising between success/error and I like declaring sampling rates.

@o0Ignition0o
Copy link
Contributor

Maybe an external configuration tool can allow sampling for a specific time according to a specific set of rules?

I definitely see a usecase where support / solution teams would like to enable sampling all of the requests for a given client (by IP / userid or something), to get extra observability for a limited period of time. In order to avoid ddos we could rate limit the requests with the sampling header, or enable it for up to 5 minutes (or both?).

I like us providing options to whichever application will steer us. Bonus points because our state machine (with the watcher) almost allows that already.

@Geal
Copy link
Contributor Author

Geal commented Nov 29, 2021

Accepting sampling orders from the client is very useful for debugging: most requests can be sampled but some can be always traced if decided.
This has to be configurable though, since as you said, a bad actor could abuse it.
This is not something we can decide definitely for the user though, that depends on their infrastructure.

Same for sampling rates: that's highly dependent on the application.

I'd prefer reasonable defaults, ie not accepting client overrides on the sampling rates, and an automatic sampling strategy at first (maybe per query)

@LukeMathWalker
Copy link

LukeMathWalker commented Nov 30, 2021

(I arrived on this repository from the PR on reqwest-tracing, therefore as a developer who spent too much time working on distributed tracing I am offering my 2c based on our experience, but feel free to ignore them entirely)

There should always be the option for the application operator to:

  • Retain 100% of the traces;
  • Ignore the sampling choices made by the remote parent of the current span.

Point 1. is a requirement to perform tail-based sampling, which usually leads to much better results in terms of relevance of the traces that actually make it to the tracing storage backend (see Refinery as an example).

Point 2. is important when the router is publicly exposed. Letting clients of public APIs in charge of propagating their sampling choices can lead to denial of service scenarios (e.g. if they send a ton of requests all set to "sample this"). This was already raised, but just reinforcing it as I happened to witness this very scenario play out 😅

@Geal
Copy link
Contributor Author

Geal commented Nov 30, 2021

your experience is very welcomed :)

@Geal
Copy link
Contributor Author

Geal commented Dec 1, 2021

I think I'll go with sampling rates in the configuration only for now, and we'll see in a follow up PR for client driven sampling

@o0Ignition0o
Copy link
Contributor

Sounds like a good first step! Can we add this to the ADR, maybe in #213 ?

@Geal Geal mentioned this pull request Dec 1, 2021
@Geal
Copy link
Contributor Author

Geal commented Dec 1, 2021

a small setback for now: with tracing-opentelemetry, the sampling decision has to be made before the span ends: https://docs.rs/tracing-opentelemetry/latest/tracing_opentelemetry/trait.PreSampledTracer.html

I'll look into making a filter elsewhere instead of using opentelemetry's sampling

@Geal Geal mentioned this pull request Dec 1, 2021
@Geal Geal self-assigned this Dec 1, 2021
@Geal
Copy link
Contributor Author

Geal commented Dec 8, 2021

closing this, the work is happening in #224

@Geal Geal closed this Dec 8, 2021
@Geal Geal deleted the test-sampling branch January 5, 2022 10:08
tinnou pushed a commit to Netflix-Skunkworks/router that referenced this pull request Oct 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants