Skip to content

QueryFrontend: Return server side performance information on server-timing header #3685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 26, 2021

Conversation

alanprot
Copy link
Member

@alanprot alanprot commented Jan 13, 2021

What this PR does:

  • Creates a Util class that can be used on any handler to add server side performance information on server-timing http header
  • Return server-timing header with querier wall time information on query requests when -frontend.query-stats-enabled=true
  • Return server-timing header with request_duration on any request on query-frontend

We may want to create a config to enable this feature.

Result on chrome:

image

Which issue(s) this PR fixes:
Enhancement of #3595
Checklist

  • [✓] Tests updated
  • [✓] No Documentation needed
  • [✓] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@alanprot alanprot force-pushed the service-stats branch 3 times, most recently from a7015b0 to f625d77 Compare January 13, 2021 22:13
@alanprot alanprot changed the title Service stats Return server side performance information on server-timing header Jan 13, 2021
@alanprot alanprot changed the title Return server side performance information on server-timing header QueryFrontend: Return server side performance information on server-timing header Jan 13, 2021
@alanprot alanprot force-pushed the service-stats branch 3 times, most recently from e639cbc to 8148525 Compare January 14, 2021 17:55
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alanprot for working on this. I see a value in sending downstream this information.

However, I believe the ServiceStats is a bit over engineered for this. The Cortex philosophy is to start small and let the design evolve when new features (or changes) are introduced. In this PR you're basically introducing a framework to send generic data down to the client, while we only need to send 1 header. I would suggest to start small, remove ServiceStats and just inject the querier_wall_time with httpsnoop in the middleware. We can reconsider it and evolve the design if and when we'll need to send other information downstream.

@alanprot alanprot force-pushed the service-stats branch 10 times, most recently from 8f514a7 to 61d81d6 Compare January 22, 2021 02:55
@alanprot alanprot force-pushed the service-stats branch 4 times, most recently from e8ca8d5 to 55e0a7e Compare January 23, 2021 01:27
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work! The new design is more lightweight and LGTM. I left a couple of questions / comments, but no hard blocker. Thanks!

@@ -302,6 +302,13 @@ func runQueryFrontendTest(t *testing.T, cfg queryFrontendTestConfig) {
assert.Equal(t, model.Time(1595846750806), matrix[0].Values[2].Timestamp)
}

// No need to repeat the test on missing metric name for each user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// No need to repeat the test on missing metric name for each user.
// No need to repeat the test on Server-Timing header for each user.

@@ -88,7 +88,7 @@ func (f *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
if f.cfg.QueryStatsEnabled {
var ctx context.Context
stats, ctx = querier_stats.ContextWithEmptyStats(r.Context())
r = r.WithContext(ctx)
*r = *r.WithContext(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pstibrany Do you think it's safe? I do understand why we need to do it (given the current implementation) but I've mixed feelings about manipulating the original request. Not sure it's something we've already done in other places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very brittle... it may easily break if we insert another middleware between this handler, and ServerTimingMiddleware.

We could move generation of server timing header into this handler.

Alternatively we could move stats logging into the same middleware as server-timing header.

func ServerTimingMiddleware(next http.Handler, statsEnabled bool) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {

if !statsEnabled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing it here, why don't we skip wrapping with ServerTimingMiddleware() at all when query stats are not enabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

})
}

func writeTimingHeader(startTime time.Time, headerWritten *bool, headers http.Header, r *http.Request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a best practice, I would avoid manipulating an input parameter (headerWritten in this case). You can move the if headerWritten logic out side this function, before calling writeTimingHeader().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, make it a method of some type which will hold headerWritten value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


var (
headerWritten = false
startTime = time.Now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're already tracking the response time in transport.Handler. I'm wondering we should just store it in the query stats struct, so that the two values (the one logged and the one reported via HTTP header) perfectly match.

parts := make([]string, 0)
stats := querier_stats.FromContext(r.Context())
if stats != nil {
parts = append(parts, statsValue("querier_wall_time", stats.WallTime))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected unit for the duration in the Server-Timing specs? I'm asking cause we return nanoseconds and I'm wondering how it gets displayed by the browser dev tools.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is formatted to millseconds in statsValue, which seems to be what spec is using. (Although spec is not very explicit about it, and just mentions 'ms' in examples. Weird.).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should never directly read stats.WallTime but use stats.LoadWallTime().

},
Write: func(next httpsnoop.WriteFunc) httpsnoop.WriteFunc {
return func(p []byte) (int, error) {
n, err := next(p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? If next is called first, it may start writing response, and then it will be too late to write timing header afterwards. (Unless we use Trailer, but I don't see that in this code)

w = httpsnoop.Wrap(w, hooks)
next.ServeHTTP(w, r)

writeTimingHeader(startTime, &headerWritten, w.Header(), r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing header after serving request seems to be too late. Under what conditions would one of WriteHeader or Write hooks not write the header already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would happen if Write or WriteHeader is not called on any handler in the chain.

parts := make([]string, 0)
stats := querier_stats.FromContext(r.Context())
if stats != nil {
parts = append(parts, statsValue("querier_wall_time", stats.WallTime))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is formatted to millseconds in statsValue, which seems to be what spec is using. (Although spec is not very explicit about it, and just mentions 'ms' in examples. Weird.).

}

func statsValue(name string, d time.Duration) string {
return name + ";dur=" + strconv.FormatInt(d.Milliseconds(), 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Spec allows value to be float, not just integer, if you want to use higher precision.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

})
}

func writeTimingHeader(startTime time.Time, headerWritten *bool, headers http.Header, r *http.Request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, make it a method of some type which will hold headerWritten value.

@@ -88,7 +88,7 @@ func (f *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
if f.cfg.QueryStatsEnabled {
var ctx context.Context
stats, ctx = querier_stats.ContextWithEmptyStats(r.Context())
r = r.WithContext(ctx)
*r = *r.WithContext(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very brittle... it may easily break if we insert another middleware between this handler, and ServerTimingMiddleware.

We could move generation of server timing header into this handler.

Alternatively we could move stats logging into the same middleware as server-timing header.

@alanprot alanprot force-pushed the service-stats branch 2 times, most recently from b75a6de to beced59 Compare January 25, 2021 21:51
…er-timing header)

Signed-off-by: alanprot <approtas@amazon.com>
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for addressing our feedback! The changeset now looks way more clean!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants