Improve store timeouts #1789

IKSIN · 2019-11-26T10:52:47Z

We want to improve the stability of Thanos infrastructure with many stores/sidecars (49 in our case) when some stores respond slowly.

Changes

Fix and improve TestProxyStore_SeriesSlowStores:
- Fix the broken test.
- Add some use cases.
- Check the elapsed time when getting data from stores.
Change logic in ProxyStore:
- Move the store response timeout from the MergedSet.Next() function to a goroutine in startStreamSeriesSet, so that timeouts are checked in parallel.
- Add ProxyStore metrics.

More details in comments.

Verification

Tests
Running in production on our system with 49 store backends for 2 weeks already.

IKSIN · 2019-11-26T10:59:13Z

Old ProxyStore logic:

New ProxyStore logic:

IKSIN · 2019-11-26T11:26:10Z

This PR fixes the case when 2 or more stores are responding slowly.
This happens, for example, when
a) a common S3 storage is degraded, so all stores become slow, or
b) multiple Prometheus instances are deployed to the same host and the host is under high CPU pressure.

This PR also fixes the double timeout in case of warnings.

Also, we've separated response time (RT) metrics with and without payload:
before this change we saw a pretty low RT for all stores because most queries don't return any data. But when a store returns a payload, the RT is usually much higher.

This change works well in most cases because when a store responds slowly, it is usually slow from the first data chunk.

Example:
You have 3 stores backed by Ceph and 3 fast Prometheus instances/sidecars.
Ceph is degraded and responding very slowly (~infinite time).
Query timeout = 2m, response timeout = 10s.
Before:
We wait at least 10s x 3, because the response timeout timer was started inside MergedSet -> Next().
Now:
We wait at least 10s, because the timeout runs in parallel for all stores.

d-ulyanov · 2019-11-27T10:22:47Z

GiedriusS · 2019-11-27T10:52:30Z

Seems like you have rebased this on master but you have recreated all of the commits with you as the author. Could we please fix this and only leave the proper commits before a review?

d-ulyanov · 2019-11-27T11:02:53Z

@GiedriusS sure

bwplotka

Nice, I think I like it, but it would be nice to have a commit title that is open about what we fix, which is:

The client timeout was only applied when Next was called for the corresponding store, which might happen only after another slow store.

However, I have some comments and suggestions (:

bwplotka · 2019-11-26T19:55:50Z

pkg/store/proxy.go

@@ -41,6 +41,20 @@ type Client interface {
 	Addr() string
 }

+const WITH_PAYLOAD_LABEL = "with_payload"


Go constants should still be camel case (:

bwplotka · 2019-11-27T11:21:40Z

pkg/store/proxy.go

+func newProxyStoreMetrics(reg *prometheus.Registry) *proxyStoreMetrics {
+	var m proxyStoreMetrics
+
+	m.firstRecvDuration = prometheus.NewSummaryVec(prometheus.SummaryOpts{


Should we really use summaries here? (: Can we switch to histograms, maybe?

bwplotka · 2019-11-27T11:22:26Z

pkg/store/proxy.go

+
+	m.timeoutRecvCount = prometheus.NewCounterVec(prometheus.CounterOpts{
+		Name: "thanos_proxy_timeout_recv_count",
+		Help: "Timeout recv count.",


Not really helpful help (:

bwplotka · 2019-11-27T11:22:48Z

pkg/store/proxy.go

+
+	m.firstRecvDuration = prometheus.NewSummaryVec(prometheus.SummaryOpts{
+		Name:       "thanos_proxy_first_recv_duration",
+		Help:       "Time to get first part data from store(ms).",


Probably should be float64 seconds

Time to first byte is what we can call it

bwplotka · 2019-11-27T11:24:20Z

pkg/store/proxy.go

+		Help:       "Time to get first part data from store(ms).",
+		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
+		MaxAge:     2 * time.Minute,
+	}, []string{"store", "payload"})


We talked about store in metrics - this might leak cardinality (changing IP address), so I think we have to hook it to external labels and store TYPE as we do in storeset.

Our thanos-store instances have very large external labels, so external labels are not convenient. Also, the order of external_labels can change from time to time. Maybe we can think of a more convenient way to identify stores? For example, by bucket_name?

See storeset, we already do that (we also sort it), so if that's the problem we need to fix it everywhere (:

I can't imagine working comfortably with such external_labels, presented as a JSON string =)

Can you show querier /metrics page?

In this case, we need to fix this in a separate PR `stores hash`, as you already have a metric with this label:

thanos/pkg/query/storeset.go

Line 94 in 7e11afe

[]string{"external_labels", "store_type"}, nil,

(:

thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="B"},{dc="mts",env="production",prometheus_replica="prometheus01z1.h.o3.ru"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="B"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="A"}"} 1
thanos_store_node_info{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="B"}"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="B"}",store_type="store"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-0",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-1",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-2",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-3",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-4",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-apps-5",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="B"},{dc="mts",env="production",prometheus_replica="prometheus01z1.h.o3.ru"}",store_type="store"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-hardware-0",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="B"},{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="A"},{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="B"}",store_type="store"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-0",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-ingress-1",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-other-0",prometheus_replica="B"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="A"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{dc="m9",env="production",prometheus="monitoring/prometheus-system-0",prometheus_replica="B"}",store_type="sidecar"} 1

All metrics have external_labels. It looks ugly and unreadable, and the JSON string is too hard to process in Grafana (for naming graphs).

It's crazy!

I still think this should be resolved outside of this PR (:

Again, I think address might add some cardinality, but I am tempted to allow it under some flag...

bwplotka · 2019-11-27T11:38:32Z

pkg/store/proxy.go

+					if r.Size() > 0 {
+						metrics.withPayload.Observe(time.Since(t0).Seconds())
+					} else {
+						metrics.withoutPayload.Observe(time.Since(t0).Seconds())


Can we just skip the recv without payload? We care about time to first byte IMO

We observe strange data from this metric (we set a 10s timeout):

Therefore, we want to keep the without_payload metrics to investigate problems in the future.

Wait, this is because it's EOF, which we should filter out. This is the normal response when the server closes the stream, so it should be filtered out. It is also received when the context is canceled or the timeout was triggered; that's why you see 10s, I think.

Sorry, I don't understand... I think that we must see the minimal RT for the first byte without payload...

No, EOF is when the server closes a stream, but we count that as a first byte without payload. I believe that's wrong.

I think that without_payload == EOF on the first byte and it's a representative metric...

Could you add a check for err here before sending it back? It probably is equal to io.EOF in such a case and it should be filtered out. In fact, on line 448 there is: if rr.err == io.EOF { ... }. We probably shouldn't check for this in two places as well.

I don't see a reason to check for errors here. All errors are processed outside of this goroutine. This goroutine is only needed to get data asynchronously.

I also think we don't need without_payload metrics, let's just have time_to_first_byte metrics. What are you going to do differently when you see without_payload vs with_payload metrics increasing?

I agree here. This metric roughly says "time until a node tells us that it doesn't have any metrics". In what cases do you think it could be useful? payload is a bit misleading as well, IMHO. Payload is "the actual information or message in transmitted data" and even if we get EOF it still is some kind of data. Even if we keep this around, I'd suggest renaming this to perhaps response_kind that could be no_metrics or metrics.

bwplotka · 2019-11-27T11:42:28Z

pkg/store/proxy.go

+				ctx, cancel = context.WithTimeout(ctx, s.responseTimeout)
+				defer cancel()
+			}
+			rCh := make(chan *recvResponse, 1)


Instead of allocating this every time we can reuse this channel

On timeout we need to wait for the last recv before closing the channel, because it may still send an unused data frame; therefore we can't reuse the first channel.

bwplotka · 2019-11-27T11:45:11Z

pkg/store/proxy.go

 		for {
-			r, err := s.stream.Recv()
+			var cancel context.CancelFunc
+			if s.responseTimeout != 0 {


Just curious: can we have this timeout for a whole response? Do we really need it per frame? Plus we might allocate a lot here so canceling per stream frame would be nice.

If we want to use a timeout for the whole response:

- We must set such a timeout to roughly the query response time. In this case we lose the opportunity to quickly cancel the response from a slow store (for example).

- We need to increase the channel buffer (for truly parallel reads from stores).

Yes, you are right all is connected and dependent.

I actually think we should increase the buffer at some point, to some degree, but not here. Let's keep it per frame.

It would be nice to adjust the timeout description then to mention frames. Right now it is:

If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.

Maybe to:

If a Store doesn't send any frame in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.

I tried increasing the buffer to fit the whole response.
Increasing the buffer only affected memory usage; it did not change RT, regardless of slow or fast stores )

In my opinion, this memory will be buffered anyway at some point, but anyway, let's not introduce this here. I think we all like this PR with this logic: making sure we time out precisely on the first byte from the slow store.

bwplotka · 2019-11-27T11:48:13Z

pkg/store/proxy.go

+			case rr = <-rCh:
+			}
+			close(rCh)
+			err := rr.err


Can we use those directly? Do we need those local variables?

pkg/store/proxy.go

bwplotka · 2019-11-27T17:16:44Z

I think I addressed all the comments (:

bwplotka · 2019-12-03T13:37:48Z

I think this PR makes sense, just some suggestions, any movement here? (:

d-ulyanov · 2019-12-03T21:47:02Z

I suppose we'll continue with this PR next week, @IKSIN is on vacation currently :)

IKSIN · 2019-12-13T13:05:26Z

PR updated

GiedriusS · 2019-12-15T21:56:58Z

pkg/store/proxy.go

+				queryTimeoutCount: s.metrics.queryTimeoutCount.WithLabelValues(st.LabelSetsString(), storeTypeStr),
+			}
+
+			seriesSet = append(seriesSet, startStreamSeriesSet(seriesCtx, s.logger, closeSeries,


This should be under the comment, no? (:

No, it's not meant to be commented out. Maybe you mean adding a comment?

There is a comment on line 317:

// Schedule streamSeriesSet that translates gRPC streamed response ...

Could this be moved under that comment or removed?

Moved under the comment

pkg/store/proxy.go

GiedriusS · 2019-12-15T22:11:09Z

pkg/store/proxy.go

+			frameTimeoutCtx := context.Background()
+			var cancel context.CancelFunc
+			if s.responseTimeout != 0 {
+				frameTimeoutCtx, cancel = context.WithTimeout(frameTimeoutCtx, s.responseTimeout)


Could we move the construction of this context out of this function? It should probably make things easier to understand. WDYT, @bwplotka? This part is really becoming complex :(

What are your thoughts on this, @IKSIN ?

moved to function

pkg/store/proxy.go

IKSIN · 2020-01-13T14:32:18Z

updated

IKSIN · 2020-02-06T10:54:12Z

@povilasv @bwplotka Any comments?

povilasv

It looks ok to me. Thanks for the work 🥇 Let's wait for other maintainers' opinions

bwplotka

Super nice I love it!

I have some suggestions though, but it looks almost ready to go ❤️
Thanks!

BTW: It was super nice to see you at FOSDEM! (: 👍

pkg/store/proxy.go

bwplotka · 2020-02-06T15:25:26Z

pkg/store/proxy.go

+	err error
+}
+
+func startFrameCtx(responseTimeout time.Duration) (context.Context, context.CancelFunc) {


Suggested change

func startFrameCtx(responseTimeout time.Duration) (context.Context, context.CancelFunc) {

func frameCtx(responseTimeout time.Duration) (context.Context, context.CancelFunc) {

bwplotka · 2020-02-06T15:25:55Z

pkg/store/proxy.go

@@ -384,14 +394,34 @@ func startStreamSeriesSet(
 			}
 		}()
 		for {
-			r, err := s.stream.Recv()
+			frameTimeoutCtx, cancel := startFrameCtx(s.responseTimeout)
+			if cancel != nil {


just defer cancel always and return func() {} not nil

bwplotka · 2020-02-06T15:27:41Z

pkg/store/proxy.go

+			}
+			rCh := make(chan *recvResponse, 1)
+			var rr *recvResponse
+			go func() {


So we care about the first frame right? or timeout for all frames?

In any case I think we need to start this goroutine before the for loop and have a for loop inside it as well.

This will make sure that we only have 2 goroutines running: one waiting for recv or context cancel, the second for reading.

The current implementation will constantly allocate a new channel and goroutine. For 1000 frames x 100 concurrent queries this might matter.

I wish we had benchmarks for querier ))): Like we do for Store now:

thanos/pkg/store/bucket_test.go

Line 1071 in 88f6be8

func BenchmarkSeries(b *testing.B) {

Actually added issue: #2105

bwplotka · 2020-02-06T15:35:12Z

pkg/store/proxy.go

+func (s *streamSeriesSet) timeoutHandling(isQueryTimeout bool, ctx context.Context) {
+	var err error
+	if isQueryTimeout {
+		err = errors.Wrap(ctx.Err(), fmt.Sprintf("failed to receive any data from %s", s.name))


What about just passing error to propagate instead here? and rename method to handleErr?

Signed-off-by: Aleskey Sin <asin@ozon.ru>

Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

IKSIN · 2020-02-13T18:28:43Z

@bwplotka PR updated) w/o benchmarks(

pkg/store/proxy.go

Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

bwplotka · 2020-02-14T16:43:23Z

Looking!

povilasv

👍

bwplotka

LGTM, just small nits.

bwplotka · 2020-02-17T13:07:03Z

pkg/store/proxy.go

@@ -383,78 +393,79 @@ func startStreamSeriesSet(
 				emptyStreamResponses.Inc()


I don't think we want to increment this if the context was actually cancelled.. right?

Yes, that's why numResponses++ happens only when a recv is processed.

bwplotka · 2020-02-17T13:10:32Z

pkg/store/proxy.go

+			select {
+			case <-ctx.Done():
+				close(done)
+				err = errors.Wrap(ctx.Err(), fmt.Sprintf("failed to receive any data from %s", s.name))


Why separate err variable?

also can we use Wrapf instead of sprintf?

bwplotka · 2020-02-17T13:10:45Z

pkg/store/proxy.go

+				return
+			case <-frameTimeoutCtx.Done():
+				close(done)
+				err = errors.Wrap(frameTimeoutCtx.Err(), fmt.Sprintf("failed to receive any data in %s from %s", s.responseTimeout.String(), s.name))


bwplotka · 2020-02-17T13:10:55Z

pkg/store/proxy.go

 				return
 			}

+			if rr.err != nil {
+				wrapErr := errors.Wrapf(rr.err, "receive series from %s", s.name)


Why another var? (: We can inline this.. not a blocker though.

bwplotka · 2020-02-17T13:14:35Z

pkg/store/proxy.go

+			if rr.err != nil {
+				wrapErr := errors.Wrapf(rr.err, "receive series from %s", s.name)
+				s.handleErr(wrapErr)
+				close(done)


can we close done in handleErr?

Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

IKSIN · 2020-02-18T12:11:51Z

@bwplotka Updated!)

bwplotka

Let's go! 💪

Thank you for this!

bwplotka · 2020-02-18T15:21:30Z

pkg/store/proxy.go

 				return
 			}

+			if rr.err != nil {
+				wrapErr := errors.Wrapf(rr.err, "receive series from %s", s.name)


Why another var? (: We can inline this.. not a blocker though.

d-ulyanov · 2020-02-18T18:02:16Z

Hooray! :)
Thank you for review, @bwplotka @GiedriusS @povilasv !
🍻

* Improve proxyStore timeouts. Signed-off-by: Aleskey Sin <asin@ozon.ru>
* Fix send to closed channel. Signed-off-by: Aleskey Sin <leks.sin@gmail.com>
* Update for PR. Signed-off-by: Aleskey Sin <leks.sin@gmail.com>
* Fix recv done channel. Signed-off-by: Aleskey Sin <leks.sin@gmail.com>
* PR fixes. Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

IKSIN force-pushed the slow_tests branch from 50bfbd5 to ffeaef3 Compare November 26, 2019 10:54

IKSIN force-pushed the slow_tests branch from 2b86a72 to 9f2f125 Compare November 26, 2019 11:06

GiedriusS requested review from bwplotka, povilasv and GiedriusS November 27, 2019 10:52

IKSIN closed this Nov 27, 2019

IKSIN mentioned this pull request Nov 27, 2019

Improve ProxyStore timeouts #1804

Closed

9 tasks

bwplotka reviewed Nov 27, 2019

IKSIN reopened this Nov 27, 2019

IKSIN force-pushed the slow_tests branch from a97de52 to 7977409 Compare November 27, 2019 14:24

IKSIN force-pushed the slow_tests branch 2 times, most recently from 780cc1d to 0a287a3 Compare December 13, 2019 10:49

IKSIN requested a review from bwplotka December 13, 2019 13:05

GiedriusS reviewed Dec 15, 2019

GiedriusS reviewed Dec 15, 2019

povilasv reviewed Jan 7, 2020

IKSIN force-pushed the slow_tests branch from ec64370 to f4c264c Compare January 13, 2020 10:53

IKSIN requested review from GiedriusS and povilasv January 15, 2020 08:57

IKSIN force-pushed the slow_tests branch 2 times, most recently from cca28ff to b5313d5 Compare February 3, 2020 10:29

IKSIN requested a review from povilasv February 3, 2020 11:18

povilasv approved these changes Feb 6, 2020

bwplotka requested changes Feb 6, 2020

IKSIN force-pushed the slow_tests branch 4 times, most recently from b6b13a1 to d850c92 Compare February 13, 2020 16:10

Aleskey Sin and others added 3 commits February 13, 2020 19:11

Improve proxyStore timeouts.

d3db28f

Signed-off-by: Aleskey Sin <asin@ozon.ru>

Fix send to closed channel.

e260217

Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

Update for PR.

9c6412c

Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

IKSIN force-pushed the slow_tests branch from d850c92 to 9c6412c Compare February 13, 2020 16:36

IKSIN requested a review from bwplotka February 13, 2020 18:28

povilasv reviewed Feb 14, 2020

Fix recv done channel.

ec2441d

Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

IKSIN requested a review from povilasv February 14, 2020 09:01

povilasv approved these changes Feb 17, 2020

bwplotka approved these changes Feb 17, 2020

PR fixes.

6fcb677

Signed-off-by: Aleskey Sin <leks.sin@gmail.com>

IKSIN force-pushed the slow_tests branch from c382bda to 6fcb677 Compare February 18, 2020 11:50

bwplotka approved these changes Feb 18, 2020

bwplotka merged commit a354bfb into thanos-io:master Feb 18, 2020

GiedriusS mentioned this pull request Apr 12, 2020

store: proxy: fix queries never timing out bug #2411

Merged
