Panic errors in query_range calls #329

Closed · franek1709 opened this issue Jul 19, 2020 · 8 comments · Fixed by #331

franek1709 commented Jul 19, 2020

We use Promxy with VictoriaMetrics as metrics storage. After updating VictoriaMetrics to the latest version (1.38.1) we encountered problems with query_range calls: Promxy started to crash while executing such queries. We found the following entry in the Promxy log:

panic: runtime error: index out of range [0] with length 0

goroutine 66626 [running]:
github.com/jacksontj/promxy/pkg/promhttputil.MergeSampleStream(0x0, 0xc001b19480, 0xc001377fe0, 0xc002100710, 0x1, 0x0)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promhttputil/merge.go:189 +0x7b9
github.com/jacksontj/promxy/pkg/promhttputil.MergeValues.func2(0xc001377fe0)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promhttputil/merge.go:150 +0xcb
github.com/jacksontj/promxy/pkg/promhttputil.MergeValues(0x0, 0x270b8a0, 0xc001e3fc40, 0x270b8a0, 0xc001498220, 0x1, 0xc000ef9260, 0x2744ce0, 0xc000455ab0)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promhttputil/merge.go:162 +0x282
github.com/jacksontj/promxy/pkg/promclient.(*MultiAPI).QueryRange(0xc000997720, 0x2733f60, 0xc001111580, 0xc000268a00, 0x32, 0x0, 0xed6a5ee47, 0x3940c20, 0x0, 0xed6a5ee83, ...)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promclient/multi_api.go:410 +0x7d8
github.com/jacksontj/promxy/pkg/promclient.(*TimeTruncate).QueryRange(0xc0003487e0, 0x2733f60, 0xc001111580, 0xc000268a00, 0x32, 0x0, 0xed6a5ee47, 0x3940c20, 0x0, 0xed6a5ee83, ...)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promclient/time_truncate.go:34 +0x1cd

These crashes occur every few seconds. Luckily we started the VictoriaMetrics update in our dev environment, so we were able to discover it before going to production :) We use Promxy 0.0.57, but we also tried the latest version in our setup and the crashes still occur. We also tried to inspect the failing queries; unfortunately the crashes look nondeterministic. We executed the failing queries manually and were not able to reproduce the restarts.

An example of one of the failing queries:

/api/v1/query_range?query=sum(process_descriptors_usage{service="test-service"}) by (instance, port)&end=1595148309&start=1595148249&step=60s
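
(For reference only, not from the original report: the same range query can be replayed against a single downstream with the Prometheus Go client to compare its raw response with what Promxy merges. This is a minimal sketch; the VictoriaMetrics address is a placeholder and it assumes a client_golang version whose QueryRange returns warnings.)

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address for a single VictoriaMetrics downstream.
	client, err := api.NewClient(api.Config{Address: "http://victoriametrics:8428"})
	if err != nil {
		log.Fatal(err)
	}
	papi := v1.NewAPI(client)

	// Same query and time window as the failing call above.
	query := `sum(process_descriptors_usage{service="test-service"}) by (instance, port)`
	r := v1.Range{
		Start: time.Unix(1595148249, 0),
		End:   time.Unix(1595148309, 0),
		Step:  60 * time.Second,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	val, warnings, err := papi.QueryRange(ctx, query, r)
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// val is a model.Matrix for range queries; a series with an empty value
	// list here is the shape of response that tripped Promxy's merge code.
	fmt.Println(val)
}
```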
valyala added a commit to VictoriaMetrics/VictoriaMetrics that referenced this issue Jul 20, 2020
… datapoints from /api/v1/query_range

This matches Prometheus behaviour.

This should fix jacksontj/promxy#329
jacksontj added a commit that referenced this issue Jul 20, 2020
@jacksontj jacksontj added the bug label Jul 20, 2020
@jacksontj
Owner

Thanks for the report! It seems that VM has fixed the issue on their side, but this is definitely something promxy should handle. I have a fix in #331 so it won't panic in that situation.
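
(The #331 diff itself is not quoted in this thread. As a rough illustration only, the kind of guard that prevents indexing an empty Values slice when one downstream returns a series with no datapoints might look like the sketch below, using the github.com/prometheus/common/model types; the helper name is hypothetical.)

```go
package promhttputil

import "github.com/prometheus/common/model"

// mergeSampleStream is a hypothetical helper: if either side has no
// datapoints, return the other side instead of touching Values[0].
func mergeSampleStream(a, b *model.SampleStream) *model.SampleStream {
	if a == nil || len(a.Values) == 0 {
		return b
	}
	if b == nil || len(b.Values) == 0 {
		return a
	}

	// Merge the two timestamp-sorted []model.SamplePair slices,
	// preferring a's value when both have a point at the same timestamp.
	merged := make([]model.SamplePair, 0, len(a.Values)+len(b.Values))
	i, j := 0, 0
	for i < len(a.Values) && j < len(b.Values) {
		switch {
		case a.Values[i].Timestamp < b.Values[j].Timestamp:
			merged = append(merged, a.Values[i])
			i++
		case a.Values[i].Timestamp > b.Values[j].Timestamp:
			merged = append(merged, b.Values[j])
			j++
		default:
			merged = append(merged, a.Values[i])
			i++
			j++
		}
	}
	merged = append(merged, a.Values[i:]...)
	merged = append(merged, b.Values[j:]...)

	return &model.SampleStream{Metric: a.Metric, Values: merged}
}
```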

jacksontj added a commit that referenced this issue Jul 20, 2020
jacksontj added a commit that referenced this issue Jul 20, 2020
@franek1709
Author

Thanks for the response and for fixing my issue! I am going to test it tomorrow. BTW: is there any way to enable more detailed logs to make debugging similar problems easier? In this case only the stack trace was logged (I set the log level to debug). I suspected there was a problem with parsing some of the responses returned from VictoriaMetrics, but without the actual response logs I couldn't find a way to prove it.

@jacksontj
Owner

If you enable trace-level logging it will log the parsed output from the downstreams (in this case it would have shown that one of them had no points in it); getting more detailed than that would be difficult, as the upstream client lib (from Prometheus) doesn't log anything else :/ So far trace-level logging has been sufficient for most issues I've seen.
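
(Sketch only; Promxy's actual flag names and log sites are not shown in this thread. Assuming a logrus-style logger, trace level sits one step below debug, which is why the parsed downstream values only show up there:)

```go
package main

import (
	"github.com/prometheus/common/model"
	"github.com/sirupsen/logrus"
)

func main() {
	// At debug level the Tracef line below is suppressed; trace level shows it.
	logrus.SetLevel(logrus.TraceLevel)

	// Pretend this is the parsed output from one downstream: an empty matrix,
	// like the no-datapoints series that triggered the panic.
	var val model.Value = model.Matrix{}

	logrus.Debugf("queried downstream")
	logrus.Tracef("downstream returned value_type=%s value=%v", val.Type(), val)
}
```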

@franek1709
Author

I tested the newly released version. Unfortunately the problem still occurs. It's quite strange that the stack trace looks similar:

panic: runtime error: index out of range [0] with length 0

goroutine 12045 [running]:
github.com/jacksontj/promxy/pkg/promhttputil.MergeSampleStream(0x0, 0xc0012d2c80, 0xc000f55b60, 0xc001db5450, 0x1, 0x0)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promhttputil/merge.go:189 +0x7b9
github.com/jacksontj/promxy/pkg/promhttputil.MergeValues.func2(0xc000f55b60)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promhttputil/merge.go:150 +0xcb
github.com/jacksontj/promxy/pkg/promhttputil.MergeValues(0x0, 0x270b8a0, 0xc0023df580, 0x270b8a0, 0xc0010e2a00, 0x1, 0xc001118540, 0x2744ce0, 0xc000288e00)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promhttputil/merge.go:162 +0x282
github.com/jacksontj/promxy/pkg/promclient.(*MultiAPI).QueryRange(0xc0003fad20, 0x2733f60, 0xc00024ca40, 0xc000f4a180, 0x32, 0x0, 0xed6a88d4b, 0x3940c20, 0x0, 0xed6a88d87, ...)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promclient/multi_api.go:410 +0x7d8
github.com/jacksontj/promxy/pkg/promclient.(*TimeTruncate).QueryRange(0xc0005f0820, 0x2733f60, 0xc00024ca40, 0xc000f4a180, 0x32, 0x0, 0xed6a88d4b, 0x3940c20, 0x0, 0xed6a88d87, ...)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/promclient/time_truncate.go:34 +0x1cd
github.com/jacksontj/promxy/pkg/proxystorage.(*ProxyStorage).NodeReplacer(0xc00019dc30, 0x2733f60, 0xc00024ca40, 0xc0006b3310, 0x26dd740, 0xc0001e8e80, 0xc0004b6f60, 0x0, 0xc00024ca00, 0xc000f50000)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/pkg/proxystorage/proxy.go:458 +0x1bbc
github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql.Walk(0x2733f60, 0xc00024ca40, 0x26dfd60, 0xc00024c600, 0xc0006b3310, 0x26dd740, 0xc0001e8e80, 0xc00025ad90, 0x7, 0x7, ...)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/ast.go:275 +0x186f
github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql.Walk.func1(0x2733f60, 0xc00024ca40, 0x26dfd60, 0xc00024c600, 0xc0006b3310, 0xc00024c3c0, 0xc00019dd30, 0xc00117aa80, 0xc00025ad90, 0x7, ...)
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/ast.go:328 +0xf1
created by github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql.Walk
	/home/tjackson/workspace/golang/src/github.com/jacksontj/promxy/vendor/github.com/prometheus/prometheus/promql/ast.go:327 +0xd10

I analyzed the code and it makes no sense to me. At merge.go:189 there is the validation you added, so I can't see how an "out of bounds" error could come from that line. I even built Promxy myself from the master branch to be sure I was using the newest version.
I'm still investigating it. Tomorrow I will try to update VictoriaMetrics to the latest version built from its master branch.

@jacksontj
Owner

The stack trace here seems to come from the old code; line 189 is a return (meaning it can't generate that panic). When you tested with the new release version, I assume you pulled the binary off of the releases page? If so, maybe something is wrong with the build :/

@jacksontj jacksontj reopened this Jul 21, 2020
@franek1709
Author

Yes. I downloaded the binary from the releases page and even verified its checksum against the checksum file available there. Then I built my own binary from the master branch. I'm aware that the stack trace suggests I'm still running the old binary, and I'm really confused :(. I will continue my research tomorrow.

@franek1709
Author

It was a good idea to stop investigating this problem yesterday. This morning I discovered an additional problem on our side: it looks like after log rotation new log entries stopped appearing (even though the modification date of the log file kept being updated!), which is what made me think the panic was still occurring. I found and fixed the logrotate problem this morning, and now everything works correctly with the binary you released. Sorry for bothering you, and thanks again for your help.
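
(General background rather than a diagnosis from this thread: a common cause of log entries vanishing after rotation is that the process keeps writing through its old file descriptor to the rotated file. Either copytruncate in the logrotate config or reopening the log file on a signal avoids that. A minimal Go sketch of the reopen-on-SIGHUP pattern, with a placeholder path:)

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func openLog(path string) *os.File {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	return f
}

func main() {
	const path = "/var/log/promxy/promxy.log" // placeholder path
	f := openLog(path)
	log.SetOutput(f)

	// When logrotate sends SIGHUP, reopen the file so writes go to the
	// freshly created log instead of the renamed one.
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		for range hup {
			newF := openLog(path)
			log.SetOutput(newF)
			f.Close()
			f = newF
		}
	}()

	select {} // stand-in for the real work loop
}
```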

@jacksontj
Owner

jacksontj commented Jul 22, 2020 via email
