Metric corruption when using http2 nginx reverse proxy with influxdb output #2854
I tried this out on Debian, although with fewer input plugins, but was unable to reproduce.
[[outputs.influxdb]]
urls = ["https://loaner:443"]    # writes go through the nginx reverse proxy in front of InfluxDB
database = "telegraf_nginx"
retention_policy = ""
write_consistency = "any"
username = "dbn"
password = "howdy"
timeout = "5s"
ssl_ca = "/home/dbn/.ssl/ca/cacert.pem"    # CA certificate used to verify the proxy's TLS certificate
Does it work if you take nginx out? What is the output of
It works if I take nginx out - that's how it is running right now. It also works with the reverse proxy if I downgrade telegraf back to 1.2.1. I also observed the problem with a self-built version (looking at my clone of the repo, that was c66e289), but that was on a test box and I forgot to file an issue for it. When upgrading to 1.3 I mostly left the config intact, except for adding 2 new inputs:
Locale output:
Telegraf envs, Ubuntu 16.04 LTS:
Debian 8:
Here are the full nginx configs:
Corresponding vhost:
Nginx also auto-includes the following settings (in the http {} context):
Open file cache:
SSL settings:
Still testing.
In another case, there was a measurement with (rabbitmq plugin, rabbit name rabbit@php-4):
There was also a measurement with the name
And stuff like measurement =SPU with series =SPU,type=Spurious\ interrupts. What's interesting is that the test database I created started getting metrics which should not be in that DB (the test DB is p3, and I see broken metrics from hosts that should be in database p1). UPD: looks like I have found the issue. It was http/2. Still testing.
Two hours with http/2 disabled and no broken metrics. No other parameters were changed. I bet on http/2.
So the workaround was changing
Yes. But I would like to have it working both ways, since it's not possible to disable http2 for just one vhost - at least on nginx 1.13.0, if I enable http2 on one vhost (on the same IP) it is automatically enabled for all other hosts.
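For illustration, here is a rough sketch of why that happens: the http2 flag on a listen directive applies to the whole listening socket, so server blocks sharing the same IP:port pair effectively share the setting. The server names below are made up and this is only a sketch, not a copy of the actual config.

server {
    listen 443 ssl http2;             # http2 is negotiated for the whole 443 listening socket
    server_name metrics.example.com;
}

server {
    listen 443 ssl;                   # same address:port, so clients can use http2 here as well
    server_name other.example.com;
}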
Yeah, for sure this needs to be fixed.
Could be related to #2862.
Unfortunately, I don't think that fix will help, so I'm going to reopen.
@gavinzhou are you also using http2?
I tried to duplicate this again with http2 but still no luck. Here is my nginx configuration:
I also added a query string to the output as hinted by @gavinzhou:
My inputs are:
Versions:
No sign of bad metrics. How long does it usually take for corrupt metrics to appear?
Around 10-30 minutes. Maybe multiple clients cause it?
(I'm actually pretty sure it's caused by multiple clients - I have several databases for different projects, and I saw broken metrics from, say, db "staging" in db "testing" and vice versa.)
Okay, I only ran it for about 10 minutes. I'll try to leave it running longer and with multiple clients.
I can probably forward some of my personal stuff to your server if that helps.
I'll let you know if I need it. About how many telegrafs are you running?
Could be interesting to see if this problem remains with go 1.9, though I have no indication that it would make a difference.
The thing that caught my eye:
@danielnelson will 1.4.0 be compiled with go1.9?
The official packages will be built with go1.8.
Hi. Here's my take on this, coming from a previously reported issue, #3157.
I'm looking through the code again, and I think it may be possible for the Content-Length header to be incorrect. Can someone, if it has not already been done, test with the content_encoding = "gzip" option?
@rlex Nightly builds are now built with Go 1.9; do you think you could also test with content_encoding = "gzip"?
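For reference, a minimal sketch of the output section with that option enabled; the URL, database, and timeout below are placeholders carried over from the earlier example, not a confirmed working setup.

[[outputs.influxdb]]
urls = ["https://loaner:443"]    # still pointing at the nginx reverse proxy
database = "telegraf_nginx"
timeout = "5s"
content_encoding = "gzip"        # compress the request body before it is sent through the proxy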
@scanterog has been testing with content_encoding = "gzip".
Hi guys. It has been 5 days for me now and all looks ok. No errors anymore. So I can say content_encoding = "gzip" works around it for me.
Great news! I'll work on fixing the Content-Length header, which I think will also allow it to work when gzip is not enabled.
I have exactly the same problem with telegraf 1.4.4:
@LeJav Are you also using nginx as an HTTP/2 reverse proxy? Is telegraf's content_encoding option set to "gzip"? The reason I ask is that I wonder whether this is related to this nginx issue, which describes a bug where the client request body may be corrupted when streaming data using HTTP/2. If the answer to the above is yes, it would be interesting to try the following to see if it helps at all:
Many thanks for your help. For the nginx reverse proxy, I have not specified proxy_request_buffering, so it is on by default (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_request_buffering). I have set content_encoding = "gzip". I am using nginx 1.10.0. Is it worth trying proxy_request_buffering set to off and removing content_encoding from the telegraf config?
@LeJav If you can do this, it would be very much appreciated.
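For illustration, a minimal sketch of what the proxy_request_buffering change being discussed could look like in the proxy location block; the upstream address and the other directives here are assumptions, not taken from the actual config.

location / {
    proxy_pass http://127.0.0.1:8086;    # assumed InfluxDB upstream
    proxy_http_version 1.1;
    proxy_request_buffering off;         # stream the client request body to the upstream instead of buffering it first
}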
@danielnelson: I will try it next weekend.
Sorry for this long delay. I will check within a couple of days if it is OK or if new corrupted measurements have been created.
16 hours later, I have several corrupted measurements. So this does not work.
The InfluxDB output has been rewritten for 1.6.0, as well as the metric serializer. It would be very much appreciated if someone could retest this issue.
Please let me know if anyone runs into this issue in 1.6 or later.
Hi Daniel,
@LeJav Understandable, thanks for all the help so far.
Bug report
Telegraf stopped working with the nginx reverse proxy since the 1.3 release
Relevant telegraf.conf:
System info:
Telegraf 1.3.0-1, InfluxDB 1.2.4-1, nginx 1.13.0-1~xenial
Steps to reproduce:
Expected behavior:
Metrics should be arriving properly
Actual behavior:
Lots of broken metrics with strange Unicode (?) symbols.
Additional info:
Sometimes I saw lines like "2017-05-21T03:33:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write:
unable to parse '.ru timst=rails-2 total=4387237888i,used=2909868032i,free=1477369856i,used_percent=66.32574084845248 1495337590000000000': invalid boolean]"
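For context, a well-formed InfluxDB line-protocol point carries a measurement name, tags, fields, and a timestamp, roughly like the sketch below; the mem measurement and the host tag are assumptions for illustration, and only the field values are copied from the error above. In the rejected line the measurement name and tag section appear to have been overwritten, so the parser reads "rails-2" as an unquoted field value and rejects it as an invalid boolean.

mem,host=rails-2 total=4387237888i,used=2909868032i,free=1477369856i,used_percent=66.32574084845248 1495337590000000000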
I suspect this is somehow related to #2251