
Metric corruption when using http2 nginx reverse proxy with influxdb output #2854

Closed
rlex opened this issue May 25, 2017 · 53 comments · Fixed by #2869 or #3307
Labels: bug (unexpected problem or unintended behavior), regression (something that used to work, but is now broken)

Comments

@rlex

rlex commented May 25, 2017

Bug report

Telegraf stopped working with an nginx reverse proxy since the 1.3 release.

Relevant telegraf.conf:

# Telegraf configuration

[tags]

# Configuration for telegraf agent
[agent]
    interval = "10s"
    debug = false
    hostname = "my.host.fqdn"
    round_interval = true
    flush_interval = "10s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 1000
    metric_buffer_limit = 10000
    quiet = false

[[outputs.influxdb]]
    urls = ["https://host:8086"]
    database = "telegraf_metrics"
    username = "telegraf_metrics"
    password = "pass"
    retention_policy = ""
    write_consistency = "any"
    timeout = "5s"

[[inputs.cpu]]
    percpu = true
    totalcpu = true
    fielddrop = ["time_*"]
[[inputs.disk]]
    ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.system]]
[[inputs.swap]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.processes]]
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]
[[inputs.kernel_vmstat]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.nstat]]
    dump_zeros = true
[[inputs.conntrack]]
    files = ["ip_conntrack_count","ip_conntrack_max", "nf_conntrack_count","nf_conntrack_max"]
    dirs = ["/proc/sys/net/ipv4/netfilter","/proc/sys/net/netfilter"]

System info:

Telegraf 1.3.0-1, InfluxDB 1.2.4-1, nginx 1.13.0-1~xenial

Steps to reproduce:

  1. Install telegraf + influxdb + nginx
  2. Make a simple reverse proxy ( location / { proxy_pass http://influxdb:8086; } ); a minimal server block sketch follows this list
  3. Start pumping metrics from telegraf (telegraf -> nginx -> influxdb)
  4. From this point, influxdb will be filled with broken, unreadable metrics. They look like partially-written metrics, something like ??????st=some.of.host.i.monitor, or ???sk=sda1.
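
For step 2, a minimal sketch of such a proxy server block (server name, certificate paths, listen port, and the upstream address are placeholders; the listen port should match the urls setting of the telegraf output):

server {
        listen 443 ssl http2;      # ssl and http2, as in the reporter's vhost config shown later in the thread
        server_name metrics.example.org;
        ssl_certificate /etc/nginx/cert.pem;
        ssl_certificate_key /etc/nginx/key.pem;

        location / {
                # forward the original Host header and hand writes off to InfluxDB
                proxy_set_header Host $http_host;
                proxy_pass http://influxdb:8086;
        }
}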

Expected behavior:

Metrics should arrive properly.

Actual behavior:

Lots of broken metrics with strange unicode (?) symbols.

Additional info:

Sometimes I saw lines like "2017-05-21T03:33:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write:
unable to parse '.ru timst=rails-2 total=4387237888i,used=2909868032i,free=1477369856i,used_percent=66.32574084845248 1495337590000000000': invalid boolean]"

I suspect this is somehow related to #2251

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label May 25, 2017
@danielnelson danielnelson added this to the 1.3.1 milestone May 25, 2017
@danielnelson danielnelson added the regression something that used to work, but is now broken label May 25, 2017
@danielnelson
Contributor

I tried this out on Debian, although with fewer input plugins, but was unable to reproduce.

ii  telegraf                                  1.3.0-1                              amd64
ii  influxdb                                  1.2.4-1                              amd64
ii  nginx-full                                1.10.3-1                             amd64
[[outputs.influxdb]]
  urls = ["https://loaner:443"]
  database = 'telegraf_nginx'
  retention_policy = ""
  write_consistency = "any"
  username = "dbn"
  password = "howdy"
  timeout = "5s"
  ssl_ca = "/home/dbn/.ssl/ca/cacert.pem"
$ cat /etc/nginx/sites-available/default
upstream backend {
        server 127.0.0.1:8086 fail_timeout=0;
}

server {
        listen 443 default_server;

        ssl    on;
        ssl_certificate      /etc/nginx/cert.pem;
        ssl_certificate_key  /etc/nginx/key.pem;
        server_name _;
        root /var/www;

        location / {
                proxy_set_header Host $http_host;
                proxy_redirect off;
                proxy_pass http://backend;
        }
}

Does it work if you take nginx out? What is the output of locale?

@rlex
Author

rlex commented May 29, 2017

It works if I take nginx out; that's how it is running right now. It also works with the reverse proxy if I downgrade telegraf back to 1.2.1. I also observed this problem with a self-built version (looking at my clone of the repo, that was c66e289), but that was on a test box and I forgot to file an issue for it.

When upgrading to 1.3 I mostly left the config intact, except for adding two new inputs:

[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]

Locale output:

root@heimdall:~# locale
LANG=en_US.utf8
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

Telegraf environment variables, Ubuntu 16.04 LTS:

root@heimdall:~# cat /proc/22480/environ
LANG=en_US.utf8PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binHOME=/etc/telegrafLOGNAME=telegrafUSER=telegrafSHELL=/bin/false

Debian 8:

LANG=en_US.UTF-8LC_CTYPE=en_US.UTF-8PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binHOME=/etc/telegrafLOGNAME=telegrafUSER=telegrafSHELL=/bin/false

Here are the full nginx configs:
nginx.conf:

#Ansible managed, do not touch
user nginx  nginx;

worker_processes 4;

pid /var/run/nginx.pid;

worker_rlimit_nofile 1024;

include /etc/nginx/modules-enabled/*.conf;


events {
        worker_connections 512;
        use epoll;
        multi_accept on;
}


http {
        include /etc/nginx/mime.types;
        default_type application/octet-stream;
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log error;
        server_tokens off;
        types_hash_max_size 2048;

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}

Corresponding vhost:

#Ansible managed, do not touch
server {
   listen snip:80;
   listen snip:443 ssl http2;
   listen [snip]:443 ssl http2;
   listen [snip]:80;
   server_name snip;
   ssl_certificate /etc/nginx/ssl/snip/fullchain.pem;
   ssl_certificate_key /etc/nginx/ssl/snip/privkey.pem;
   ssl_dhparam /etc/nginx/ssl/dhparam.pem;
   client_max_body_size 10M;
   if ($scheme = http) {
       return 301 https://$server_name$request_uri;

   }
   location /.well-known/acme-challenge/ {
       alias /var/www/letsencrypt/;

   }
   location / {
       proxy_pass http://localhost:3000;
       proxy_set_header Host $host;
       proxy_set_header X-Real-IP $remote_addr;
       proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
       proxy_set_header X-Forwarded-Proto $scheme;

   }
}

Nginx also auto-includes the following settings (in the http {} context).
Gzip:

#Ansible managed, do not touch
gzip on;
gzip_comp_level 5;
gzip_min_length 256;
gzip_http_version 1.1;
gzip_buffers 16 8k;
gzip_proxied any;
gzip_vary on;
gzip_disable "MSIE [1-6]\.(?!.*SV1)";
gzip_types application/atom+xml application/javascript application/json application/rss+xml application/vnd.ms-fontobject application/x-font-ttf application/x-javascript application/x-web-app-manifest+json application/xhtml+xml application/xml font/opentype image/svg+xml image/x-icon text/css text/plain text/x-component text/javascript;

Open file cache:

#Ansible managed, do not touch
open_file_cache max=200000 inactive=300s;
open_file_cache_valid 300s;
open_file_cache_min_uses 2;
open_file_cache_errors on;

SSL settings:

#Ansible managed, do not touch
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_prefer_server_ciphers on;
ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH";
ssl_ecdh_curve secp384r1;
ssl_session_cache shared:SSL:10m;
ssl_session_tickets off;
ssl_stapling on;
ssl_stapling_verify on;
add_header Strict-Transport-Security "max-age=63072000";

@rlex
Author

rlex commented May 30, 2017

Still testing.
I created a second influxdb output using the same config I posted in the previous comment, and started getting broken metrics again.
For example, one measurement from an agent with hostname "heimdall.srv.example.org" was:

heimdalerface=eth0

In another case, there was a measurement from the rabbitmq plugin (rabbit name rabbit@php-4):

bit@php-4
bit@php-4,auto_delete=false,durable=true (related series)

There was also a measurement named

 ������vmstat

And entries like the measurement =SPU with the series =SPU,type=Spurious\ interrupts

What's interesting is that the test database I created started getting metrics which should not be in that DB (the test database is p3, and I see broken metrics from hosts that should be in database p1).
I tried disabling all my tweaks (ssl, gzip, open file cache) with no luck.

UPD: looks like I have found the issue. It was http/2. Still testing.

@rlex
Author

rlex commented May 30, 2017

Two hours with http/2 disabled and no broken metrics. No other parameters were changed. My bet is on http/2.

@danielnelson
Contributor

So the workaround was changing listen [snip]:443 ssl http2; to listen [snip]:443 ssl;?

@rlex
Author

rlex commented May 30, 2017

Yes. But I would like to have it working both ways, since it's not possible to disable http2 for just one vhost: at least on nginx 1.13.0, if I enable http2 on one vhost it is automatically enabled for all other vhosts on the same IP.

@danielnelson
Contributor

Yeah, this definitely needs to be fixed.

@danielnelson
Contributor

Could be related to #2862

@danielnelson
Contributor

Unfortunately, I don't think that fix will help, so I'm going to reopen.

@danielnelson
Contributor

@gavinzhou are you also using http2?

@danielnelson
Contributor

I tried to duplicate this again with http2 but still no luck. Here is my nginx configuration:

[+]dbn@debian-stretch:~$ cat /etc/nginx/nginx.conf 
user www-data;
worker_processes 4;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
        worker_connections 768;
        use epoll;
        multi_accept on;
}

http {
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        types_hash_max_size 2048;

        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
        ssl_prefer_server_ciphers on;

        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log;

        gzip on;
        gzip_disable "msie6";

        gzip_vary on;
        gzip_proxied any;
        gzip_comp_level 6;
        gzip_buffers 16 8k;
        gzip_http_version 1.1;
        gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}
[+]dbn@debian-stretch:~$ cat /etc/nginx/sites-enabled/default 
upstream backend {
        server 127.0.0.1:8086 fail_timeout=0;
}

server {
        #listen 80 default_server;
        #listen [::]:80 default_server;

        listen 443 ssl http2;

        ssl_certificate /etc/nginx/cert.pem;
        ssl_certificate_key /etc/nginx/key.pem;
        root /var/www/html;

        server_name _;

        location /status {
                stub_status on;
        }

        location /{
                proxy_set_header Host $http_host;
                proxy_redirect off;
                proxy_pass http://backend;
        }
}

I also added a query string to the output as hinted by @gavinzhou:

[[outputs.influxdb]]
  urls = ["https://debian-stretch?test=testtest"]
  database = "http2"
  username = "dbn"
  password = "dbn"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  insecure_skip_verify = true

My inputs are:

2017-06-06T18:30:43Z I! Loaded inputs: inputs.cpu inputs.disk inputs.diskio inputs.internal inputs.interrupts inputs.linux_sysctl_fs inputs.netstat inputs.nstat inputs.system

Versions:

ii  nginx                                 1.10.3-1                    all
ii  influxdb                              1.2.4-1                     amd64

No sign of bad metrics. How long does it usually take for corrupt metrics to appear?

@rlex
Author

rlex commented Jun 6, 2017

Around 10-30 minutes. Maybe multiple clients cause it?

@rlex
Author

rlex commented Jun 6, 2017

(I'm actually pretty sure it's caused by multiple clients: I have several databases for different projects, and I saw broken metrics from, let's say, db "staging" in db "testing" and vice versa.)

@danielnelson
Contributor

Okay, I only ran it for about 10 minutes. I'll try to leave it running longer and with multiple clients.

@rlex
Author

rlex commented Jun 6, 2017

I can probably forward some of my personal stuff to your server if that helps.

@danielnelson
Contributor

I'll let you know if I need it. About how many Telegraf instances are you running?

@danielnelson
Contributor

danielnelson commented Aug 24, 2017

It could be interesting to see if this problem remains with Go 1.9, though I have no indication that it would make a difference.

@rlex
Author

rlex commented Aug 25, 2017

The thing that caught my eye:

HTTP/2 now uses the priority write scheduler by default. Frames are scheduled by following HTTP/2 priorities as described in RFC 7540 Section 5.3.

@rlex
Author

rlex commented Aug 25, 2017

@danielnelson will 1.4.0 be compiled with go1.9?

@danielnelson
Contributor

The official packages will be built with go1.8.

@sebastianarena

Hi. Here's my take on this, coming from a previously reported issue, #3157.
I tested for several days, changing lots of things.
At the end of the day, removing nginx completely solved all my problems.
There must be some buffering occurring on nginx's side that causes tons of problems for influx.
Removing HTTP2 helped a lot, but after a few days invalid measurements would reappear.
It's been 48 hours now writing directly to influx, with 0 invalid measurements.
Hope it helps!

@danielnelson
Contributor

I'm looking through the code again, and I think it may be possible for the Content-Length header to be incorrect. Can someone, if it has not already been done, test with the content_encoding = "gzip" option? This would disable the Content-Length header and use chunked transfer encoding.
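
For reference, a minimal sketch of that test configuration, reusing the output section from the original report (the URL, database, and credentials are placeholders from that config):

[[outputs.influxdb]]
  urls = ["https://host:8086"]
  database = "telegraf_metrics"
  username = "telegraf_metrics"
  password = "pass"
  timeout = "5s"
  # Compress the request body; the write is then sent with chunked
  # transfer encoding instead of a Content-Length header.
  content_encoding = "gzip"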

@danielnelson
Contributor

@rlex Nightly builds are now built with Go 1.9, do you think you could also test with content_encoding = "gzip"?

@danielnelson
Contributor

@scanterog Has been testing with content_encoding = "gzip" for about a day without corruption, so I think it is the Content-Length. I will get it fixed.

@scanterog

Hi guys. It has been 5 days for me now and everything looks OK. No errors anymore. So I can say content_encoding = "gzip" actually fixes it.

@danielnelson
Contributor

Great news! I'll work on fixing the Content-Length header which I think will also allow it to work when gzip is not enabled.

@danielnelson danielnelson changed the title Nginx reverse proxy for influxdb does not work with telegraf anymore Metric corruption when using http2 nginx reverse proxy with influxdb output Oct 5, 2017
@danielnelson danielnelson added this to the 1.4.2 milestone Oct 5, 2017
@LeJav

LeJav commented Nov 17, 2017

I have exactly the same problem with telegraf 1.4.4:

# telegraf --version
Telegraf v1.4.4 (git: release-1.4 ddcb93188f3fb5393d811cdd742bfe6ec799eba9)

I installed telegraf from the official release at:
https://dl.influxdata.com/telegraf/releases/telegraf_1.4.4-1_amd64.deb

@danielnelson danielnelson reopened this Nov 17, 2017
@bobmshannon
Contributor

bobmshannon commented Nov 21, 2017

@LeJav Are you also using nginx as an HTTP/2 reverse proxy? Is telegraf's content_encoding set to gzip?

The reason I ask is that I wonder whether this is related to this nginx issue, which describes a bug where the client request body may be corrupted when streaming data using HTTP/2.

If the answer to the above is yes, it would be interesting to try the following to see if it helps at all:

  1. Disable proxy_request_buffering (a minimal sketch follows this list).
  2. Upgrade nginx to 1.13.6, which includes a fix for this issue.
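
For the first item, a minimal sketch of the directive in the proxy location block, modeled on the location blocks posted earlier in this thread (the upstream name is a placeholder):

location / {
        proxy_set_header Host $http_host;
        proxy_redirect off;
        # hand the request body to the upstream immediately instead of buffering it first
        proxy_request_buffering off;
        proxy_pass http://backend;
}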

@LeJav

LeJav commented Nov 21, 2017

@bobmshannon

Many thanks for your help.
I think that you are right and the problem seems related to that nginx bug.

For the nginx reverse proxy, I have not specified proxy_request_buffering, so it is on by default (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_request_buffering).

I have had content_encoding set to gzip in the telegraf config for more than 2 days and have had no problems since.

I am using nginx 1.10.0

Is it worth trying proxy_request_buffering off and removing content_encoding from the telegraf config?

@danielnelson
Contributor

Is it worth trying proxy_request_buffering off and removing content_encoding from the telegraf config?

@LeJav If you can do this it would be very much appreciated.

@LeJav

LeJav commented Nov 23, 2017

@danielnelson: I will try it next weekend.

@danielnelson danielnelson removed their assignment Nov 29, 2017
@LeJav

LeJav commented Dec 9, 2017

Sorry for the long delay.
I have:
  - set proxy_request_buffering to off
  - removed content_encoding = "gzip"
nginx is 1.10.0.

I will check within a couple of days whether it is OK or whether new corrupted measurements have been created.

@LeJav

LeJav commented Dec 10, 2017

16 hours later, I have several corrupted measurements:
ăÇ�����face=eth0
ăÇ�����az-telegraf-net-int
ăÇ�����face=eth0
ăÇ���¾az-telegraf-net-int
ăÇ���¾face=eth0
ăÇ�����az-telegraf-net-int
ăÇ�����face=eth0
...

So this does not work.
I will restore content_encoding = "gzip", which works fine.

@danielnelson
Contributor

The InfluxDB output has been rewritten for 1.6.0, as well as the metric serializer. It would be very much appreciated if someone could retest this issue.

@danielnelson
Contributor

Please let me know if anyone runs into this issue in 1.6 or later.

@LeJav

LeJav commented May 30, 2018

Hi Daniel,
Thanks for your effort.
I am sorry, but I am very busy at the moment and have no free time to test it...

@danielnelson
Contributor

@LeJav Understandable, thanks for all the help so far
