telegraf panic in Exec plugin #1199

Closed
djahandarie opened this issue May 15, 2016 · 8 comments
Labels
bug: unexpected problem or unintended behavior
panic: issue that results in panics from Telegraf

Comments

djahandarie commented May 15, 2016

On telegraf 0.13.0.

I installed telegraf on this box earlier today, and about 30 minutes later, I got this in telegraf.log:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x8 pc=0x527857]

goroutine 29435 [running]:
panic(0x1073e20, 0xc82000c090)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
bytes.(*Buffer).grow(0x0, 0x2f6374652f3d454d, 0x66617267656c6574)
        /usr/local/go/src/bytes/buffer.go:88 +0x27
created by github.com/influxdata/telegraf/plugins/inputs/exec.(*Exec).Gather
        /root/go/src/github.com/influxdata/telegraf/plugins/inputs/exec/exec.go:157 +0x1dd

The same telegraf configuration+version pair has no issues on a dozen or so other boxes that it runs on.

This seems like a really odd panic. The documentation for Buffer.grow says it can panic with an ErrTooLarge, but this seems to be different. The line in particular is m := b.Len(), and for that to panic with a nil pointer dereference, I guess that means the pointer to the buffer got clobbered (and I think the first parameter being 0x0 confirms that?).

This backtrace does not seem to be providing the full story on how line exec.go:157 leads to a buffer being grown, so I'm mostly at a loss as to what's going on. I'm happy to provide more debug info if needed.

@djahandarie
Author

I've since been led to believe that there is some general memory corruption issue with this box, so it is likely not telegraf's fault. I'm closing this.

(Although, changing telegraf's systemd handler to auto-restart the process might be a nice general extra layer of protection for problems like this.)

@sparrc
Contributor

sparrc commented May 16, 2016

Thanks @djahandarie, but I'm going to leave this open until I have time to investigate that line of code.

@sparrc reopened this on May 16, 2016
@sparrc added the bug label (unexpected problem or unintended behavior) on May 18, 2016
@sparrc
Contributor

sparrc commented May 18, 2016

I see that this is actually panicking in the Go source code, so closing because it's definitely caused by your memory corruption issue.

fwiw, Telegraf is supposed to be restarted on failure (that's what this line is for: https://github.com/influxdata/telegraf/blob/master/scripts/telegraf.service#L11)

@sparrc closed this as completed on May 18, 2016
@djahandarie
Author

djahandarie commented May 22, 2016

@sparrc So, I hit this again on a different box:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x8 pc=0x527857]

goroutine 8853545 [running]:
panic(0x1073e20, 0xc82000c090)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
bytes.(*Buffer).grow(0x0, 0x2f6374652f3d454d, 0x66617267656c6574)
        /usr/local/go/src/bytes/buffer.go:88 +0x27
created by github.com/influxdata/telegraf/plugins/inputs/exec.(*Exec).Gather
        /root/go/src/github.com/influxdata/telegraf/plugins/inputs/exec/exec.go:157 +0x1dd

This new box also happens to be an AWS EC2 box (in a completely different region). No non-EC2 boxes have had this issue so far.

It could just be that this box also has some memory issue. Or there's a Xen bug. Or there's a Go bug. Or there's a telegraf bug.

What's interesting is that it's the exact same line of exec.go that led to the panic as last time, with the exact same function parameters. If it were really some generic memory corruption or other bug, why would it only manifest in this one exact spot?

I still don't understand how exec.go:157 leads to a buffer being grown, do you have any thoughts there?
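One detail worth checking in the two identical traces: the second and third words printed for `grow`, `0x2f6374652f3d454d` and `0x66617267656c6574`, look like ASCII text rather than integers or pointers. The sketch below decodes them, assuming a little-endian x86-64 host (the thread does not state the architecture) and keeping in mind that the words a Go traceback prints are raw stack contents that may be stale or unrelated to the real arguments. `wordToASCII` is a hypothetical helper name.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wordToASCII (hypothetical helper) lays a 64-bit word out in
// little-endian byte order, which is the in-memory order on x86-64
// (an assumption here), and returns those bytes as a string.
func wordToASCII(w uint64) string {
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], w)
	return string(b[:])
}

func main() {
	// The two words printed after the 0x0 receiver in both traces.
	first := wordToASCII(0x2f6374652f3d454d)
	second := wordToASCII(0x66617267656c6574)
	fmt.Printf("%q\n", first+second) // prints "ME=/etc/telegraf"
}
```

The decoded bytes read like the tail of an environment entry ending in `ME=/etc/telegraf`, which might mean string data was sitting where the `grow` arguments were expected. That is a guess from the bytes alone, not a confirmed diagnosis.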

@sparrc
Contributor

sparrc commented May 22, 2016

I don't quite understand how this would happen either. I'm assuming you are running version 0.13?

re-opening for further investigation...

@sparrc reopened this on May 22, 2016
@djahandarie
Author

Yep, 0.13.

Here's another version of it:

unexpected fault address 0x0
fatal error: fault
[signal 0xb code=0x80 addr=0x0 pc=0x527857]

goroutine 532568 [running]:
runtime.throw(0x1228200, 0x5)
        /usr/local/go/src/runtime/panic.go:547 +0x90 fp=0xc8203ca9f0 sp=0xc8203ca9d8
runtime.sigpanic()
        /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x2ab fp=0xc8203caa40 sp=0xc8203ca9f0
bytes.(*Buffer).grow(0x7eb37ba7ec7f0e58, 0xd4c9046b25b54c89, 0x1c5190dca7153987)
        /usr/local/go/src/bytes/buffer.go:88 +0x27 fp=0xc8203caae8 sp=0xc8203caa40
created by github.com/influxdata/telegraf/plugins/inputs/exec.(*Exec).Gather
        /root/go/src/github.com/influxdata/telegraf/plugins/inputs/exec/exec.go:157 +0x1dd

[a dozen or so other running goroutines omitted]

This time the first parameter of grow is an actual pointer, but it still leads to a panic.

I'm going to build a 0.13 binary with -race and try running that instead to see if a data race is involved.
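For context on what a `-race` build looks for: the race detector reports unsynchronized access to the same memory from multiple goroutines, for example several goroutines writing into one shared `bytes.Buffer`. The sketch below is not Telegraf's code and makes no claim about what exec.go does. It shows a pattern that should stay clean under `go run -race`, where each goroutine owns its buffer and only the finished strings are collected under a mutex. `collect` is a hypothetical name.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// collect (hypothetical) runs n workers concurrently. Each worker writes
// into its own bytes.Buffer, so no buffer is shared between goroutines;
// the shared result slice is only touched while holding mu.
func collect(n int) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make([]string, 0, n)
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			var buf bytes.Buffer // private to this goroutine
			fmt.Fprintf(&buf, "worker-%d", id)
			mu.Lock()
			out = append(out, buf.String())
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(len(collect(8)))
}
```

Moving `buf` outside the loop, so that all workers write to the same buffer without locking, is the kind of change the race detector would be expected to flag.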

@zarnovican
Contributor

I have a possibly related issue: #1424.

@sparrc added the panic label (issue that results in panics from Telegraf) on Oct 6, 2016
@danielnelson
Contributor

@djahandarie Many of the issues we have had in this bit of code have been fixed by recent Go versions. Can you let me know if you still see these issues with the latest release? I'm pretty confident it is fixed, so I'm going to close this issue, but we can reopen it if it is still occurring.
