Description
A bit of context first: I'm deploying fluentd on kubernetes using the fluent/fluentd:v0.14.23
docker image as three pods managed by ReplicationControllers with PersistentVolumes for storing buffers (cannot use a StatefulSet here but that's another story).
Since I have upgraded from v0.14.22 to v0.14.23, only two of those pods are running fine.
The third one is having its worker to crash loop as soon as it starts, apparently when it tries to read its buffers metadata:
I 2017-11-23 12:15:59 +0000 [info]: #0 starting fluentd worker pid=657 ppid=5 worker=0
I 2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `[]'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `restore_metadata'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:322:in `load_existing_staged_chunk'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:51:in `initialize'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `new'
| 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `block in resume'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `glob'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `resume'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer.rb:171:in `start'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:120:in `start'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/output.rb:415:in `start'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:165:in `block in start'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:154:in `block (2 levels) in lifecycle'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `each'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `block in lifecycle'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `each'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `lifecycle'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:164:in `start'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:274:in `start'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:219:in `run'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:774:in `run_engine'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:523:in `block in run_worker'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:699:in `main_process'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:518:in `run_worker'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/bin/fluentd:5:in `<top (required)>'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `load'
I 2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `<main>'
I 2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I 2017-11-23 12:15:59 +0000 [error]: #0 suppressed same stacktrace
I 2017-11-23 12:15:59 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished unexpectedly with status 1
This happens at https://github.com/fluent/fluentd/blob/v0.14.23/lib/fluent/plugin/buffer/file_chunk.rb#L219
The buffers of this pod look corrupt:
# ls -latr /var/log/pods/
total 168
drwxrwS--- 2 root fluent 16384 Nov 15 11:46 lost+found
drwxrwsr-x 3 root fluent 151552 Nov 16 14:40 .
-rw-rw-r-- 1 fluent fluent 75 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta
-rw-rw-r-- 1 fluent fluent 0 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log
-rw-rw-r-- 1 fluent fluent 76 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta
-rw-rw-r-- 1 fluent fluent 0 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log
drwxr-xr-x 1 root root 4096 Nov 23 12:12 ..
# hexdump /var/log/pods/buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040
hexdump /var/log/pods/buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040
I've deleted those buffers to workaround this issue for now and the pod is back to normal.
Not sure the upgrade is the culprit, perhaps it was just the kill that happened at the wrong time.
Maybe the restore_metadata
could add checks to avoid this issue.