Skip to content

Restoring buffer metadata crashes worker with empty buffer and/or corrupt metadata #1760

Closed
@pdecat

Description

@pdecat

A bit of context first: I'm deploying fluentd on kubernetes using the fluent/fluentd:v0.14.23 docker image as three pods managed by ReplicationControllers with PersistentVolumes for storing buffers (cannot use a StatefulSet here but that's another story).

Since I have upgraded from v0.14.22 to v0.14.23, only two of those pods are running fine.
The third one is having its worker to crash loop as soon as it starts, apparently when it tries to read its buffers metadata:

I  2017-11-23 12:15:59 +0000 [info]: #0 starting fluentd worker pid=657 ppid=5 worker=0
I  2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `[]'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `restore_metadata'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:322:in `load_existing_staged_chunk'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:51:in `initialize'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `new'
|    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `block in resume'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `glob'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `resume'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer.rb:171:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:120:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/output.rb:415:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:165:in `block in start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:154:in `block (2 levels) in lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `each'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `block in lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `each'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:164:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:274:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:219:in `run'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:774:in `run_engine'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:523:in `block in run_worker'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:699:in `main_process'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:518:in `run_worker'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/bin/fluentd:5:in `<top (required)>'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `load'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `<main>'
I  2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I    2017-11-23 12:15:59 +0000 [error]: #0 suppressed same stacktrace
I  2017-11-23 12:15:59 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished unexpectedly with status 1 

This happens at https://github.com/fluent/fluentd/blob/v0.14.23/lib/fluent/plugin/buffer/file_chunk.rb#L219

The buffers of this pod look corrupt:

# ls -latr /var/log/pods/                                 
total 168                                     
drwxrwS---    2 root     fluent       16384 Nov 15 11:46 lost+found                          
drwxrwsr-x    3 root     fluent      151552 Nov 16 14:40 .                                   
-rw-rw-r--    1 fluent   fluent          75 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta                                                                                
-rw-rw-r--    1 fluent   fluent           0 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log                                                                                     
-rw-rw-r--    1 fluent   fluent          76 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta                                                                                
-rw-rw-r--    1 fluent   fluent           0 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log                                                                                     
drwxr-xr-x    1 root     root          4096 Nov 23 12:12 ..                                  
# hexdump /var/log/pods/buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta                                                                              
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040
hexdump /var/log/pods/buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040

I've deleted those buffers to workaround this issue for now and the pod is back to normal.

Not sure the upgrade is the culprit, perhaps it was just the kill that happened at the wrong time.

Maybe the restore_metadata could add checks to avoid this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions