Skip to content

crashes cause bad chunks that fail to send #1638

Closed
@notslang

Description

@notslang

I had a power failure and upon restarting the server, this happened:

2017-07-18 18:05:27 +0000 [info]: adding match pattern="**" type="null"
2017-07-18 18:05:27 +0000 [info]: adding source type="forward"
2017-07-18 18:05:27 +0000 [info]: #0 starting fluentd worker pid=20 ppid=16 worker=0
2017-07-18 18:05:31 +0000 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-07-18 18:05:31 +0000 [info]: #0 [input1] listening port port=24224 bind="0.0.0.0"
2017-07-18 18:05:31 +0000 [info]: #0 fluentd worker is now running worker=0
2017-07-18 22:25:08 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `dup'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `send_data_actual'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:631:in `send_data'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `block in try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:304:in `block in select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `times'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1034:in `try_flush'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1268:in `flush_thread_run'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:420:in `block (2 levels) in start'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-07-18 22:25:09 +0000 [warn]: #0 retry succeeded. chunk_id="5549b016b0815316776e7f21d3a793d8"
2017-07-18 22:25:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:10 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:10 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2017-07-18 22:25:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2017-07-18 22:25:13 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:13 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-07-18 22:25:17 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:17 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-07-18 22:25:25 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:25 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:41 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-07-18 22:25:41 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:41 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:26:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-07-18 22:26:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:26:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:27:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-07-18 22:27:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:27:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:29:15 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-07-18 22:29:15 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:29:15 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:33:24 +0000 [warn]: #0 failed to flush the buffer. retry_time=9 next_retry_seconds=2017-07-18 22:33:24 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:33:24 +0000 [warn]: #0 suppressed same stacktrace

Given the fact that this hasn't happened in the last month, and the chunk that it's referencing buffer.q5541970ab3b8c59d8f012cd3222cc93c.log was created right around the time the power failed, I think that it is somehow corrupted by an incomplete write. This is a problem because the error message makes no sense and crashes shouldn't cause corruption / data loss for things that are already written to the disk.

It doesn't even start sending the other chunks that aren't corrupted, I need to shut down fluentd, remove the bad chunk, and restart fluentd to get it to continue sending.

I can send you the bad chunk and meta file if you want.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions