Telegraf stops consuming from partition on GetOffset error, does not try again (affects entire consumer group) #3553
Description
Telegraf version is telegraf-1.4.5-1.x86_64, running on kernel 3.10.0-327.36.3.el7.x86_64. Kafka is 1.0.0, built for Scala 2.12.
Everything runs fine, until:
Dec 07 18:15:53 telecons02 telegraf[2820]: 2017-12-07T17:15:53Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/26: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
After this, the specified partition is not consumed anymore and messages pile up.
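To make the expected behavior concrete, here is a minimal sketch of the kind of recovery I would have hoped for: when the broker reports that the requested offset is out of range, fall back to the nearest offset that is still retained instead of abandoning the partition. This is not Telegraf's or the Kafka client library's actual code. `partitionLog`, `fetch`, `resolveOffset` and `ErrOffsetOutOfRange` are hypothetical names, and the partition is modelled in memory only.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOffsetOutOfRange is a hypothetical stand-in for the broker error
// "The requested offset is outside the range of offsets maintained by the server".
var ErrOffsetOutOfRange = errors.New("offset out of range")

// partitionLog is a hypothetical in-memory model of one partition:
// offsets in [oldest, newest) are retained, newest is the next offset to be written.
type partitionLog struct {
	oldest, newest int64
}

// fetch simulates the broker-side range check for a fetch request.
func (p partitionLog) fetch(offset int64) error {
	if offset < p.oldest || offset > p.newest {
		return ErrOffsetOutOfRange
	}
	return nil
}

// resolveOffset returns the offset to resume from. If the committed offset
// is no longer valid, it is clamped to the retained range rather than
// leaving the partition unconsumed.
func resolveOffset(p partitionLog, committed int64) (int64, error) {
	err := p.fetch(committed)
	if err == nil {
		return committed, nil
	}
	if errors.Is(err, ErrOffsetOutOfRange) {
		if committed < p.oldest {
			return p.oldest, nil // data was deleted by retention: resume at the oldest offset
		}
		return p.newest, nil // committed offset is ahead of the log: resume at the newest offset
	}
	return 0, err // any other error is passed up unchanged
}

func main() {
	p := partitionLog{oldest: 100, newest: 250}
	for _, committed := range []int64{42, 180, 900} {
		resume, err := resolveOffset(p, committed)
		fmt.Printf("committed=%d resume=%d err=%v\n", committed, resume, err)
	}
}
```

Whether to resume at the oldest or the newest offset is a policy choice (possible duplicates versus possible gaps). The point is only that the partition keeps being consumed after the error.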
This is fixable by restarting Telegraf.
Interestingly, I have a few Telegraf instances consuming Kafka messages, and all of them hit this issue on a few partitions (the partitions are random, so I cannot narrow it down to a single one). When I restart one Telegraf instance, the entire consumer group goes back to normal and messages flow again, even on partitions served by the other instances that were stuck.
Pls help ;(