kafka_consumer input plugin : stop gathering log after a timeout #805

Closed
prune998 opened this issue Mar 7, 2016 · 4 comments · Fixed by #956
Labels
bug (unexpected problem or unintended behavior)

Comments

prune998 (Contributor) commented Mar 7, 2016

I have a Telegraf instance that gathers logs from three different Kafka clusters using the kafka_consumer plugin.
Here is the config:

[[inputs.kafka_consumer]]
  topics = ["graphite"]
  zookeeper_peers = ["cherry:2181","cocobolo:2181","maple:2181","iocaste:2181","aitne:2181","telesto:2181"]
  zookeeper_chroot = "/kafka"
  consumer_group = "telegraf_metrics_consumers"
  offset = "oldest"
  data_format = "influx"

[[inputs.kafka_consumer]]
  topics = ["graphite"]
  zookeeper_peers = ["service-3:2181"]
  zookeeper_chroot = "/kafka"
  consumer_group = "telegraf_metrics_consumers"
  offset = "oldest"
  data_format = "influx"

[[inputs.kafka_consumer]]
  topics = ["graphite"]
  zookeeper_peers = ["service-0:2181","service-1:2181","service-2:2181"]
  zookeeper_chroot = "/kafka"
  consumer_group = "telegraf_metrics_consumers"
  offset = "oldest"
  data_format = "influx"

After some time (about a day) I see this in the logs:

2016/03/04 14:02:50 Connected to 10.2.3.4:2181
2016/03/04 14:02:50 Authenticated: id=383558105974775727, timeout=4000
2016/03/04 14:02:50 Started the kafka consumer service, peers: [cherry:2181 cocobolo:2181 maple:2181 iocaste:2181 aitne:2181 telesto:2181], topics: [graphite]
2016/03/04 14:02:50 Connected to 34.34.56.78:2181
2016/03/04 14:02:50 Authenticated: id=95356658738731814, timeout=4000
2016/03/04 14:02:50 Started the kafka consumer service, peers: [service-3:2181], topics: [graphite]
2016/03/04 14:02:50 Connected to 12.34.56.78:2181
2016/03/04 14:02:51 Authenticated: id=239441231443580040, timeout=4000
2016/03/04 14:02:54 Started the kafka consumer service, peers: [service-0:2181 service-1:2181 service-2:2181], topics: [graphite]

2016/03/04 18:48:28 Recv loop terminated: err=read tcp 10.235.10.103:28265->12.34.56.78:2181: i/o timeout
2016/03/04 18:48:28 Send loop terminated: err=<nil>
2016/03/04 18:48:29 Failed to connect to 12.34.56.76:2181: dial tcp 12.34.56.76:2181: i/o timeout
2016/03/04 18:48:30 Failed to connect to 12.34.56.77:2181: dial tcp 12.34.56.77:2181: i/o timeout
2016/03/04 18:48:32 Failed to connect to 12.34.56.78:2181: dial tcp 12.34.56.78:2181: i/o timeout
2016/03/04 18:48:33 Failed to connect to 12.34.56.79:2181: dial tcp 12.34.56.78:2181: i/o timeout
2016/03/04 18:48:34 Connected to 12.34.56.79:2181
2016/03/04 18:48:34 Authenticated: id=239441231443580040, timeout=4000

When this happens, Telegraf is still connected to both Zookeeper and Kafka, but no logs are gathered.
Everything is fine again after a restart...
Is there a way to add more debugging to this specific input?
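
Reading the log again, my understanding is that once Zookeeper expires the session, the consumer reconnects but never resumes fetching from Kafka. If that is right, a session expiry has to be treated as fatal for the existing consumer, which then needs to be rebuilt. Below is a minimal Go sketch of the behaviour I would expect; Consumer and newConsumer are hypothetical stand-ins, not the real plugin or library API.

// Hypothetical sketch: a Zookeeper session expiry is fatal for the existing
// consumer, so the consumer is closed and recreated instead of being left
// to reconnect on its own.
package kafkaconsumer

import (
	"log"
	"strings"
	"time"
)

// Consumer is a hypothetical wrapper around a Kafka consumer-group client.
type Consumer interface {
	Messages() <-chan []byte // raw metric lines read from the topic
	Errors() <-chan error    // client errors, including Zookeeper session errors
	Close() error
}

func consumeForever(newConsumer func() (Consumer, error), handle func([]byte)) {
	for {
		c, err := newConsumer()
		if err != nil {
			log.Printf("consumer start failed: %v, retrying", err)
			time.Sleep(5 * time.Second)
			continue
		}
	recv:
		for {
			select {
			case msg := <-c.Messages():
				handle(msg)
			case err := <-c.Errors():
				// "zk: session has been expired by the server" means the old
				// session is gone for good: rebuild the consumer instead of
				// hoping that the reconnect alone brings the fetchers back.
				if err != nil && strings.Contains(err.Error(), "session has been expired") {
					log.Printf("zk session expired, recreating consumer")
					break recv
				}
				log.Printf("consumer error: %v", err)
			}
		}
		c.Close()
	}
}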

sparrc added the bug label Mar 7, 2016
prune998 (Contributor, Author) commented Mar 8, 2016

The issue happened again with my remote servers in Asia.
Logs from the application in debug mode (most of the "Gathered" lines stripped):

2016/03/08 11:52:01 Recv loop terminated: err=read tcp 10.235.10.103:25563->54.6.1.37:2181: i/o timeout
2016/03/08 11:52:01 Send loop terminated: err=<nil>
2016/03/08 11:52:02 Gathered metrics, (10s interval), from 14 inputs in 2.101842141s
2016/03/08 11:52:02 Failed to connect to 54.6.1.254:2181: dial tcp 54.6.1.254:2181: i/o timeout
2016/03/08 11:52:03 Failed to connect to 54.6.1.44:2181: dial tcp 54.6.1.44:2181: i/o timeout
2016/03/08 11:52:05 Failed to connect to 54.6.1.37:2181: dial tcp 54.6.1.37:2181: i/o timeout
2016/03/08 11:52:06 Failed to connect to 54.6.1.254:2181: dial tcp 54.6.1.254:2181: i/o timeout
2016/03/08 11:52:07 Failed to connect to 54.6.1.44:2181: dial tcp 54.6.1.44:2181: i/o timeout
2016/03/08 11:52:09 Wrote 9252 metrics to output influxdb in 1.064282626s
2016/03/08 11:52:09 Failed to connect to 54.6.1.37:2181: dial tcp 54.6.1.37:2181: i/o timeout
2016/03/08 11:52:10 Failed to connect to 54.6.1.254:2181: dial tcp 54.6.1.254:2181: i/o timeout
2016/03/08 11:52:10 Connected to 54.6.1.44:2181
2016/03/08 11:52:10 Authentication failed: zk: session has been expired by the server
2016/03/08 11:52:11 Connected to 54.6.1.37:2181
2016/03/08 11:52:12 Gathered metrics, (10s interval), from 14 inputs in 2.105061493s
2016/03/08 11:52:12 Authenticated: id=23944443693824, timeout=4000

I can see that Telegraf is still connected to some servers:

lsof -n -p 929371|grep 54.64
telegraf 929371 user   13u     IPv4          232202807      0t0       TCP 10.235.10.103:23039->54.6.1.254:9092 (ESTABLISHED)
telegraf 929371 user   15u     IPv4          236828808      0t0       TCP 10.235.10.103:33170->54.6.1.254:9092 (ESTABLISHED)
telegraf 929371 user   16u     IPv4          236793279      0t0       TCP 10.235.10.103:4139->54.6.1.37:9092 (ESTABLISHED)
telegraf 929371 user   18u     IPv4          236755821      0t0       TCP 10.235.10.103:23221->54.6.1.44:9092 (ESTABLISHED)
telegraf 929371 user   20u     IPv4          237194160      0t0       TCP 10.235.10.103:21491->54.6.1.37:2181 (ESTABLISHED)

A tcpdump shows some traffic to 54.6.1.37:2181 (Zookeeper) but nothing on port 9092 (Kafka).
I can't really tell what is in the traffic... maybe some sort of ping/keepalive?
But Telegraf never sends any request to Kafka.
As -debug does not give more clues, does anyone have an idea?

Some will say that this comes from the underlying Kafka/Zookeeper client code (sarama?), but still.
Let me know if I can provide any other information.
Servers are all Linux Ubuntu 14.04.
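
In the meantime, the only workaround I can think of is a stall detector outside the client library: if nothing has been received for too long even though the sockets are still ESTABLISHED, tear the consumer down and recreate it. A rough Go sketch follows; markMessage() and restart() are hypothetical hooks into the plugin, neither of which exists in Telegraf today.

// Hypothetical stall detector, as a workaround outside the client library:
// if no Kafka message has arrived for longer than maxSilence, assume the
// consumer is wedged and rebuild it.
package kafkaconsumer

import (
	"log"
	"sync/atomic"
	"time"
)

var lastMessage atomic.Int64 // unix time of the last message received

// markMessage records that a message just came in; it would be called from
// the plugin's receive loop.
func markMessage() { lastMessage.Store(time.Now().Unix()) }

// watchdog checks once a minute whether the consumer has gone silent.
// restart is a stand-in for closing and recreating the consumer.
func watchdog(maxSilence time.Duration, restart func()) {
	markMessage() // start the clock now so we do not restart immediately
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		idle := time.Since(time.Unix(lastMessage.Load(), 0))
		if idle > maxSilence {
			log.Printf("no kafka messages received for %s, restarting the consumer", idle)
			restart()
			markMessage() // reset so the next tick does not restart again
		}
	}
}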

sparrc (Contributor) commented Mar 29, 2016

This appears to be the same issue as wvanbergen/kafka#87

Hopefully the maintainer of that repo will return to maintaining it and get the change merged; otherwise we're going to need to start looking at alternatives.

prune998 (Contributor, Author):

Possibly...
As you say, if the issue is in Kafka/ZK, we're in trouble, and it means that using Kafka as a broker is not production ready... should we go back to Redis? :(

sparrc (Contributor) commented Mar 30, 2016

That's not an issue with Kafka/ZK; it's an issue with the client library we're using. If wvanbergen is no longer maintaining that client library, then we will need to investigate other options.
