
Add caching for the topics metadata #75

Merged 4 commits into danielqsj:master on Jan 31, 2019

Conversation

jorgelbg
Contributor

This implements the feature mentioned in #74, adding a configurable caching interval for the topics metadata. Only the metadata is cached; the offset/lag is requested/calculated every time, so this should not affect the metrics.

Depending on the configured interval (30s by default), it could be the case that a new topic is added and is not picked up immediately.
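
To illustrate the idea, here is a minimal, hypothetical sketch (the names and structure are illustrative, not the PR's actual code) of refreshing the sarama metadata only when a caching interval has elapsed, while offsets/lag would still be read on every scrape:

```go
package exporter

import (
	"sync"
	"time"

	"github.com/Shopify/sarama"
)

// metadataCache refreshes the sarama client metadata at most once per interval.
type metadataCache struct {
	client      sarama.Client
	interval    time.Duration // e.g. 30 * time.Second by default
	mu          sync.Mutex
	lastRefresh time.Time
}

// topics returns the topic list known to the client, triggering a metadata
// refresh only when the caching interval has passed.
func (c *metadataCache) topics() ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastRefresh) >= c.interval {
		if err := c.client.RefreshMetadata(); err != nil {
			return nil, err
		}
		c.lastRefresh = time.Now()
	}
	// Topics() answers from the locally cached metadata, so a topic created
	// after the last refresh may not show up until the next one.
	return c.client.Topics()
}
```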

@jorgelbg changed the title from "Metadata cache" to "Add caching for the topics metadata" on Nov 14, 2018
@jorgelbg
Contributor Author

Sorry for the big PR. I upgraded sarama to v1.19.0 because the client metadata was being cleared even when a larger RefreshFrequency was set.

Setting the RefreshFrequency (now that it works) could render most of the other code unnecessary, since the client already keeps an internal timer and we would be counting the time twice. I'll update the PR with the change.

@jorgelbg
Contributor Author

Setting config.Metadata.RefreshFrequency could avoid the explicit call to RefreshMetadata entirely, but we would need to remove that call from Collect, because e.client.RefreshMetadata() does not check the RefreshFrequency and executes the request every time.
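
For reference, a hedged sketch of the alternative discussed above (assuming the sarama v1.19.0 import path and defaults; the helper name is hypothetical): let sarama refresh the metadata in the background via config.Metadata.RefreshFrequency instead of calling RefreshMetadata from Collect, since the explicit call always hits the brokers:

```go
package exporter

import (
	"time"

	"github.com/Shopify/sarama"
)

// newClient builds a sarama client whose metadata is refreshed in the
// background every `refresh` interval (sarama's own default is 10 minutes).
func newClient(brokers []string, refresh time.Duration) (sarama.Client, error) {
	cfg := sarama.NewConfig()
	cfg.Metadata.RefreshFrequency = refresh
	return sarama.NewClient(brokers, cfg)
}
```

With this approach, Collect would read client.Topics() directly and skip the explicit e.client.RefreshMetadata() call.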

@bojleros

bojleros commented Jan 31, 2019

@jorgelbg do you think this can fix the problem of having gaps in the data series? I am now testing kafka-exporter in our staging environment and this happens all over again. What I have found is that these gaps occur in similar timeframes to when the metadata gets refreshed (note that the i/o timeouts also happen occasionally):

[sarama] 2019/01/31 12:05:26 Failed to connect to broker kafka-sea-1:9092: dial tcp 172.20.x.x:9092: i/o timeout
[sarama] 2019/01/31 12:05:26 Failed to connect to broker kafka-sea-2:9092: dial tcp 172.20.x.x:9092: i/o timeout
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [xxx] from broker kafka-sea-1:9092
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [yyy] from broker kafka-sea-1:9092
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [zzz] from broker kafka-sea-1:9092
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [aaa] from broker kafka-sea-1:9092

[image: graph showing gaps in the metric series coinciding with the metadata refreshes]

For me this is a major issue that prevents me from using this exporter in our alerting chain. I've posted similar comments before but have seen no activity for a long time...

@danielqsj
Owner

Looks good to me. Sorry for the delayed response.

@danielqsj
Owner

Thanks @jorgelbg
@bojleros can you try it and see whether your issues are fixed?

@danielqsj merged commit 56c1d47 into danielqsj:master on Jan 31, 2019
@jorgelbg
Contributor Author

@danielqsj don't worry, thanks for taking the time!

@bojleros this could help (I think) because it will reduce the additional time that refreshing the metadata causes, especially if you have a big cluster.

Increasing the scrape interval in Prometheus could also help. Reading the lag/offset data and sending the response back can also be time-consuming if you have a lot of topics.

@bojleros

@danielqsj @jorgelbg

I am going to start my tests tomorrow morning.

Our clusters are rather small, with tightly monitored performance. I am sure we have enough resources and are not hitting the fd limit. We also have our own tool (https://github.com/msales/kage) running next to the kafka_exporters. That is how these gaps held us back from using kafka_exporter in production. I am not a kage developer, but maybe you will be interested in comparing the code of both projects in search of possible improvements.

Apart from that, I saw that kafka_exporter is producing new values every minute. Is it feasible to have it refresh values every 15 seconds in order to fit our default scrape interval?

Kind Regards,
