-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(outputs.kafka): Add config option to set producer message timestamp #15675
Comments
Hi, What "data format" are you using?
This is because the kafka output does not set timestamps. It is entirely up to your data format's serializer that you are using to set the timestamps on messages. This is why asking what data format is in use was the first question in the old issue. The old issue was also requesting setting the timestamp in the Kafka message itself, not in the data. Your request is based on incorrect or bad timestamps from your data, which would be handled in your data format/serializer. I am hesitate to entertain this because this seems like a very bad footgun as well. If you are not setting your timestamp on data when it is sent, this can lead to situations where telegraf sends lots of data with no timestamps how can you determine when it was actually created?! You are now dependent on data consistently flowing between telegraf and kafka with no delay. If you do get a delay you could get a backlog of data that is sent to kafka all at once, with no timestamp to differentiate between the values. |
I think there's a misunderstanding here.
When I say that kafka output plugin sets message's timestamp I mean this block of code https://github.com/influxdata/telegraf/blob/master/plugins/outputs/kafka/kafka.go#L210-L212. It doesn't matter what the format of data you're using, the kafka message format stays the same (to simplify it would be key (empty in our case), value (the body of the message containing telegraf's data), timestamp (which is set by the output plugin currently based on the metric's timestamp)).
Well not necessarily. There could be a case, that you're storing your metrics with their original timestamps (that come from end devices in form of "time" field for example). And those timestamps have different timezones. So some metrics might have their timestamps be "in the future" relative to others. If I understand you correctly you're proposing to validate timestamps before the output step, but they're not necessarily invalid timestamps (though in case of out-of-sync clock on end-user's device they are).
I probably explained it poorly. I'm not talking about changing telegraf's data timestamps. I only propose an ability to change kafka message's timestamp (which is different from telegraf's data timestamp in general) source. Again it's kafka's timestamp which gets set to metric's timestamp by the output plugin. This behavior has its own use cases, but also has a possibility to break kafka's retention policy in the case described above.
Kafka mainly uses this timestamps to implement its retention policies. Second use case is searching messages by timestamps. There's no problem if some messages have the same timestamps. The only "requirement" to the timestamps is that they're "in order". I.e. next time stamp is greater or equal to previous timestamp. Which might not always be the case for metrics from different devices. I'll try to explain in more details, maybe this will make it clearer. In the library used by kafka output plugin, timestamp by default (i.e. no timestamp explicitly set for message) is set to producer's host time. In the issue I propose to add the option that allows to switch to this behavior instead of current kafka output behavior. Alternatively we can leave output's behavior as is, but add a note to documentation, saying that metrics' timestamps need to be ordered non-decreasingly in case kafka's |
yes this was not what I took away from the first message ;) I think #15689 covers what you want? If so, can you download artifacts in 20-30mins and try it out to confirm? |
Yes #15689 is exactly what I wanted, thank you! Won't be able to try it out with this changes today. But I already tested similar changes with my local build (minus config option), so I think it's good to merge. |
Use Case
Imagine a following setup: telegraf consumes metrics with http_listener_v2 input plugin, processes them and outputs to kafka with kafka output plugin.
The option mentioned in the issue title would be useful when telegraf is collecting metrics via push method (i.e. http_listener_v2 input) directly from end-user devices (e.g. mobile phones). In this case one of the devices might have out of sync clock (e.g. dates would be in the future). In case kafka broker is setup to use CREATE_TIME timestamps, the current behavior of kafka output (i.e. setting metric timestamp as a kafka message timestamp as well) would break retention policy as described in this medium article.
There are a few ways to deal with that (including changing broker/topic settings to use LOG_APPEND_TIME), but it would also be nice to have an option in producer config to change the producer behavior to default sarama behavior (which is, I believe, to use host timestamp).
Judging by the discussion in #6696 there was a proposal to add an option, but there was no clear benefits to it, so it wasn't added. Now that I've faced a problem with broken kafka retention caused by metrics timestamps I think there is a benefit in making an option to opt-out of this behavior.
Not sure yet, what would be a good name for this option, but as I see it, this option can be bool and
false
by default to preserve the current behavior.Expected behavior
kafka output users can configure if metric's timestamp is used as a kafka message's timestamp.
Actual behavior
kafka output users can't configure what timestamp is used.
Additional info
No response
The text was updated successfully, but these errors were encountered: