Data Duplication in InfluxDB #5394
Can you paste your Telegraf config for the plugin you are using?
@danielnelson: Please find attached the Telegraf config file. Warm regards
I believe the cause is a little-known, and undocumented, behavior of logparser, or more accurately of our grok parser. If two or more consecutive lines are parsed with the same timestamp, the timestamp is adjusted in order to preserve the ordering. If the timestamps were the same, the lines would be merged into one record in the database, potentially overwriting the earlier value. Does this seem like the probable cause to you based on the data file?
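To illustrate the adjustment (a hypothetical sketch, not actual output from this issue; the measurement and field names are invented): three lines that all parse to the same second come out as three distinct points, each nudged forward slightly so the ordering survives:

```
# three source lines that all parse to the same second;
# the parser offsets each subsequent point to keep them distinct
example_log value=1 1549015200000000000
example_log value=2 1549015200000000001
example_log value=3 1549015200000000002
```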
@danielnelson: We have been doing this InfluxDB logging for a year and never had this issue before. The grok pattern for the above source file:
I have highlighted the duplicate records in the Influx_measure_exported.xlsx file. Please let us know how to prevent this from happening. Warm regards
In source_file.txt, all points have the same timestamp, which causes the logparser plugin to adjust the timestamps as described in my last comment. The values in the .xlsx document are not actually duplicates either; each line contains different values corresponding to the lines in the .txt file. Can you show me how you would like the database records to look for this document?
@danielnelson: In the database shell prompt the result set was looking messy, so we exported it to a .xlsx file for better readability. Also, we have been loading data with this same-timestamp scenario for a year but never encountered such an issue.
Okay, I see the duplicates you are referring to in the .xlsx file now. When I process the file, though, I don't get any duplicate points. Here is the output (created using a file output), which shows the modifications to the source timestamp but still the same number of lines as in the source text:
What would be helpful is if you could find a simple method of updating the file that produces the duplicates, and then I can try to explain what is happening. I do see you have
@danielnelson: The Telegraf service is running 24x7 and we are not even restarting it. Also, the source file is not modified after it is created, nor is it updated. Each file is created in a 10-minute window, the data as well as the filename is unique, and the file is left as-is.
I could see this occurring if the file were renamed, but if that's not happening then it is harder to explain where this is coming from. Is the file written in place in the location specified in the config?
Basically, we move the files from the dedicated path to another directory after one day of loading.
I suspect the data exists in more than one file. Normally this wouldn't matter, since InfluxDB would merge the lines, but because of the timestamp modification that grok does, the data is being inserted at differing timestamps. I suggest we fix this by adding an option to the grok parser to disable the timestamp adjustment behavior.
Thank you for adding this issue to the milestone, and for all your help. When can we expect this patch release?
Should be around the end of February or early March.
Thank you so much for your help.
@danielnelson: There were some patches installed on our InfluxDB server. Please find below the RPMs installed during the InfluxDB duplication issue:
Just want to know: did the above RPMs affect the InfluxDB server, as one of them, "tzdata", is a patch related to time zone data?
Those packages shouldn't cause any problems. We added the option to disable the timestamp altering; you can use it by setting `unique_timestamp = "disable"`.
This option will be available in 1.10-rc1, which should be released tomorrow.
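For reference, a minimal config sketch showing where the option would go (the file path and pattern name here are placeholders, and I'm assuming the option sits in the grok sub-table alongside `patterns`):

```toml
[[inputs.logparser]]
  ## placeholder path; point this at your actual log files
  files = ["/path/to/source_file.txt"]
  from_beginning = true

  [inputs.logparser.grok]
    ## placeholder pattern name
    patterns = ["%{MY_LOG_PATTERN}"]
    ## turn off the automatic per-line timestamp adjustment
    unique_timestamp = "disable"
```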
@danielnelson "unique_timestamp" not working for me in logparser plugin. Its sending duplicate data to the influxdb. telegraf version : Telegraf 1.10.0~rc1 |
@1995prit1331am5991 What you will need to do to avoid this is to parse the date from the logfile into the metric timestamp, instead of having it as a field. This way, if the file is read twice, the points will be overwritten in the database. You can do this by making a change like this to the date part of the pattern:

- \[%{NOTSPACE:date} \+%{INT}\]
+ \[%{DATA:date:ts-"01/Jan/2006:15:04:05 -0700"}\]
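In context, the pattern line in telegraf.conf would look something like the sketch below; the surrounding captures (`ip`, `action`, `api`, `status`) are stand-ins for whatever the real pattern contains:

```toml
[inputs.logparser.grok]
  ## the ts-"..." modifier parses the captured text using a Go
  ## reference-time layout and makes it the metric timestamp
  patterns = ['%{IP:ip} \[%{DATA:date:ts-"01/Jan/2006:15:04:05 -0700"}\] %{WORD:action} %{NOTSPACE:api} %{NUMBER:status}']
```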
@danielnelson Thank you, I have made the changes in my telegraf.conf; it's working fine for me and no longer sending duplicate data to InfluxDB. Now I am facing another issue: I have log lines with the same timestamp but a different API, client IP, etc. Attachments:
InfluxDB allows only a single value per field for each unique combination of measurement+tags+timestamp. If it receives another value with the same measurement+tags+timestamp, the new value overwrites the previous one. Right now all of your log data is being parsed as fields, which means you can only have one value per timestamp. The solution is to make the parts of the data that identify the time series into tags; you can do this by adding `:tag` to the pattern items:

- %{WORD:action}
+ %{WORD:action:tag}

I would make ip, action, api, and status into tags. You still might lose some data: if the logfile contains multiple requests with the same ip, action, api, and status during the same second, only the later one will be saved. If you have more questions, it is best to ask them over at the InfluxData Community site.
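To make the effect concrete, here is a hypothetical before/after in line protocol (the measurement, field, and values are invented for illustration, not taken from the attachments). With everything as fields, two requests in the same second share one series, so the second write overwrites the first; with tags, each combination becomes its own series:

```
# all fields: same series + same timestamp, so the second point overwrites the first
weblog ip="10.0.0.1",action="GET",api="/login",status=200 1549015200000000000
weblog ip="10.0.0.2",action="GET",api="/logout",status=200 1549015200000000000

# ip/action/api/status as tags: two distinct series, both points survive
weblog,ip=10.0.0.1,action=GET,api=/login,status=200 resp_time=12 1549015200000000000
weblog,ip=10.0.0.2,action=GET,api=/logout,status=200 resp_time=9 1549015200000000000
```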
Hi All,
Need help with a scenario that we are facing:
We are loading data into InfluxDB, parsed via Telegraf.
We have one record in the source file. Please find below a snapshot of the source record:
After loading the above source file into InfluxDB, we are getting two records when we query it.
Please find below a snapshot of the query result set from InfluxDB:
It seems that InfluxDB itself creates a duplicate record with another timestamp, i.e.
The above scenario is observed sometimes, but not for each and every data-loading source file.
Please help us.
Warm regards
//Ashlesh