Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Duplication in InfluxDB #5394

Closed
ashleshmandke opened this issue Feb 8, 2019 · 20 comments
Closed

Data Duplication in InfluxDB #5394

ashleshmandke opened this issue Feb 8, 2019 · 20 comments
Milestone

Comments

@ashleshmandke
Copy link

ashleshmandke commented Feb 8, 2019

Hi All,

Need help in scenario that we are facing:-
We are loading data into influxdb parsing via telegraf.
We have one record in source file. Please find below snapshot of the source record-

image

After data loading above source file into InfluxDB we are getting two records when we are trying to query it.
Please find below snapshot after querying the result set of influxdb
image

It seems that influxdb creates duplicate records itself with another timestamp i.e
image

The above scenario is observed sometimes but not in each and every data loading source file.

Please help us.

Warm regrards
//Ashlesh

@danielnelson
Copy link
Contributor

Can you paste your Telegraf config for the plugin you are using?

@ashleshmandke
Copy link
Author

ashleshmandke commented Feb 11, 2019

@danielnelson : Please find attachment of telegraf config file.
I am using telegraf version 1.6.4 and influxDB version 1.4.3
Is it due to version issue? 1.5 might have fixed this issue if its a bug
telegraf.txt

Warm Regards
//Ashlesh

@danielnelson
Copy link
Contributor

I believe the cause is a little known, and undocumented, behavior of logparser, or more accurately our grok parser. If two or more consecutive lines are parsed with the same timestamp, the timestamp is adjusted in order to preserve the ordering. If the timestamp were the same they would be merged into one record in the database, potentially overwriting the earlier value.

Does this seem like the the probable cause to you based on the data file?

@ashleshmandke
Copy link
Author

ashleshmandke commented Feb 12, 2019

@danielnelson : We have been doing this influxdb logging since 1 year. Never had this issue before.
Please find actual source file and the result set from influxdb by querying it.
source_file.txt
Influx_measure_exported.xlsx

Grok pattern for the above source file:-
['%{NUMBER:field_1:int}^|^%{HOSTNAME:field_2:tag}^|^%{TIMESTAMP_ISO8601:field_3:ts-"2006-01-02 15:04:05.999999"}^|^%{NUMBER:field_4:tag}^|^%{WORD:field_5:tag}^|^%{NUMBER:field_6:float}^|^%{NUMBER:field_7:float}^|^%{NUMBER:field_8:float}^|^%{NUMBER:field_9:float}^|^%{NUMBER:field_10:float}^|^%{NUMBER:field_11:tag}']

I have highlighted the duplicate records in Influx_measure_exported.xlsx file. Please let us know how to prevent this from happening.

Warm regards
//Ashlesh

@danielnelson
Copy link
Contributor

In source_file.txt, all points have the same timestamp, which is causing the logparser plugin to adjust the timestamps as described in my last comment. The values in the .xlsx document are not actually duplicate either, each line contains different values corresponding to the lines in the .txt file.

Can you show me how you would like the database records to look for this document?

@ashleshmandke
Copy link
Author

@danielnelson : . In database shell prompt, it was looking messy so we have exported the result set in .xlsx file for better understanding.
In earlier .xlsx file you can find 1st record which is getting exactly matched with the 35th record, only difference is the timestamp as all the field are exactly matching. It is re-creating the same record again by changing the nano sec part, as you are saying it is trying to create sequence to preserve ordering.
(2019-02-08T06:10:00.00Z -> 2019-02-08T06:10:00.002Z)

Also for a year we are loading data with same timestamp scenario but never encountered such issue.

@danielnelson
Copy link
Contributor

Okay, I see the duplicates your are referring to in the xlsx file now. When I process the file though I don't get any duplicate points, here is the output (created using a file output), which shows the modifications to the source timestamp but still the same number of lines as in the source text:

test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=5.0205,field_1=46i,field_7=3.222,field_8=7.898,field_9=230.943,field_10=0.076666667 1549654800000000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.003333333,field_1=2i,field_6=5.626,field_7=5.564,field_8=5.688,field_9=11.252 1549654800001000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.003333333,field_6=4.2125,field_7=3.127,field_8=5.298,field_1=2i,field_9=8.425 1549654800002000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=7.947,field_10=0.068333333,field_7=3.219,field_9=180.923,field_1=41i,field_6=4.412756098 1549654800003000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.001666667,field_1=1i,field_6=3.996,field_9=3.996,field_7=3.996,field_8=3.996 1549654800004000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=6i,field_8=6.246,field_9=27.121,field_10=0.01,field_6=4.520166667,field_7=3.399 1549654800005000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=5.734,field_10=0.001666667,field_1=1i,field_7=5.734,field_9=5.734,field_6=5.734 1549654800006000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=4.92175,field_7=3.585,field_9=19.687,field_10=0.006666667,field_1=4i,field_8=5.783 1549654800007000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=4.003,field_10=0.005,field_1=3i,field_8=4.646,field_9=13.028,field_6=4.342666667 1549654800008000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=2.52,field_8=7.098,field_6=4.23465,field_9=169.386,field_10=0.066666667,field_1=40i 1549654800009000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=3.776,field_9=3.776,field_7=3.776,field_8=3.776,field_10=0.001666667,field_1=1i 1549654800010000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.733333333,field_1=440i,field_6=4.569902273,field_7=2.669,field_8=9.136,field_9=2010.757 1549654800011000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.003333333,field_7=6.151,field_8=6.569,field_9=12.72,field_1=2i,field_6=6.36 1549654800012000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.545,field_6=3.275553517,field_7=2.191,field_1=327i,field_8=10.96,field_9=1071.106 1549654800013000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=7i,field_6=4.621285714,field_7=2.324,field_8=9.037,field_9=32.349,field_10=0.011666667 1549654800014000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=1i,field_9=4.556,field_10=0.001666667,field_6=4.556,field_7=4.556,field_8=4.556 1549654800015000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=31.681,field_6=4.525857143,field_7=3.706,field_8=5.394,field_10=0.011666667,field_1=7i 1549654800016000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=7.577,field_1=76i,field_7=2.62,field_10=0.126666667,field_6=4.644828947,field_9=353.007 1549654800017000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=3.703,field_8=5.989,field_9=13.419,field_10=0.005,field_1=3i,field_6=4.473 1549654800018000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=8.847,field_1=2i,field_8=4.928,field_6=4.4235,field_7=3.919,field_10=0.003333333 1549654800019000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=5.879,field_8=5.879,field_9=5.879,field_1=1i,field_10=0.001666667,field_6=5.879 1549654800020000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=4.658,field_9=9.888,field_10=0.003333333,field_1=2i,field_8=5.23,field_6=4.944 1549654800021000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=6.415,field_1=157i,field_9=709.266,field_10=0.261666667,field_6=4.517617834,field_7=3.275 1549654800022000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=2i,field_7=4.121,field_6=6.636,field_10=0.003333333,field_8=9.151,field_9=13.272 1549654800023000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=4.471583333,field_7=3.327,field_9=214.636,field_8=6.148,field_10=0.08,field_1=48i 1549654800024000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=8.733,field_10=0.066666667,field_1=40i,field_6=4.59295,field_9=183.718,field_7=2.981 1549654800025000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=33.386,field_1=8i,field_7=2.896,field_8=5.175,field_6=4.17325,field_10=0.013333333 1549654800026000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=18.832,field_10=0.006666667,field_1=4i,field_7=3.913,field_6=4.708,field_8=5.284 1549654800027000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=1,field_5=U,host=loaner,path=/home/dbn/source_file.txt field_10=0.001666667,field_7=777.102,field_1=1i,field_6=777.102,field_9=777.102,field_8=777.102 1549654800028000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=3.529044444,field_7=2.387,field_9=158.807,field_10=0.075,field_1=45i,field_8=7.993 1549654800029000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=1i,field_9=3.451,field_6=3.451,field_7=3.451,field_8=3.451,field_10=0.001666667 1549654800030000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=4.917,field_9=4.917,field_1=1i,field_6=4.917,field_8=4.917,field_10=0.001666667 1549654800031000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=4.304,field_8=4.304,field_7=4.304,field_10=0.001666667,field_9=4.304,field_1=1i 1549654800032000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=32.916,field_1=7i,field_6=4.702285714,field_7=4.182,field_8=5.422,field_10=0.011666667 1549654800033000000

What would be helpful is if you could find a simple method for updating the file that produces the duplicates and then I try to explain what is happening.

I do see you have from_beginning = true which would result in the input file being reprocessed if Telegraf is restarted, and I can think of modifications to the input file that could result in the data you are seeing in InfluxDB. How exactly is the input file being updated?

@ashleshmandke
Copy link
Author

@danielnelson : Telegraf service is running 24 X 7 and we are not even restarting it. Also there is no modification of source file after been created neither the source file is updated. Each file is created at 10min window wherein the data as well as filename is unique and left as it is.
Are you suggesting that telegraf is reprocessing the source file and adding duplicates in InfluxDB.

@danielnelson
Copy link
Contributor

Are you suggesting that telegraf is reprocessing the source file and adding duplicates in InfluxDB.

I could see this occurring if the file was renamed, but if that's not happening then it is harder to explain where this is coming from. Is the file written in place in the location specified in the config "/tmp/abc/*.dat" or is it moved into the directory after being completely written?

@ashleshmandke
Copy link
Author

Basically, we move the files from the dedicated path to another directory after 1 day of loading.

@danielnelson
Copy link
Contributor

I suspect the data exists in more than file, normally it wouldn't matter since InfluxDB would merge lines, but because of the timestamp modification that grok does it is being inserted at differing timestamps. I suggest we fix this by adding an option to disable the timestamp adjustments behavior to the grok parser.

@danielnelson danielnelson added this to the 1.10.0 milestone Feb 15, 2019
@ashleshmandke
Copy link
Author

Thank you for adding this issue in the milestone. We would like to thank you for all your help. So when can we expect this patch release.

@danielnelson
Copy link
Contributor

Should be around the end of the February or early March.

@ashleshmandke
Copy link
Author

Thank you so much for your help

@ashleshmandke
Copy link
Author

ashleshmandke commented Feb 25, 2019

@danielnelson : There were some patches that were installed on our InfluxDB server, Please find below rpm installed during the influxDB duplication issue:-

  1. tzdata-2018i-1.el7.noarch
  2. systemd-sysv-219-62.el7_6.2.x86_64
  3. libgudev1-219-62.el7_6.2.x86_64
  4. systemd-219-62.el7_6.2.x86_64
  5. systemd-libs-219-62.el7_6.2.x86_64

Just want to know, did the above rpm affected InfluxDB server as one of the rpm "tzdata" i.e patch related to timezone.

@danielnelson
Copy link
Contributor

Those packages shouldn't cause any problems. We added the option to disable the timestamp altering, you can use it by setting unique_timestamp = "disabled" in logparser:

[[inputs.logparser]]
  files = ["file"]
  from_beginning = true

  [inputs.logparser.grok]
    patterns = ["blah"]
    unique_timestamp = "disabled"

This option will be available in 1.10-rc1, which should be released tomorrow.

@PritamKrishnaMali
Copy link

PritamKrishnaMali commented Jul 26, 2019

@danielnelson "unique_timestamp" not working for me in logparser plugin. Its sending duplicate data to the influxdb.
I am attached my access.log file, telegraf.config file and influxdb measurement output image.

telegraf version : Telegraf 1.10.0~rc1

telegraf.txt
access.log
influxdboutput

@danielnelson
Copy link
Contributor

@1995prit1331am5991 What you will need to do to avoid this is to parse the date from the logfile into the metric timestamp, instead of having it as a field. This way if the file is read twice, the points will be overwritten in the database. You can do this by making a change like this to the date part of the pattern:

- \[%{NOTSPACE:date} \+%{INT}\]
+ \[%{DATA:date:ts-"01/Jan/2006:15:04:05 -0700"}\]

@PritamKrishnaMali
Copy link

PritamKrishnaMali commented Jul 29, 2019

@danielnelson Thank you

I have done the changes in my telegraf.conf its working fine for me and its not sending duplicate data to influxdb.

Now I am facing some other issue, I have logs like the same timestamp but different API, clientIP etc.
It's taking only the first log of that timestamp but other logs of the same timestamp are not being sent to influxdb.

Attachments :
access.log
telegraf_conf.txt
influxdb_output

@danielnelson
Copy link
Contributor

InfluxDB allows for only a single value per field for each unique combination of measurement+tags+timestamp. If it receives another value with the same measurement+tags+timestamp then the new value overwrites the previous value.

Right now, all your log data is being parsed as fields which means you can only have one value per timestamp. The solution is to make the parts of the data that identify the time series into tags, you can do this by adding :tag to the grok pattern:

- %{WORD:action}
+ %{WORD:action:tag}

I would make ip, action, api, and status into tags. You still might lose some data, if the logfile contains multiple requests from the same ip,action,api,status during the same second, only the later one will be saved.

If you have more questions, best to ask them over at the InfluxData Community site.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants