Data Duplication in InfluxDB #5394

ashleshmandke · 2019-02-08T12:24:48Z

Hi All,

Need help in scenario that we are facing:-
We are loading data into influxdb parsing via telegraf.
We have one record in source file. Please find below snapshot of the source record-

After data loading above source file into InfluxDB we are getting two records when we are trying to query it.
Please find below snapshot after querying the result set of influxdb

It seems that influxdb creates duplicate records itself with another timestamp i.e

The above scenario is observed sometimes but not in each and every data loading source file.

Please help us.

Warm regrards
//Ashlesh

danielnelson · 2019-02-08T20:34:10Z

Can you paste your Telegraf config for the plugin you are using?

ashleshmandke · 2019-02-11T04:57:35Z

@danielnelson : Please find attachment of telegraf config file.
I am using telegraf version 1.6.4 and influxDB version 1.4.3
Is it due to version issue? 1.5 might have fixed this issue if its a bug
telegraf.txt

Warm Regards
//Ashlesh

danielnelson · 2019-02-11T21:40:22Z

I believe the cause is a little known, and undocumented, behavior of logparser, or more accurately our grok parser. If two or more consecutive lines are parsed with the same timestamp, the timestamp is adjusted in order to preserve the ordering. If the timestamp were the same they would be merged into one record in the database, potentially overwriting the earlier value.

Does this seem like the the probable cause to you based on the data file?

ashleshmandke · 2019-02-12T09:11:28Z

@danielnelson : We have been doing this influxdb logging since 1 year. Never had this issue before.
Please find actual source file and the result set from influxdb by querying it.
source_file.txt
Influx_measure_exported.xlsx

Grok pattern for the above source file:-
['%{NUMBER:field_1:int}^|^%{HOSTNAME:field_2:tag}^|^%{TIMESTAMP_ISO8601:field_3:ts-"2006-01-02 15:04:05.999999"}^|^%{NUMBER:field_4:tag}^|^%{WORD:field_5:tag}^|^%{NUMBER:field_6:float}^|^%{NUMBER:field_7:float}^|^%{NUMBER:field_8:float}^|^%{NUMBER:field_9:float}^|^%{NUMBER:field_10:float}^|^%{NUMBER:field_11:tag}']

I have highlighted the duplicate records in Influx_measure_exported.xlsx file. Please let us know how to prevent this from happening.

Warm regards
//Ashlesh

danielnelson · 2019-02-12T22:46:28Z

In source_file.txt, all points have the same timestamp, which is causing the logparser plugin to adjust the timestamps as described in my last comment. The values in the .xlsx document are not actually duplicate either, each line contains different values corresponding to the lines in the .txt file.

Can you show me how you would like the database records to look for this document?

ashleshmandke · 2019-02-13T06:13:29Z

@danielnelson : . In database shell prompt, it was looking messy so we have exported the result set in .xlsx file for better understanding.
In earlier .xlsx file you can find 1st record which is getting exactly matched with the 35th record, only difference is the timestamp as all the field are exactly matching. It is re-creating the same record again by changing the nano sec part, as you are saying it is trying to create sequence to preserve ordering.
(2019-02-08T06:10:00.00Z -> 2019-02-08T06:10:00.002Z)

Also for a year we are loading data with same timestamp scenario but never encountered such issue.

danielnelson · 2019-02-13T21:29:31Z

Okay, I see the duplicates your are referring to in the xlsx file now. When I process the file though I don't get any duplicate points, here is the output (created using a file output), which shows the modifications to the source timestamp but still the same number of lines as in the source text:

test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=5.0205,field_1=46i,field_7=3.222,field_8=7.898,field_9=230.943,field_10=0.076666667 1549654800000000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.003333333,field_1=2i,field_6=5.626,field_7=5.564,field_8=5.688,field_9=11.252 1549654800001000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.003333333,field_6=4.2125,field_7=3.127,field_8=5.298,field_1=2i,field_9=8.425 1549654800002000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=7.947,field_10=0.068333333,field_7=3.219,field_9=180.923,field_1=41i,field_6=4.412756098 1549654800003000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.001666667,field_1=1i,field_6=3.996,field_9=3.996,field_7=3.996,field_8=3.996 1549654800004000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=6i,field_8=6.246,field_9=27.121,field_10=0.01,field_6=4.520166667,field_7=3.399 1549654800005000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=5.734,field_10=0.001666667,field_1=1i,field_7=5.734,field_9=5.734,field_6=5.734 1549654800006000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=4.92175,field_7=3.585,field_9=19.687,field_10=0.006666667,field_1=4i,field_8=5.783 1549654800007000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=4.003,field_10=0.005,field_1=3i,field_8=4.646,field_9=13.028,field_6=4.342666667 1549654800008000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=2.52,field_8=7.098,field_6=4.23465,field_9=169.386,field_10=0.066666667,field_1=40i 1549654800009000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=3.776,field_9=3.776,field_7=3.776,field_8=3.776,field_10=0.001666667,field_1=1i 1549654800010000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.733333333,field_1=440i,field_6=4.569902273,field_7=2.669,field_8=9.136,field_9=2010.757 1549654800011000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.003333333,field_7=6.151,field_8=6.569,field_9=12.72,field_1=2i,field_6=6.36 1549654800012000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_10=0.545,field_6=3.275553517,field_7=2.191,field_1=327i,field_8=10.96,field_9=1071.106 1549654800013000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=7i,field_6=4.621285714,field_7=2.324,field_8=9.037,field_9=32.349,field_10=0.011666667 1549654800014000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=1i,field_9=4.556,field_10=0.001666667,field_6=4.556,field_7=4.556,field_8=4.556 1549654800015000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=31.681,field_6=4.525857143,field_7=3.706,field_8=5.394,field_10=0.011666667,field_1=7i 1549654800016000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=7.577,field_1=76i,field_7=2.62,field_10=0.126666667,field_6=4.644828947,field_9=353.007 1549654800017000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=3.703,field_8=5.989,field_9=13.419,field_10=0.005,field_1=3i,field_6=4.473 1549654800018000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=8.847,field_1=2i,field_8=4.928,field_6=4.4235,field_7=3.919,field_10=0.003333333 1549654800019000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=5.879,field_8=5.879,field_9=5.879,field_1=1i,field_10=0.001666667,field_6=5.879 1549654800020000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=4.658,field_9=9.888,field_10=0.003333333,field_1=2i,field_8=5.23,field_6=4.944 1549654800021000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=6.415,field_1=157i,field_9=709.266,field_10=0.261666667,field_6=4.517617834,field_7=3.275 1549654800022000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=2i,field_7=4.121,field_6=6.636,field_10=0.003333333,field_8=9.151,field_9=13.272 1549654800023000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=4.471583333,field_7=3.327,field_9=214.636,field_8=6.148,field_10=0.08,field_1=48i 1549654800024000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_8=8.733,field_10=0.066666667,field_1=40i,field_6=4.59295,field_9=183.718,field_7=2.981 1549654800025000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=33.386,field_1=8i,field_7=2.896,field_8=5.175,field_6=4.17325,field_10=0.013333333 1549654800026000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=18.832,field_10=0.006666667,field_1=4i,field_7=3.913,field_6=4.708,field_8=5.284 1549654800027000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=1,field_5=U,host=loaner,path=/home/dbn/source_file.txt field_10=0.001666667,field_7=777.102,field_1=1i,field_6=777.102,field_9=777.102,field_8=777.102 1549654800028000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=3.529044444,field_7=2.387,field_9=158.807,field_10=0.075,field_1=45i,field_8=7.993 1549654800029000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_1=1i,field_9=3.451,field_6=3.451,field_7=3.451,field_8=3.451,field_10=0.001666667 1549654800030000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_7=4.917,field_9=4.917,field_1=1i,field_6=4.917,field_8=4.917,field_10=0.001666667 1549654800031000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_6=4.304,field_8=4.304,field_7=4.304,field_10=0.001666667,field_9=4.304,field_1=1i 1549654800032000000
test,IP=<localhost>,application_name=telegraf,field_11=1,field_2=n1pl-pa-wsf02,field_4=5,field_5=R,host=loaner,path=/home/dbn/source_file.txt field_9=32.916,field_1=7i,field_6=4.702285714,field_7=4.182,field_8=5.422,field_10=0.011666667 1549654800033000000

What would be helpful is if you could find a simple method for updating the file that produces the duplicates and then I try to explain what is happening.

I do see you have from_beginning = true which would result in the input file being reprocessed if Telegraf is restarted, and I can think of modifications to the input file that could result in the data you are seeing in InfluxDB. How exactly is the input file being updated?

ashleshmandke · 2019-02-14T12:52:32Z

@danielnelson : Telegraf service is running 24 X 7 and we are not even restarting it. Also there is no modification of source file after been created neither the source file is updated. Each file is created at 10min window wherein the data as well as filename is unique and left as it is.
Are you suggesting that telegraf is reprocessing the source file and adding duplicates in InfluxDB.

danielnelson · 2019-02-14T20:36:17Z

Are you suggesting that telegraf is reprocessing the source file and adding duplicates in InfluxDB.

I could see this occurring if the file was renamed, but if that's not happening then it is harder to explain where this is coming from. Is the file written in place in the location specified in the config "/tmp/abc/*.dat" or is it moved into the directory after being completely written?

ashleshmandke · 2019-02-15T11:53:53Z

Basically, we move the files from the dedicated path to another directory after 1 day of loading.

danielnelson · 2019-02-15T20:04:05Z

I suspect the data exists in more than file, normally it wouldn't matter since InfluxDB would merge lines, but because of the timestamp modification that grok does it is being inserted at differing timestamps. I suggest we fix this by adding an option to disable the timestamp adjustments behavior to the grok parser.

ashleshmandke · 2019-02-18T05:15:39Z

Thank you for adding this issue in the milestone. We would like to thank you for all your help. So when can we expect this patch release.

danielnelson · 2019-02-20T00:32:02Z

Should be around the end of the February or early March.

ashleshmandke · 2019-02-20T05:51:38Z

Thank you so much for your help

ashleshmandke · 2019-02-25T12:08:23Z

@danielnelson : There were some patches that were installed on our InfluxDB server, Please find below rpm installed during the influxDB duplication issue:-

tzdata-2018i-1.el7.noarch
systemd-sysv-219-62.el7_6.2.x86_64
libgudev1-219-62.el7_6.2.x86_64
systemd-219-62.el7_6.2.x86_64
systemd-libs-219-62.el7_6.2.x86_64

Just want to know, did the above rpm affected InfluxDB server as one of the rpm "tzdata" i.e patch related to timezone.

danielnelson · 2019-02-27T02:09:30Z

Those packages shouldn't cause any problems. We added the option to disable the timestamp altering, you can use it by setting unique_timestamp = "disabled" in logparser:

[[inputs.logparser]]
  files = ["file"]
  from_beginning = true

  [inputs.logparser.grok]
    patterns = ["blah"]
    unique_timestamp = "disabled"

This option will be available in 1.10-rc1, which should be released tomorrow.

PritamKrishnaMali · 2019-07-26T07:20:13Z

@danielnelson "unique_timestamp" not working for me in logparser plugin. Its sending duplicate data to the influxdb.
I am attached my access.log file, telegraf.config file and influxdb measurement output image.

telegraf version : Telegraf 1.10.0~rc1

telegraf.txt
access.log

danielnelson · 2019-07-26T20:18:49Z

@1995prit1331am5991 What you will need to do to avoid this is to parse the date from the logfile into the metric timestamp, instead of having it as a field. This way if the file is read twice, the points will be overwritten in the database. You can do this by making a change like this to the date part of the pattern:

- \[%{NOTSPACE:date} \+%{INT}\]
+ \[%{DATA:date:ts-"01/Jan/2006:15:04:05 -0700"}\]

PritamKrishnaMali · 2019-07-29T10:33:51Z

@danielnelson Thank you

I have done the changes in my telegraf.conf its working fine for me and its not sending duplicate data to influxdb.

Now I am facing some other issue, I have logs like the same timestamp but different API, clientIP etc.
It's taking only the first log of that timestamp but other logs of the same timestamp are not being sent to influxdb.

Attachments :
access.log
telegraf_conf.txt

danielnelson · 2019-07-29T22:36:21Z

InfluxDB allows for only a single value per field for each unique combination of measurement+tags+timestamp. If it receives another value with the same measurement+tags+timestamp then the new value overwrites the previous value.

Right now, all your log data is being parsed as fields which means you can only have one value per timestamp. The solution is to make the parts of the data that identify the time series into tags, you can do this by adding :tag to the grok pattern:

- %{WORD:action}
+ %{WORD:action:tag}

I would make ip, action, api, and status into tags. You still might lose some data, if the logfile contains multiple requests from the same ip,action,api,status during the same second, only the later one will be saved.

If you have more questions, best to ask them over at the InfluxData Community site.

danielnelson added the need more info label Feb 8, 2019

danielnelson added the area/tail label Feb 11, 2019

danielnelson removed the need more info label Feb 15, 2019

danielnelson added this to the 1.10.0 milestone Feb 15, 2019

glinton mentioned this issue Feb 26, 2019

Add option to disable timestamp adjustment in grok parser #5488

Merged

danielnelson closed this as completed in #5488 Feb 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Duplication in InfluxDB #5394

Data Duplication in InfluxDB #5394

ashleshmandke commented Feb 8, 2019 •

edited

Loading

danielnelson commented Feb 8, 2019

ashleshmandke commented Feb 11, 2019 •

edited

Loading

danielnelson commented Feb 11, 2019

ashleshmandke commented Feb 12, 2019 •

edited

Loading

danielnelson commented Feb 12, 2019

ashleshmandke commented Feb 13, 2019

danielnelson commented Feb 13, 2019

ashleshmandke commented Feb 14, 2019

danielnelson commented Feb 14, 2019

ashleshmandke commented Feb 15, 2019

danielnelson commented Feb 15, 2019

ashleshmandke commented Feb 18, 2019

danielnelson commented Feb 20, 2019

ashleshmandke commented Feb 20, 2019

ashleshmandke commented Feb 25, 2019 •

edited

Loading

danielnelson commented Feb 27, 2019

PritamKrishnaMali commented Jul 26, 2019 •

edited

Loading

danielnelson commented Jul 26, 2019

PritamKrishnaMali commented Jul 29, 2019 •

edited

Loading

danielnelson commented Jul 29, 2019

Data Duplication in InfluxDB #5394

Data Duplication in InfluxDB #5394

Comments

ashleshmandke commented Feb 8, 2019 • edited Loading

danielnelson commented Feb 8, 2019

ashleshmandke commented Feb 11, 2019 • edited Loading

danielnelson commented Feb 11, 2019

ashleshmandke commented Feb 12, 2019 • edited Loading

danielnelson commented Feb 12, 2019

ashleshmandke commented Feb 13, 2019

danielnelson commented Feb 13, 2019

ashleshmandke commented Feb 14, 2019

danielnelson commented Feb 14, 2019

ashleshmandke commented Feb 15, 2019

danielnelson commented Feb 15, 2019

ashleshmandke commented Feb 18, 2019

danielnelson commented Feb 20, 2019

ashleshmandke commented Feb 20, 2019

ashleshmandke commented Feb 25, 2019 • edited Loading

danielnelson commented Feb 27, 2019

PritamKrishnaMali commented Jul 26, 2019 • edited Loading

danielnelson commented Jul 26, 2019

PritamKrishnaMali commented Jul 29, 2019 • edited Loading

danielnelson commented Jul 29, 2019

ashleshmandke commented Feb 8, 2019 •

edited

Loading

ashleshmandke commented Feb 11, 2019 •

edited

Loading

ashleshmandke commented Feb 12, 2019 •

edited

Loading

ashleshmandke commented Feb 25, 2019 •

edited

Loading

PritamKrishnaMali commented Jul 26, 2019 •

edited

Loading

PritamKrishnaMali commented Jul 29, 2019 •

edited

Loading