Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: CSV parser does not handle csv_skip_rows correctly #10347

Closed
DennisBayer opened this issue Dec 24, 2021 · 7 comments
Closed

Regression: CSV parser does not handle csv_skip_rows correctly #10347

DennisBayer opened this issue Dec 24, 2021 · 7 comments
Labels
area/exec bug unexpected problem or unintended behavior

Comments

@DennisBayer
Copy link

Relevent telegraf.conf

# (equal for 1.20.4 and 1.21.1)

[[inputs.exec]]
  name_override = "mtr"
  interval = "60s"
  commands = [
    "mtr -i3 -C exampleA.com",
    "mtr -i3 -C 8.8.8.8",
    "mtr -i3 -C exampleB.com"
  ]
  timeout = "55s"
  data_format = "csv"
  csv_skip_rows = 1
  csv_column_names = [ "", "", "status", "dest", "hop", "ip", "loss", "snt", "", "", "avg", "best", "worst", "stdev"]
  csv_tag_columns = ["dest", "hop", "ip"]

Logs from Telegraf

Note: I added new lines in the test run between each mtr output for easier reading.

1.20.4

2021-12-24T16:55:20Z I! Starting Telegraf 1.20.4
2021-12-24T16:55:20Z I! Using config file: /etc/telegraf/telegraf.conf
2021-12-24T16:55:20Z I! Loaded inputs: cpu disk diskio docker exec kernel mem net ping processes sensors swap system
2021-12-24T16:55:20Z I! Loaded aggregators: 
2021-12-24T16:55:20Z I! Loaded processors: regex
2021-12-24T16:55:20Z I! Loaded outputs: influxdb_v2
2021-12-24T16:55:20Z I! Tags enabled: host=mimas
2021-12-24T16:55:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"mimas", Flush Interval:10s
> no further logs
telegraf@mimas:/$ telegraf --config /etc/telegraf/telegraf.conf --input-filter exec --test --debug
2021-12-24T16:55:52Z I! Starting Telegraf 1.20.4
2021-12-24T16:55:52Z D! [agent] Initializing plugins
2021-12-24T16:55:52Z D! [agent] Starting service inputs
> mtr,dest=10.30.2.2,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.28,best=0.23,loss=60,snt=10i,status="OK",stdev=0.06,worst=0.36 1640364989000000000
[snip]
> mtr,dest=10.30.2.2,hop=13,hoppad=13,host=mimas,ip=10.30.2.2 avg=10.86,best=9.58,loss=0,snt=10i,status="OK",stdev=0.8,worst=11.94 1640364989000000000

> mtr,dest=8.8.8.8,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.23,best=0.18,loss=0,snt=10i,status="OK",stdev=0.05,worst=0.35 1640364990000000000
[snip]
> mtr,dest=8.8.8.8,hop=16,hoppad=16,host=mimas,ip=dns.google avg=20.88,best=14.3,loss=0,snt=10i,status="OK",stdev=4.45,worst=28.09 1640364990000000000
2021-12-24T16:56:29Z D! [agent] Stopping service inputs
2021-12-24T16:56:29Z D! [agent] Input channel closed
2021-12-24T16:56:29Z D! [agent] Processor channel closed
2021-12-24T16:56:29Z D! [agent] Stopped Successfully

> mtr,dest=example.com,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.22,best=0.18,loss=60,snt=10i,status="OK",stdev=0.04,worst=0.26 1640364990000000000
[snip]
> mtr,dest=example.com,hop=17,hoppad=17,host=mimas,ip=com.example avg=23.26,best=19.27,loss=0,snt=10i,status="OK",stdev=3.42,worst=29.38 1640364990000000000
telegraf@mimas:/$ 
telegraf@mimas:/$ mtr -C 8.8.8.8
Mtr_Version,Start_Time,Status,Host,Hop,Ip,Loss%,Snt, ,Last,Avg,Best,Wrst,StDev,
MTR.0.92,1640365204,OK,8.8.8.8,1,192.168.178.4,0.00,10,0,0.26,0.23,0.20,0.33,0.04
[snip]
MTR.0.92,1640365204,OK,8.8.8.8,16,dns.google,0.00,10,0,24.36,44.54,18.01,199.74,54.84

1.21.1

2021-12-24T17:12:33Z I! Starting Telegraf 1.21.1
2021-12-24T17:12:33Z I! Using config file: /etc/telegraf/telegraf.conf
2021-12-24T17:12:33Z I! Loaded inputs: cpu disk diskio docker exec kernel mem net ping processes sensors swap system
2021-12-24T17:12:33Z I! Loaded aggregators: 
2021-12-24T17:12:33Z I! Loaded processors: regex
2021-12-24T17:12:33Z I! Loaded outputs: influxdb_v2
2021-12-24T17:12:33Z I! Tags enabled: host=mimas
2021-12-24T17:12:33Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"mimas", Flush Interval:10s
2021-12-24T17:13:45Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "loss" on measurement "mtr" is type string, already exists as type float dropped=2
2021-12-24T17:14:43Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "best" on measurement "mtr" is type string, already exists as type float dropped=3
2021-12-24T17:15:44Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "loss" on measurement "mtr" is type string, already exists as type float dropped=3
2021-12-24T16:44:17Z I! Starting Telegraf 1.21.1
2021-12-24T16:44:17Z I! Loaded inputs: exec
2021-12-24T16:44:17Z I! Loaded aggregators: 
2021-12-24T16:44:17Z I! Loaded processors: regex
2021-12-24T16:44:17Z I! W! Outputs are not used in testing mode!
2021-12-24T16:44:17Z I! Tags enabled: host=mimas
2021-12-24T16:44:17Z D! [agent] Initializing plugins
2021-12-24T16:44:17Z D! [agent] Starting service inputs

telegraf@mimas:/$ telegraf --config /etc/telegraf/telegraf.conf --input-filter exec --test --debug
> mtr,dest=10.30.2.2,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.23,best=0.2,loss=0,snt=10i,status="OK",stdev=0.04,worst=0.3 1640364294000000000
[snip]
> mtr,dest=10.30.2.2,hop=13,hoppad=13,host=mimas,ip=10.30.2.2 avg=11.01,best=9.28,loss=0,snt=10i,status="OK",stdev=0.9,worst=12.04 1640364294000000000

> mtr,dest=Host,hop=Hop,host=mimas,ip=Ip avg="Avg",best="Best",loss="Loss%",snt="Snt",status="Status",stdev="StDev",worst="Wrst" 1640364294000000000
> mtr,dest=8.8.8.8,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.21,best=0.17,loss=10,snt=10i,status="OK",stdev=0.02,worst=0.24 1640364294000000000
[snip]
> mtr,dest=8.8.8.8,hop=16,hoppad=16,host=mimas,ip=dns.google avg=20.8,best=15.37,loss=0,snt=10i,status="OK",stdev=4.08,worst=29.39 1640364294000000000
2021-12-24T16:44:54Z D! [agent] Stopping service inputs
2021-12-24T16:44:54Z D! [agent] Input channel closed

> mtr,dest=Host,hop=Hop,host=mimas,ip=Ip avg="Avg",best="Best",loss="Loss%",snt="Snt",status="Status",stdev="StDev",worst="Wrst" 1640364294000000000
2021-12-24T16:44:54Z D! [agent] Processor channel closed
> mtr,dest=example.com,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.22,best=0.16,loss=20,snt=10i,status="OK",stdev=0.03,worst=0.25 1640364294000000000
2021-12-24T16:44:54Z D! [agent] Stopped Successfully
[snip]
> mtr,dest=example.com,hop=17,hoppad=17,host=mimas,ip=com.examplee avg=28.07,best=21.43,loss=0,snt=10i,status="OK",stdev=7.77,worst=48.27 1640364294000000000
telegraf@mimas:/$ 
telegraf@mimas:/$ mtr -C 8.8.8.8
Mtr_Version,Start_Time,Status,Host,Hop,Ip,Loss%,Snt, ,Last,Avg,Best,Wrst,StDev,
MTR.0.92,1640361652,OK,8.8.8.8,1,192.168.178.4,0.00,10,0,0.29,0.25,0.20,0.38,0.05
[snip]
MTR.0.92,1640361652,OK,8.8.8.8,16,dns.google,0.00,10,0,19.24,26.25,17.39,36.88,7.37

System info

Telegraf 1.21.1; Docker Compose version 2.2.2; Docker version 20.10.11, build dea9396e18; Manjaro

Docker

./telegraf/Dockerfile-mtr

ARG IMAGE_NAME
ARG IMAGE_VERSION
 
FROM ${IMAGE_NAME:-telegraf}:${IMAGE_VERSION:-1.21.1}
 
RUN apt-get update && apt-get install -y --no-install-recommends mtr-tiny && \
  rm -rf /var/lib/apt/lists/*

./docker-compose.yml

telegraf:
  image: telegraf:1.21.1.mtr
  build:
    context: ./telegraf
    dockerfile: Dockerfile-mtr
    args:
      - IMAGE_NAME=telegraf
      - IMAGE_VERSION=1.21.1
  user: telegraf:963
  # Take care of MTR zombie processes
  init: true
  restart: unless-stopped
  network_mode: "host"
  volumes:
    # Local telegraf config
    - ./telegraf/config/telegraf.conf:/etc/telegraf/telegraf.conf:ro
    # Monitor docker engine host
    - /:/hostfs:ro
    # Monitor docker containers
    - /var/run/docker.sock:/var/run/docker.sock
  environment:
    # Monitor docker engine host
    - HOST_MOUNT_PREFIX=/hostfs
    - HOST_ETC=/hostfs/etc
    - HOST_PROC=/hostfs/proc
    - HOST_SYS=/hostfs/sys
    - HOST_VAR=/hostfs/var
    - HOST_RUN=/hostfs/run

Steps to reproduce

  1. Configuring mtr to trace three targets and emit each result as csv.
  2. Start telegraf container using version 1.21.1 using the same environment as before (except user-value in docker-compose)
  3. Log contains error entries

Expected behavior

  • Each CSV output is parsed properly
  • Header row of each CSV result of mtr is ignored and the custom CSV header mapping is used as in 1.20.4

Actual behavior

  • CSV is not parsed correctly, because the header row is only skipped for the first output of mtr. The header of the 2nd and 3rd output is processed.
  • Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "best" on measurement "mtr" is type string, already exists as type float dropped=3 is logged
  • The test run between 1.20.4 and 1.21.1 differs in containing the lines mtr,dest=Host,hop=Hop,host=mimas,ip=Ip avg="Avg",best="Best",loss="Loss%",snt="Snt",status="Status",stdev="StDev",worst="Wrst" 1640364294000000000 for the latter one.

Additional info

Since I did not find any release notes about a changed behavior, I assume that this is a regression.

Reading the release notes pointed me to a fix #9865 which is related to csv_skip_rows and csv_header_row_count.
@etycomputer: Might this behavior be related to your fix in any way or am I on the wrong track?

Any combination of csv_skip_rows = 0|1 and csv_header_row_count = 0|1 did not result in any success.

@DennisBayer DennisBayer added the bug unexpected problem or unintended behavior label Dec 24, 2021
@jjh74
Copy link
Contributor

jjh74 commented Dec 26, 2021

I think problem is #9865: commit (db86904#diff-2c668efeba00e299cb699055ee85b8a114f60aae8d7684c4f4f70ac2bc1b70f4) changes skiprow loop from:

for i := 0; i < p.SkipRows; i++

to

for p.SkipRows > 0 {
...
p.SkipRows--

So on second input p.SkipRows is already zero. (Probaly similar problem with p.HeaderRowCount).

As a workaround you could try splitting [[inputs.exec]] into multiple inputs each with only one commands:

[[inputs.exec]]
  name_override = "mtr"
  interval = "60s"
  commands = [ "mtr -i3 -C exampleA.com" ]
  timeout = "55s"
  data_format = "csv"
  csv_skip_rows = 1
  csv_column_names = [ "", "", "status", "dest", "hop", "ip", "loss", "snt", "", "", "avg", "best", "worst", "stdev"]
  csv_tag_columns = ["dest", "hop", "ip"]

[[inputs.exec]]
  name_override = "mtr"
  interval = "60s"
  commands = [ "mtr -i3 -C 8.8.8.8" ]
  timeout = "55s"
  data_format = "csv"
  csv_skip_rows = 1
  csv_column_names = [ "", "", "status", "dest", "hop", "ip", "loss", "snt", "", "", "avg", "best", "worst", "stdev"]
  csv_tag_columns = ["dest", "hop", "ip"]
...

@DennisBayer
Copy link
Author

DennisBayer commented Dec 27, 2021

Thanks for reaching out and digging in the changeset.

Yeah, seems like there is some hassle with the counter, which should not be decreased - at least not the original config value. Something like?!:

var loopSkipRowCounter = p.SkipRows
for loopSkipRowCounter > 0 {
...
loopSkipRowCounter--

Unfortunately, I'm not proficient enough in the go language nor in the project architecture to provide a patch.

Related to your workaround:

The debug run looks good, ...

telegraf@mimas:/$ telegraf --config /etc/telegraf/telegraf.conf --input-filter exec --test --debug
2021-12-27T09:52:08Z I! Starting Telegraf 1.21.1
2021-12-27T09:52:08Z I! Loaded inputs: exec (3x)
2021-12-27T09:52:08Z I! Loaded aggregators: 
2021-12-27T09:52:08Z I! Loaded processors: regex
2021-12-27T09:52:08Z I! W! Outputs are not used in testing mode!
2021-12-27T09:52:08Z I! Tags enabled: host=mimas
2021-12-27T09:52:08Z D! [agent] Initializing plugins
2021-12-27T09:52:08Z D! [agent] Starting service inputs

> mtr,dest=10.30.2.2,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.28,best=0.24,loss=0,snt=10i,status="OK",stdev=0.04,worst=0.39 1640598765000000000
[snip]
> mtr,dest=10.30.2.2,hop=13,hoppad=13,host=mimas,ip=10.30.2.2 avg=9.65,best=8.4,loss=0,snt=10i,status="OK",stdev=1.18,worst=12.46 1640598765000000000

> mtr,dest=8.8.8.8,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.73,best=0.24,loss=0,snt=10i,status="OK",stdev=1.46,worst=4.9 1640598766000000000
[snip]
> mtr,dest=8.8.8.8,hop=16,hoppad=16,host=mimas,ip=dns.google avg=24.04,best=16.52,loss=0,snt=10i,status="OK",stdev=7.71,worst=38.86 1640598766000000000

2021-12-27T09:52:46Z D! [agent] Stopping service inputs
2021-12-27T09:52:46Z D! [agent] Input channel closed
2021-12-27T09:52:46Z D! [agent] Processor channel closed
2021-12-27T09:52:46Z D! [agent] Stopped Successfully

> mtr,dest=exampleA.com,hop=1,hoppad=01,host=mimas,ip=192.168.178.4 avg=0.28,best=0.21,loss=60,snt=10i,status="OK",stdev=0.06,worst=0.33 1640598766000000000
[snip]
> mtr,dest=exampleA.com,hop=17,hoppad=17,host=mimas,ip=com.exampleA avg=26.96,best=21.72,loss=0,snt=10i,status="OK",stdev=4.14,worst=34.13 1640598766000000000
telegraf@mimas:/$ 

...but the log still complains about invalid values:

2021-12-27T09:54:39Z I! Starting Telegraf 1.21.1
2021-12-27T09:54:39Z I! Using config file: /etc/telegraf/telegraf.conf
2021-12-27T09:54:39Z I! Loaded inputs: cpu disk diskio docker exec (3x) kernel mem net ping processes sensors swap system
2021-12-27T09:54:39Z I! Loaded aggregators: 
2021-12-27T09:54:39Z I! Loaded processors: regex
2021-12-27T09:54:39Z I! Loaded outputs: influxdb_v2
2021-12-27T09:54:39Z I! Tags enabled: host=mimas
2021-12-27T09:54:39Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"mimas", Flush Interval:10s
2021-12-27T09:56:40Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "loss" on measurement "mtr" is type string, already exists as type float dropped=3
2021-12-27T09:57:40Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "loss" on measurement "mtr" is type string, already exists as type float dropped=3
2021-12-27T09:58:44Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "snt" on measurement "mtr" is type string, already exists as type integer dropped=3
2021-12-27T09:59:37Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "worst" on measurement "mtr" is type string, already exists as type float dropped=1
2021-12-27T09:59:51Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "worst" on measurement "mtr" is type string, already exists as type float dropped=2
2021-12-27T10:00:43Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "loss" on measurement "mtr" is type string, already exists as type float dropped=3
2021-12-27T10:01:47Z E! [outputs.influxdb_v2] Failed to write metric (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "best" on measurement "mtr" is type string, already exists as type float dropped=3

Are the outputs collected in some sort of cache and batch processed during the flush?

@byte-opps
Copy link

byte-opps commented Dec 31, 2021

I have a simular issue with skip-headers in [[inputs.exec]] running a powershell script that spits out a two column csv with headers. The first loop is fine and writes to DB. In the next telegraf agent loop/execution the csv header is not skipped, and so the write to the database fails the data type check and no further data is recored.
Also this is when running as a windows service, the same result when running from powershell. --test and -- debug work as expected for the 1st agent loop but the --test only runs once, would be nice if the test went more times, as i think this would show the issues.
This is not a permissions issue as it is running once correctly and is running agian to get the CSV again which is then not pasrsed correctly, 2nd 3rd....agent loops.

I have tried all combinations of csv_header_row_count and csv_skip_rows and it doesn't seam to make any changes.

I have encluded the below that should be all you need to recreate the issue?

I have version Telegraf 1.21.1 (git: HEAD 7c9a9c1)

telegraf.conf

[[inputs.exec]]
  commands = ["powershell.exe c:/Program` files/influxdata/telegraf/get-memory-details-csv-bytes.ps1"]
  timeout = "2s"
  name_override = "proc_mem"
  data_format = "csv"
  csv_header_row_count = 1
  csv_column_names = ["proc_name", "value"]
  csv_column_types = ["string", "int"]
  csv_skip_rows = 0
  csv_tag_columns = ["proc_name"]

get-memory-details-csv-bytes.ps1

 get-process | Group-Object -Property ProcessName | Select-Object -Property Name,@{Name='WorkingSet'; Expression={(($_.Group|Measure-Object WorkingSet64 -Sum).Sum)}} | ConvertTo-Csv -NoTypeInformation

ERROR msg:
2021-12-31T13:34:50Z E! [inputs.exec] Error in plugin: column type: parse int error strconv.ParseInt: parsing "WorkingSet": invalid syntax

where 'WorkingSet' is the CSV column header from the powershell CSV

CSV example:

"Name","WorkingSet"
"ALsvc","5169152"
"ApplicationFrameHost","26857472"
"conhost","193732608"
...

@jjh74
Copy link
Contributor

jjh74 commented Jan 10, 2022

Are the outputs collected in some sort of cache and batch processed during the flush?

I don't know, but I'm guessing that the parser is initialized once on startup and after first use the counters become 0 and after that cvs header is not skipped.

If I read the comments on #9865 correctly skipping csv header only once is ok/desired for example with inputs.tail but for others like inputs.exec header should be skipped for every interval/input.
(@powersj / @srebhan: Can you take a look. I think #9865 (see comment #10347 (comment)) broke csv parser for inputs.exec)

(I think as a workaround you can use starlark to drop the incorrect csv header metric:

[[processors.starlark]]
  namepass = [ "mtr" ]
  source = '''
def apply(metric):
   if metric.fields["best"] != "Best":
       return metric
   else:
       return None
'''

)

@etycomputer
Copy link
Contributor

Hi @jjh74, I think the issue is that the inputs.exec is not creating a new parser object for each execution.
Currently, there is no functionality to reset the parser to its original state.
I am happy to work on this issue but I need feedback on how to best tackle this issue.

@sjwang90
Copy link
Contributor

sjwang90 commented Jan 11, 2022

Hi @DennisBayer @byte-opps @jjh74 @etycomputer - Would you be willing to test the artifacts on this parser restructuring PR here: #8791 (comment). This PR should address csv_skip_rows issues. Any feedback please leave on PR #8791. Thank you!

Update: artificats rebuilt as of Jan-12-2022

@sjwang90
Copy link
Contributor

Closed by #8791

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/exec bug unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

5 participants