[receiver/filelog] Support detection of headers in header-based log formats (e.g. W3C) #18198

BinaryFissionGames · 2023-01-31T14:41:47Z

Component(s)

receiver/filelog

Is your feature request related to a problem? Please describe.

The W3C log format defines its fields through a list of headers. This allows any agent that is aware of these headers to parse any W3C log, even if the headers change mid-way through the log file (as they could in e.g. Microsoft IIS logs).

The filelog receiver currently does not support parsing these fields and using them to parse CSV lines.

Describe the solution you'd like

Ideally, there would be some way to configure the filelog receiver to recognize and pass these headers to the CSV parser so that the log lines can be parsed based on the headers.

In Stanza, this functionality was implemented in the following PRs:

Tangentially related:

Csv header delimiter observIQ/stanza#370

The way it worked was that the filelog receiver would save the header line and add it as an attribute to each log record read from the file.

Later in the pipeline, the CSV parser could use this attribute as dynamic headers, so each log line was parsed based on the header attribute that the filelog receiver had added.
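
A minimal sketch of that two-step flow, written in Python for brevity rather than the Go used by Stanza. The function names, the `header` attribute key, and the handling of the `#Fields: ` prefix are assumptions made for illustration, not Stanza's actual API:

```python
import csv

HEADER_ATTRIBUTE = "header"   # hypothetical attribute key
FIELDS_PREFIX = "#Fields: "   # W3C directive that lists the column names


def read_records(lines):
    """Yield (body, attributes) pairs, tagging each record with the latest header line."""
    current_header = None
    for line in lines:
        if line.startswith(FIELDS_PREFIX):
            # Remember the newest header; it applies to the records that follow.
            current_header = line[len(FIELDS_PREFIX):]
        elif line.startswith("#"):
            continue  # other directives (#Software, #Date, ...) are skipped in this sketch
        else:
            attributes = {}
            if current_header is not None:
                attributes[HEADER_ATTRIBUTE] = current_header
            yield line, attributes


def parse_record(body, attributes, delimiter=" "):
    """Parse one record, taking the column names from its header attribute."""
    names = next(csv.reader([attributes[HEADER_ATTRIBUTE]], delimiter=delimiter))
    values = next(csv.reader([body], delimiter=delimiter))
    if len(names) != len(values):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, values))
```

Because the header travels with each record, a `#Fields` line that changes partway through a file only affects the records that come after it.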

Describe alternatives you've considered

I haven't thought of other solutions besides the one implemented in Stanza; I would love to hear other ideas!

Additional context

Sample W3C log line, for context:

W3C log

#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2022-08-09 20:25:26
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2022-08-09 20:25:26 W3SVC1 <Server> 127.0.0.1 GET /query param1=1&parma2=2 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - - localhost 404 0 2 5029 464 83
2022-08-09 20:25:29 W3SVC1 <Server> 127.0.0.1 GET /query - 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - - localhost 404 0 2 5007 446 1
2022-08-09 20:25:32 W3SVC1 <Server> 127.0.0.1 GET / - 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - - localhost 200 0 0 927 441 1
2022-08-09 20:25:32 W3SVC1 <Server> 127.0.0.1 GET /iisstart.png - 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - http://localhost/ localhost 200 0 0 99937 374 7

github-actions · 2023-01-31T14:42:12Z

Pinging code owners:

receiver/filelog: @djaglowski

See Adding Labels via Comments if you do not have permissions to add labels yourself.

djaglowski · 2023-01-31T16:30:43Z

I think this functionality should be supported in some way, but it would be best if we can justify enhancements to each operator independently. This will avoid a scenario where loosely coupled operators have overly specific dependencies on each other.

The csv_parser already supports a header_attribute setting that behaves as you've suggested above.

A header_delimiter setting is easily justifiable in my opinion, as this simply drops the assumption that header and header_attribute must use the same delimiter as the data lines. I don't see any downsides to this.
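
To illustrate the point, here is a small Python sketch (not the csv_parser implementation) in which the header delimiter falls back to the data delimiter only when none is given. The function and parameter names are hypothetical and simply mirror the setting names discussed above:

```python
import csv


def parse_line(body, header, delimiter=",", header_delimiter=None):
    """Map the values in `body` to the column names in `header`."""
    if header_delimiter is None:
        # Previous assumption: the header is split the same way as the data.
        header_delimiter = delimiter
    names = next(csv.reader([header], delimiter=header_delimiter))
    values = next(csv.reader([body], delimiter=delimiter))
    if len(names) != len(values):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, values))
```

With this shape, existing callers that never set `header_delimiter` see no change in behavior, which is why the setting carries little risk.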

The changes to the file_input operator are a little more difficult to justify, but I think we can take a step back from W3C format and define a broader use case with a solution that satisfies the same requirements. The use case as I see it:

- A file may contain a header that is different from the rest of the file. This header may contain metadata about the file that should be attached to each individual record read from the file. In this way, this information is akin to log.file.name, log.file.path, etc.
- The header may consist of multiple lines. It may be necessary to parse these lines in order to isolate the metadata.

A perfect solution here would make minimal assumptions about the specific format of the header and introduce minimal complexity to the codebase. Still, I think it is necessary to assume that we are working with a header. In other words, I do not think we should solve for a case where metadata about the file is discovered and/or updated throughout the reading of the file.

I have some ideas for what this should look like and will post those when I have a moment to organize them.

BinaryFissionGames · 2023-01-31T17:01:18Z

Still, I think it is necessary to assume that we are working with a header. In other words, I do not think we should solve for a case where metadata about the file is discovered and/or updated throughout the reading of the file.

I think it's fair to make the assumption that metadata would be in a header (that is, in a section before any log lines begin).

djaglowski · 2023-01-31T18:01:20Z

Regarding changes to the file_input operator - we already manage per-file metadata using a FileAttributes struct, so we should not have trouble managing additional per-file metadata, once isolated.

I spoke with @BinaryFissionGames offline and identified a potential solution for isolating header metadata. I've added some additional context and suggestions:

Add a new section to the config, tentatively called header. This section and its associated behaviors should initially be enabled with a feature gate.
- The presence of the header section in the config indicates that the user intends to parse metadata from a file header.
- The header section takes inspiration from the multiline configuration, with the idea being that the user will specify a regex that matches each line in the header and fails once the header has been consumed.
- When the operator begins reading a new file, it will evaluate the regex against each line until it fails. While the regex matches, these lines are aggregated into a single multiline entry.
  - This "header entry" will then be fed into a dedicated "header pipeline", which will parse metadata from the header entry.
  - At the end of the pipeline, the attributes of this entry are permanently associated with the file such that all records from the file will be emitted with these attributes.

Sample configuration:

receivers:
  filelog:
    include: foo*.log
    header: 
      multiline_pattern: '...'
      metadata_operators:
        - type: regex_parser
          regex: '...'
    operators:
      - type: json_parser
      ...
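
The proposed flow can be sketched as follows. This is a rough Python illustration of the idea, not the file_input implementation: the function names are hypothetical, and the "header pipeline" is reduced to a single regex that extracts `#Key: value` pairs.

```python
import re


def split_header(lines, header_pattern):
    """Consume leading lines while they match; return (header_entry, remaining_lines)."""
    matcher = re.compile(header_pattern)
    count = 0
    for line in lines:
        if matcher.search(line) is None:
            break  # the first non-matching line ends the header
        count += 1
    # Matched lines are aggregated into a single multiline "header entry".
    return "\n".join(lines[:count]), lines[count:]


def header_pipeline(header_entry):
    """Stand-in for the header pipeline: parse '#Key: value' lines into attributes."""
    pairs = re.finditer(r"^#(?P<key>\w+): (?P<value>.*)$", header_entry, re.MULTILINE)
    return {m.group("key"): m.group("value") for m in pairs}


def read_file(lines, header_pattern):
    """Return (body, attributes) records, each carrying the file's header attributes."""
    header_entry, body_lines = split_header(lines, header_pattern)
    file_attributes = header_pipeline(header_entry)
    return [(line, dict(file_attributes)) for line in body_lines]
```

Note that once the pattern stops matching, the header is considered fully consumed: a later line that happens to look like a header is treated as an ordinary record, in keeping with the assumption that metadata lives only at the start of the file.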

BinaryFissionGames · 2023-02-15T18:05:08Z

@djaglowski Could you assign me to this issue?

BinaryFissionGames · 2023-04-25T18:31:55Z

Completed with #18921

Krishnadas-KP · 2023-07-27T04:43:48Z

@BinaryFissionGames How can we enable this feature? I want to ignore the header lines from IIS logs before exporting.

djaglowski · 2023-07-27T12:38:38Z

@Krishnadas-KP, see https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver#header-metadata-parsing

xieyuguang · 2023-08-23T08:27:13Z

@djaglowski How can a multiline header like glog's be supported?

Log file created at: 2023/08/23 10:31:46
Running on machine: MACHINE_XXX
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg

For example, I need to extract the machine name MACHINE_XXX from the header. I tried the relevant configuration, but it seems that the header only supports line-by-line matching.

djaglowski · 2023-08-23T13:47:08Z

@xieyuguang, you may be able to do this with the router operator. Roughly:

filelog:
  ...
  header:
    pattern: '^.+: .+$'
    metadata_operators:
      - type: router
        routes:
          - output: create_at_parser
            expr: 'body matches "^Log file created at: .*$"'
          - output: running_on_parser
            expr: 'body matches "^Running on machine: .*$"'
          ...
      - type: regex_parser
        id: create_at_parser
        regex: '^Log file created at: (?P<log_file_created_time>.+)$'
      - type: regex_parser
        id: running_on_parser
        regex: '^Running on machine: (?P<machine_name>.+)$'
      ...

You can read more about this type of pipeline here.
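
The routing idea can also be sketched outside the collector. In this Python illustration, each header line is handled on its own, the first matching route extracts a value, and the results are merged into one set of attributes. The `HEADER_ROUTES` table and the function name are hypothetical, and the group names use underscores because neither Go's nor Python's regex syntax accepts dots in a named group:

```python
import re

# One compiled pattern per route; header lines matching no route are ignored.
HEADER_ROUTES = [
    re.compile(r"^Log file created at: (?P<log_file_created_time>.+)$"),
    re.compile(r"^Running on machine: (?P<machine_name>.+)$"),
]


def parse_header_lines(header_lines):
    """Route each header line to the first matching pattern and merge the captures."""
    attributes = {}
    for line in header_lines:
        for pattern in HEADER_ROUTES:
            match = pattern.match(line)
            if match:
                attributes.update(match.groupdict())
                break  # a line is handled by at most one route
    return attributes
```

Applied to the glog header quoted above, this yields the creation time and `MACHINE_XXX` as attributes, while the "Running duration" and "Log line format" lines fall through without contributing anything.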

BinaryFissionGames added the enhancement and needs triage labels Jan 31, 2023

github-actions bot added the receiver/filelog label Jan 31, 2023

djaglowski assigned BinaryFissionGames Feb 15, 2023

This was referenced Feb 27, 2023

[receiver/filelog] Add support for parsing header lines as log metadata #18921

Merged

[pkg/stanza] Add header_delimiter option to the csv_parser #18929

Merged

atoulme removed the needs triage label Mar 7, 2023

BinaryFissionGames closed this as completed Apr 25, 2023
