[receiver/filelog] Support detection of headers in header-based log formats (e.g. W3C) #18198

Closed
BinaryFissionGames opened this issue Jan 31, 2023 · 10 comments
Labels
enhancement New feature or request receiver/filelog

Comments

@BinaryFissionGames
Contributor

BinaryFissionGames commented Jan 31, 2023

Component(s)

receiver/filelog

Is your feature request related to a problem? Please describe.

The W3C log format defines its fields through a list of headers. This allows any agent that is aware of these headers to parse any W3C log, even if the headers change mid-way through the log file (as they could in e.g. Microsoft IIS logs).

The filelog receiver currently does not support parsing these fields and using them to parse CSV lines.

Describe the solution you'd like

Ideally, there would be some way to configure the filelog receiver to recognize and pass these headers to the CSV parser so that the log lines can be parsed based on the headers.

In Stanza, this functionality was implemented in the following PRs:

Tangentially related:

The way it worked was that the filelog receiver would save the header line and add it as an attribute to each log record read from the file.

Later in the pipeline, the CSV parser could use this attribute as dynamic headers, so each log line was parsed according to the header attribute that the filelog receiver had added.
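
The flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Stanza's code: the function names `read_with_header` and `parse_with_header` are hypothetical, and the sketch assumes that both the `#Fields:` line and the data lines are space-delimited, as in the W3C sample further down.

```python
from typing import Dict, Iterator, List, Optional, Tuple

FIELDS_PREFIX = "#Fields: "


def read_with_header(lines: List[str]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (body, header) pairs, remembering the most recent #Fields line.

    Stands in for the receiver step: the header travels with every record,
    and a later #Fields line replaces the earlier one.
    """
    header: Optional[str] = None
    for line in lines:
        if line.startswith(FIELDS_PREFIX):
            header = line[len(FIELDS_PREFIX):].strip()
            continue
        if line.startswith("#"):
            continue  # other directives such as #Software, #Version, #Date
        yield line, header


def parse_with_header(body: str, header: Optional[str]) -> Dict[str, str]:
    """Stand-in for the CSV parser step: name each value using the header."""
    if header is None:
        raise ValueError("no #Fields header seen before this line")
    names = header.split(" ")
    values = body.split(" ")
    if len(names) != len(values):
        raise ValueError("header and data line have different field counts")
    return dict(zip(names, values))
```

Because the header is attached per record rather than stored once per file, a file whose `#Fields:` line changes midway is still parsed correctly.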

Describe alternatives you've considered

I haven't thought of other solutions besides the one implemented in Stanza; I would love to hear other ideas!

Additional context

Sample W3C log line, for context:

W3C log
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2022-08-09 20:25:26
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2022-08-09 20:25:26 W3SVC1 <Server> 127.0.0.1 GET /query param1=1&parma2=2 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - - localhost 404 0 2 5029 464 83
2022-08-09 20:25:29 W3SVC1 <Server> 127.0.0.1 GET /query - 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - - localhost 404 0 2 5007 446 1
2022-08-09 20:25:32 W3SVC1 <Server> 127.0.0.1 GET / - 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - - localhost 200 0 0 927 441 1
2022-08-09 20:25:32 W3SVC1 <Server> 127.0.0.1 GET /iisstart.png - 80 - 127.0.0.1 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:103.0)+Gecko/20100101+Firefox/103.0 - http://localhost/ localhost 200 0 0 99937 374 7
@BinaryFissionGames BinaryFissionGames added enhancement New feature or request needs triage New item requiring triage labels Jan 31, 2023

@djaglowski
Member

I think this functionality should be supported in some way, but it would be best if we can justify enhancements to each operator independently. This will avoid a scenario where loosely coupled operators have overly specific dependencies on each other.


The csv_parser already supports a header_attribute setting that behaves as you've suggested above.

A header_delimiter setting is easily justifiable in my opinion, as this simply drops the assumption that header and header_attribute must use the same delimiter as the data lines. I don't see any downsides to this.
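
To illustrate why a separate header delimiter is useful, here is a small Python sketch. It is not the csv_parser's implementation: the function `parse_line` and its parameters are hypothetical, and the sketch only assumes that an unset `header_delimiter` falls back to the data delimiter.

```python
import csv
from typing import Dict, Optional


def parse_line(
    body: str,
    header: str,
    delimiter: str = ",",
    header_delimiter: Optional[str] = None,
) -> Dict[str, str]:
    """Split the header and the data line with independent delimiters."""
    if header_delimiter is None:
        # Assumed default: the header uses the same delimiter as the data.
        header_delimiter = delimiter
    names = header.split(header_delimiter)
    # csv.reader handles quoted values in the data line.
    values = next(csv.reader([body], delimiter=delimiter))
    if len(names) != len(values):
        raise ValueError("header and data line have different field counts")
    return dict(zip(names, values))
```

For example, `parse_line("2022-08-09,GET,404", "date cs-method sc-status", header_delimiter=" ")` names comma-separated values using a space-separated header.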


The changes to the file_input operator are a little more difficult to justify, but I think we can take a step back from W3C format and define a broader use case with a solution that satisfies the same requirements. The use case as I see it:

  • A file may contain a header that is different from the rest of the file. This header may contain metadata about the file that should be attached to each individual record read from the file. In this way, this information is akin to log.file.name, log.file.path, etc.
  • The header may consist of multiple lines. It may be necessary to parse these lines in order to isolate the metadata.

A perfect solution here would make minimal assumptions about the specific format of the header and introduce minimal complexity to the codebase. Still, I think it is necessary to assume that we are working with a header. In other words, I do not think we should solve for a case where metadata about the file is discovered and/or updated throughout the reading of the file.

I have some ideas for what this should look like and will post those when I have a moment to organize them.

@BinaryFissionGames
Contributor Author

Still, I think it is necessary to assume that we are working with a header. In other words, I do not think we should solve for a case where metadata about the file is discovered and/or updated throughout the reading of the file.

I think it's fair to make the assumption that metadata would be in a header (that is, in a section before any log lines begin).

@djaglowski
Member

Regarding changes to the file_input operator - we already manage per-file metadata using a FileAttributes struct, so we should not have trouble managing additional per-file metadata, once isolated.

I spoke with @BinaryFissionGames offline and identified a potential solution for isolating header metadata. I've added some additional context and suggestions:

  • Add a new section to the config, tentatively called header. This section and its associated behaviors should initially be enabled with a feature gate.
    • The presence of the header section in the config indicates that the user intends to parse metadata from a file header.
    • The header section takes inspiration from the multiline configuration, with the idea being that the user will specify a regex that matches each line in the header and fails once the header has been consumed.
    • When the operator begins reading a new file, it will evaluate the regex against each line until it fails. While the regex matches, these lines are aggregated into a single multiline entry.
      • This "header entry" will then be fed into a dedicated "header pipeline", which will parse metadata from the header entry.
      • At the end of the pipeline, the attributes of this entry are permanently associated with the file such that all records from the file will be emitted with these attributes.

Sample configuration:

receivers:
  filelog:
    include: foo*.log
    header: 
      multiline_pattern: '...'
      metadata_operators:
        - type: regex_parser
          regex: '...'
    operators:
      - type: json_parser
      ...
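
The proposed behavior can be sketched in Python as follows. This is a rough model under stated assumptions, not the receiver's implementation: `split_header`, `read_file`, and `w3c_header_pipeline` are hypothetical names, and the "header pipeline" is reduced to a single function that turns `#Key: value` lines into attributes.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple


def split_header(lines: List[str], pattern: re.Pattern) -> Tuple[str, List[str]]:
    """Consume leading lines while the pattern matches; stop at the first miss."""
    count = 0
    for line in lines:
        if not pattern.match(line):
            break
        count += 1
    # The matched lines are aggregated into one multiline "header entry".
    return "\n".join(lines[:count]), lines[count:]


def w3c_header_pipeline(header_entry: str) -> Dict[str, str]:
    """Hypothetical header pipeline: parse '#Key: value' lines into attributes."""
    return dict(re.findall(r"^#(\w+): (.*)$", header_entry, re.MULTILINE))


def read_file(
    lines: List[str],
    pattern: re.Pattern,
    header_pipeline: Callable[[str], Dict[str, str]],
) -> Iterator[Dict[str, object]]:
    """Emit every non-header line with the file-level header attributes attached."""
    header_entry, body_lines = split_header(lines, pattern)
    file_attributes = header_pipeline(header_entry) if header_entry else {}
    for line in body_lines:
        yield {"body": line, "attributes": dict(file_attributes)}
```

Note that the header is only looked for at the start of the file, which matches the assumption stated earlier in the thread that metadata is not updated while the file is being read.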

@BinaryFissionGames
Contributor Author

@djaglowski Could you assign me to this issue?

@BinaryFissionGames
Contributor Author

Completed with #18921

@Krishnadas-KP

@BinaryFissionGames How can we enable this feature? I want to drop the header lines from IIS logs before exporting.

@xieyuguang

@djaglowski How can a multiline header like glog's be supported?

Log file created at: 2023/08/23 10:31:46
Running on machine: MACHINE_XXX
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg

For example, I need to extract the machine name MACHINE_XXX from the header. I tried the relevant configuration, but it seems that the header only supports line-by-line matching.

@djaglowski
Member

@xieyuguang, you may be able to do this with the router operator. Roughly:

filelog:
  ...
  header:
    pattern: '^.+: .+$'
    metadata_operators:
      - type: router
        routes:
          - output: create_at_parser
            expr: 'body matches "^Log file created at: .*$"'
          - output: running_on_parser
            expr: 'body matches "^Running on machine: .*$"'
          ...
      - type: regex_parser
        id: create_at_parser
        regex: '^Log file created at: (?P<log.file.created_time>.+)$'
      - type: regex_parser
        id: running_on_parser
        regex: '^Running on machine: (?P<machine.name>.+)$'
      ...

You can read more about this type of pipeline here.
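
The routing idea can be modeled in Python as below. This is a simplified sketch, not the router operator itself: the `ROUTES` table and `parse_header_lines` are hypothetical, each route is reduced to a regex with one capture group plus the attribute name to store it under, and the patterns assume the header lines have no leading whitespace, as in the glog sample above.

```python
import re
from typing import Dict, List, Tuple

# Each route pairs a line pattern with the attribute its capture group fills.
ROUTES: List[Tuple[re.Pattern, str]] = [
    (re.compile(r"^Log file created at: (.+)$"), "log.file.created_time"),
    (re.compile(r"^Running on machine: (.+)$"), "machine.name"),
]


def parse_header_lines(lines: List[str]) -> Dict[str, str]:
    """Send each header line to the first route that matches it."""
    attributes: Dict[str, str] = {}
    for line in lines:
        for pattern, key in ROUTES:
            match = pattern.match(line)
            if match:
                attributes[key] = match.group(1)
                break  # first matching route wins; unmatched lines are skipped
    return attributes
```

Since every header line is handled on its own, line-by-line matching is enough to collect attributes from a header that spans several lines.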
