Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle multiline processors #2063

Merged
merged 10 commits into from
Sep 20, 2024

Conversation

haetamoudi
Copy link
Contributor

@haetamoudi haetamoudi commented Sep 2, 2024

The LastLine attribute is used to determine the end of a processor block in YAML files.

Currently, for YAML files such as:

---
processors:
  - script:
      source: |
        def a = 1;
        def b = 2;

The script processor is incorrectly identified with:

  • FirstLine:3
  • LastLine:4
    When it should be:
  • FirstLine:3
  • LastLine:6

LastLine is used to identify the end of a processor when calculating code coverage.
Right now, it’s not accurately reflecting multiline processors, which can make code coverage look worse than it is. This can cause issues with PRs in the integrations repo and might block merges.
Screenshot 2024-08-27 at 13 56 10

To fix this, we can consider a processor ends when the next one starts.
To identify the end of the last processor:

  • Find the next node that is not a processor node (e.g: on_failure node)
  • If there is none, find the end of the pipeline

Code coverage test results

Before the changes:

Code coverage link

  • the content of the script node is not considered as part of the processor
    Screenshot 2024-09-10 at 09 57 22

  • comments and empty lines inside the processors node are marked as uncovered
    Screenshot 2024-09-10 at 09 58 28

After the changes:

Code coverage link

  • the content of the script node is considered part of the processor
Screenshot 2024-09-10 at 09 57 31
  • comments and empty lines inside the processors node are considered part of the processors
Screenshot 2024-09-10 at 09 58 41

Fixes #2111

@haetamoudi haetamoudi added bug Something isn't working Team:Ecosystem Label for the Packages Ecosystem team labels Sep 2, 2024
@haetamoudi
Copy link
Contributor Author

test integrations

@elastic-vault-github-plugin-prod

Created or updated PR in integrations repository to test this version. Check elastic/integrations#10964

@haetamoudi haetamoudi marked this pull request as ready for review September 2, 2024 18:28
@haetamoudi haetamoudi requested a review from a team September 2, 2024 18:28
@haetamoudi haetamoudi self-assigned this Sep 2, 2024
Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution!

In the build in integrations there is a failure that could be related: https://buildkite.com/elastic/integrations/builds/15324#0191b3b5-8906-4c82-ae84-994684452522/1582-1875

ERROR: Caused by: Line 149 is out of range in the file packages/teleport/data_stream/audit/elasticsearch/ingest_pipeline/default.yml (lines: 148)

And, is this change intended to fix #2045? If it is I think we also need to ignore empty and commented lines as mentioned in #2045 (comment).

internal/elasticsearch/ingest/processors.go Outdated Show resolved Hide resolved
internal/elasticsearch/ingest/processors.go Outdated Show resolved Hide resolved
haetamoudi and others added 2 commits September 3, 2024 10:21
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
@jsoriano
Copy link
Member

jsoriano commented Sep 4, 2024

test integrations

@elastic-vault-github-plugin-prod

Created or updated PR in integrations repository to test this version. Check elastic/integrations#11003

@haetamoudi
Copy link
Contributor Author

one XML is generated with additional LineToCover and it's causing the issue why the integration tests

@jsoriano
Copy link
Member

jsoriano commented Sep 4, 2024

one XML is generated with additional LineToCover and it's causing the issue why the integration tests

This could be related to this change, right?

@haetamoudi
Copy link
Contributor Author

haetamoudi commented Sep 4, 2024

one XML is generated with additional LineToCover and it's causing the issue why the integration tests

This could be related to this change, right?

Probably as the failing XML is one created by the pipeline (the one I touched), I am running the integration tests on a fresh branch from main to be sure.

@haetamoudi
Copy link
Contributor Author

I think I found the issue. The function is counting multilines here because of the \n. I'll fix that and add some tests

Comment on lines 229 to 282
processors:
- script:
description: Drops null/empty values recursively.
tag: script_drop_null_empty_values
lang: painless
source: "def a = b \n; def b = 2; \n"
`),
expected: []Processor{
{Type: "script", FirstLine: 3, LastLine: 7},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating this test example with two processors, I think it does not report the expected lines in the coverage.

For instance if this test is modified to include two processors whose definition is multiline like this

			content: []byte(`---
processors:
  - script:
      description: Drops null/empty values recursively.
      tag: script_drop_null_empty_values
      lang: painless
      source: "def a = b \n; def b = 2; \n"
  - script:
      lang: painless
      source: "def a = b \n; def b = 2; \n def c = 4; \n def d = 5; \n"
  - set:
      field: "foo"
      value: "bar"
      
      
`),

This gives this result:

[]ingest.Processor{ingest.Processor{Type:"script", FirstLine:3, LastLine:9}, ingest.Processor{Type:"script", FirstLine:8, LastLine:13}, ingest.Processor{Type:"set", FirstLine:11, LastLine:13}}

But according to that:

  • the second processors starts (line 8) before the first processor finishes (line 9)
  • the second processor finishes in the same line as the third processor.

Should it be the result this ?

[]ingest.Processor{ingest.Processor{Type:"script", FirstLine:3, LastLine:7}, ingest.Processor{Type:"script", FirstLine:8, LastLine:10}, ingest.Processor{Type:"script", FirstLine:11, LastLine:13}}


I think to keep the coverage reports reporting the right lines, if that scenario happens (a processors is defined as oneline with  some `\n` in it), that should be reported as just one line. Since in the YAML file, it is in the same line too. If that is not the case, probably the coverage reports would set incorrect lines as covered or not covered.


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to revise the logic for determining the end of a processor. Simply counting line breaks doesn't work well, as in cases like:

- grok:
    tag: Extract header
    field: message
    trim_value: " \n\n"

will count, for example, 5 lines instead of 4.

One other option is to consider that a processor ends the line before the next one starts. However, I am running into issues when setting the LastLine number for the last processor. I can't just set it to the end of the file because there could be other elements afterward, like:

- rename:
    field: source.as.organization_name
    target_field: source.as.organization.name
    ignore_missing: true
on_failure:
- set:
    field: error.message
    value: '{{ _ingest.on_failure_message }}'

Copy link
Contributor Author

@haetamoudi haetamoudi Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is, when using yaml.Unmarshal, the YAML parser will interpret
"a = 1\nb = 2"
as

a = 1
b = 2

So I can't really base the logic on lines breaks

@haetamoudi
Copy link
Contributor Author

test integrations

@elastic-vault-github-plugin-prod

Created or updated PR in integrations repository to test this version. Check elastic/integrations#11032

@haetamoudi
Copy link
Contributor Author

test integrations

@elastic-vault-github-plugin-prod

Created or updated PR in integrations repository to test this version. Check elastic/integrations#11032

@haetamoudi
Copy link
Contributor Author

Finally got a successful build on the integrations repo: https://buildkite.com/elastic/integrations/builds/15698

@haetamoudi haetamoudi marked this pull request as ready for review September 10, 2024 10:41
Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking to the changes here, I wonder if we should do #2045 first, so empty and commented lines are not included in coverage.

Co-authored-by: Mario Rodriguez Molins <marrodmo@gmail.com>
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @haetamoudi

Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @mrodm please review if your last comments have been addressed.

Copy link
Contributor

@mrodm mrodm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !
Thanks @haetamoudi !

@haetamoudi haetamoudi merged commit ee40f69 into elastic:main Sep 20, 2024
3 checks passed
@haetamoudi haetamoudi deleted the handle-multiline-processor branch September 20, 2024 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Ecosystem Label for the Packages Ecosystem team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Test Coverage] Multiline YAML strings not accounted for in pipelines
4 participants