Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[filebeat][google workspace] - Added canonical sorting over fingerprint keys array to maintain key order #40055

Merged
merged 6 commits into from
Jul 2, 2024

Conversation

ShourieG
Copy link
Contributor

@ShourieG ShourieG commented Jun 28, 2024

Type of change

  • Bug

Proposed commit message

The Google Workspace module has had a long standing issue of generating duplicate entries for events. This is an attempt to address that issue by introducing a canonical order to the fingerprint keys. It is possible that over successive responses, the same event object could possibly be canonically unordered. This is hinted by issues like this where ingested events have unique fingerprints for the same event leading to the duplication of events in elasticsearch.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

User might initially get more duplicates after updating, but this will subside after sometime as the ingestion process self corrects. This is because the sorting will impact the fingerprint of events to an extent in the initial period.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 28, 2024
@ShourieG ShourieG self-assigned this Jun 28, 2024
@ShourieG ShourieG added Filebeat Filebeat Team:Service-Integrations Label for the Service Integrations team labels Jun 28, 2024
Copy link
Contributor

mergify bot commented Jun 28, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 28, 2024
@ShourieG ShourieG added needs_team Indicates that the issue/PR needs a Team:* label bugfix labels Jun 28, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 28, 2024
@ShourieG ShourieG marked this pull request as ready for review June 28, 2024 10:46
@ShourieG ShourieG requested a review from a team as a code owner June 28, 2024 10:46
Copy link
Contributor

@efd6 efd6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have doubts that this will fix the issue if it is related to #39859; in that issue we see that there are missing fields in the resulting document; there is no json.id.uniqueQualifier/event.id, json.id.applicationName/event.provider or json.id.customerId/organization.id which are all (in the first incarnation) used in the fingerprinting processor. These do all show up in the event.original, so we know that they have been received. Looking at the processors' implementations, I don't see any obvious reason why fields would be being lost.

What I do find confusing is that in the documents shown in the issue, we still see the json. … fields, but these should have been removed here (@rlevytskyi do you have local changes to your configuration that would have stopped the ingest pipeline being run?)

x-pack/filebeat/module/google_workspace/config/common.js Outdated Show resolved Hide resolved
@rlevytskyi
Copy link

do you have local changes to your configuration that would have stopped the ingest pipeline being run

Absolutely no. Events got fetched by Filebeat, then passed to Logstash, then passed to Opensearch.

@efd6
Copy link
Contributor

efd6 commented Jul 1, 2024

@rlevytskyi Thanks.

This complicates the issue and might explain the confounding behaviour.

then passed to Logstash

Copy link
Contributor

@efd6 efd6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving because I don't see this causing any harm, but I still have doubt that it will fix the issue. Please get a second opinion.

@@ -25,15 +25,16 @@ var googleWorkspace = (function () {
"json.id.applicationName",
"json.id.customerId",
];
var dynKeyArr = [];
Copy link
Member

@andrewkroh andrewkroh Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole section looks like it is working around limitations in the beat fingerprint processor. And it's not entirely overcoming those limitations (hashing arrays and objects). I think the would be reasonable to enhance the fingerprint processor to accommodate this usage (the ES fingerprint processor does this). Or move this _id processing into ingest node.

But I would consider fingerprinting on the JSON string that is contained in message if the full contents are suitable for a fingerprint (i.e. each time an event is requested the api returns the same JSON).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit worried atm to change the fingerprint processor behaviour since changing it there would impact all modules utilising the processor and could possibly cause an initial wave of duplicates all across the board since the sorting will impact the fingerprint. I'll bring it up in the team sync once to discuss this and then make the change. For now I'm merging this change(otherwise it will go past FF) and we'll see and observe if this impacts duplicate creation in any way.

@ShourieG ShourieG merged commit 76a62eb into elastic:main Jul 2, 2024
19 checks passed
@ShourieG ShourieG deleted the bugfix/google_workspace branch July 2, 2024 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants