[Docs] Discuss deduplication strategies in the Integrations Developer Guide #11266

Open
@chrisberkhout

Description

We have several strategies for handling duplicate events:

  • Use clean pagination logic that avoids ingesting duplicates.
  • Use the fingerprint processor to set an _id value.
  • Use a latest transform, as we do for IOC data.
  • Just tolerate duplicates.
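
As an illustration of the second strategy, here is a minimal Python sketch of the idea behind fingerprint-based deduplication: derive a deterministic ID from the fields that identify an event, so that a re-fetched copy maps to the same ID. It does not reproduce the fingerprint processor's actual hashing or configuration. The helper names (`fingerprint_id`, `deduplicate`) and the event fields (`event_id`, `source`, `message`) are hypothetical and chosen only for the example.

```python
import hashlib
import json


def fingerprint_id(event: dict, fields: list[str]) -> str:
    """Derive a deterministic ID from selected fields of an event.

    Hypothetical helper: it illustrates the idea only and is not the
    hashing scheme used by the Elasticsearch fingerprint processor.
    """
    # Serialize only the identifying fields, with sorted keys, so the same
    # logical event always produces the same bytes.
    subset = {name: event.get(name) for name in fields}
    payload = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(events: list[dict], fields: list[str]) -> dict[str, dict]:
    """Index events by fingerprint; the first copy seen is kept."""
    store: dict[str, dict] = {}
    for event in events:
        store.setdefault(fingerprint_id(event, fields), event)
    return store


# Hypothetical sample data: the second event is a re-fetched duplicate.
events = [
    {"event_id": "a1", "source": "api", "message": "login"},
    {"event_id": "a1", "source": "api", "message": "login"},
    {"event_id": "b2", "source": "api", "message": "logout"},
]
unique = deduplicate(events, ["event_id", "source"])
print(len(unique))  # prints 2
```

The choice of fields matters: if a field that changes between fetches (such as an ingestion timestamp) is included, duplicates get different IDs and are no longer caught.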

The nature of the data set may make a certain strategy preferable. Some relevant questions:

  • Is the data set append-only or are events updated?
  • What is the impact of duplicates? (e.g. do they inflate counts or cause excess alerts?)
  • Do we receive information about deletions (soft deletes)?
  • Do we need to expire old events?
  • Do we want to retain a history of changes or just the latest state?

The transform approach has some IOC-specific support. Other uses are possible, but see elastic/kibana#134321 and elastic/kibana#137278.
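
To make the "latest" approach concrete, the following Python sketch shows the semantics such a transform aims for: one document per unique key, namely the one with the greatest value in a sort field. It is a toy in-memory model, not the transform implementation. The parameter names `unique_key` and `sort_field` are loosely modelled on the latest transform's configuration, and the document fields (`indicator`, `provider`, `@timestamp`, `confidence`) are hypothetical.

```python
def latest_by_key(docs: list[dict], unique_key: list[str], sort_field: str) -> dict[tuple, dict]:
    """Keep only the most recent document for each unique key.

    Toy model of "latest state" semantics; the names here are
    assumptions made for this example.
    """
    latest: dict[tuple, dict] = {}
    for doc in docs:
        key = tuple(doc[name] for name in unique_key)
        current = latest.get(key)
        # ISO 8601 UTC timestamps in the same format compare correctly as strings.
        if current is None or doc[sort_field] >= current[sort_field]:
            latest[key] = doc
    return latest


# Hypothetical sample data: the same indicator is reported twice with an update.
docs = [
    {"indicator": "198.51.100.7", "provider": "feed-a",
     "@timestamp": "2024-01-01T00:00:00Z", "confidence": "low"},
    {"indicator": "198.51.100.7", "provider": "feed-a",
     "@timestamp": "2024-01-02T00:00:00Z", "confidence": "high"},
    {"indicator": "203.0.113.9", "provider": "feed-a",
     "@timestamp": "2024-01-01T12:00:00Z", "confidence": "medium"},
]
state = latest_by_key(docs, ["indicator", "provider"], "@timestamp")
print(len(state))  # prints 2
print(state[("198.51.100.7", "feed-a")]["confidence"])  # prints high
```

This model keeps only the latest state, so the history of changes is discarded; whether that is acceptable depends on the questions listed above.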

The First Class Data Streams Elasticsearch Changes document may be relevant.

There may be ways to improve on our current deduplication strategies, but as a first step we can describe the existing strategies in the Integrations Developer Guide and recommend when each should be used.

Labels: documentation (Improvements or additions to documentation)
