We have several strategies for handling duplicate events:
- Use clean pagination logic that avoids ingesting duplicates.
- Use the `fingerprint` processor to set an `_id` value (see the sketch after this list).
- Use a `latest` transform, as we do for IOC data.
- Just tolerate duplicates.
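
As a rough illustration of the fingerprint approach, the sketch below creates an ingest pipeline that hashes a few event fields into `_id`, so re-ingesting the same event collapses onto the same document. The pipeline id, field list, and use of the 8.x Python client are assumptions for illustration, not part of any existing integration.

```python
# Sketch: derive _id from event content so duplicates deduplicate at index time.
# On a data stream (op_type create), a repeated _id is rejected as a version
# conflict instead of creating a second copy of the event.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ingest.put_pipeline(
    id="example-dedup-by-fingerprint",  # hypothetical pipeline id
    description="Derive _id from event content to deduplicate on ingest",
    processors=[
        {
            "fingerprint": {
                # Pick fields that uniquely identify an event in the source data.
                "fields": ["event.created", "event.id", "message"],
                "target_field": "_id",
                "method": "SHA-256",
                "ignore_missing": True,
            }
        }
    ],
)
```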
The nature of the data set may make one strategy preferable over another. Some relevant questions:
- Is the data set append-only or are events updated?
- What is the impact of duplicates? (e.g. do they inflate counts or cause excess alerts?)
- Do we receive information about deletions (soft deletes)?
- Do we need to expire old events?
- Do we want to retain a history of changes or just the latest state?
The transform approach has some IOC-specific support. Other uses are possible, but see elastic/kibana#134321 and elastic/kibana#137278.
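
For reference, a `latest` transform along these lines might look like the following sketch. The index names, unique key, and sync settings here are illustrative assumptions (using the 8.x Python client); the actual IOC transform configuration may differ.

```python
# Sketch: keep only the most recent document per unique key in a
# destination index, letting the source data stream hold the full history.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.transform.put_transform(
    transform_id="example-latest-events",        # hypothetical transform id
    source={"index": ["logs-example.events-*"]}, # hypothetical source indices
    dest={"index": "logs-example.events_latest"},
    latest={
        # One document per unique key; the newest (by the sort field) wins.
        "unique_key": ["event.id"],
        "sort": "@timestamp",
    },
    # Continuous mode: pick up new events as they arrive, with a small delay
    # to allow for ingest lag.
    sync={"time": {"field": "@timestamp", "delay": "60s"}},
    frequency="5m",
)
es.transform.start_transform(transform_id="example-latest-events")
```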
The First Class Data Streams Elasticsearch Changes document may be relevant.
There may be ways to improve on our current deduplication strategies, but as a first step we can document the existing strategies in the Integrations Developer Guide and recommend when each should be used.