We have several strategies for handling duplicate events:
- Use clean pagination logic that avoids ingesting duplicates.
- Use the `fingerprint` processor to set an `_id` value (see the sketch after this list).
- Use a `latest` transform, as we do for IOC data.
- Just tolerate duplicates.
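
As a rough illustration of the fingerprint approach, the sketch below creates an ingest pipeline that hashes a few event fields into `_id`, so re-ingesting the same event collapses onto the same document. The pipeline id, field list, and use of the 8.x Python client are assumptions for illustration, not part of any existing integration.

```python
# Sketch: derive _id from event content so duplicates deduplicate at index time.
# On a data stream (op_type create), a repeated _id is rejected as a version
# conflict instead of creating a second copy of the event.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ingest.put_pipeline(
    id="example-dedup-by-fingerprint",  # hypothetical pipeline id
    description="Derive _id from event content to deduplicate on ingest",
    processors=[
        {
            "fingerprint": {
                # Pick fields that uniquely identify an event in the source data.
                "fields": ["event.created", "event.id", "message"],
                "target_field": "_id",
                "method": "SHA-256",
                "ignore_missing": True,
            }
        }
    ],
)
```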
The nature of the data set may make one strategy preferable over another. Some relevant questions:
- Is the data set append-only or are events updated?
- What is the impact of duplicates? (e.g. do they inflate counts or cause excess alerts?)
- Do we receive information about deletions (soft deletes)?
- Do we need to expire old events?
- Do we want to retain a history of changes or just the latest state?
The transform approach has some IOC-specific support. Other uses are possible, but see elastic/kibana#134321 and elastic/kibana#137278.
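
For reference, a `latest` transform along these lines might look like the following sketch. The index names, unique key, and sync settings here are illustrative assumptions (using the 8.x Python client); the actual IOC transform configuration may differ.

```python
# Sketch: keep only the most recent document per unique key in a
# destination index, letting the source data stream hold the full history.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.transform.put_transform(
    transform_id="example-latest-events",        # hypothetical transform id
    source={"index": ["logs-example.events-*"]}, # hypothetical source indices
    dest={"index": "logs-example.events_latest"},
    latest={
        # One document per unique key; the newest (by the sort field) wins.
        "unique_key": ["event.id"],
        "sort": "@timestamp",
    },
    # Continuous mode: pick up new events as they arrive, with a small delay
    # to allow for ingest lag.
    sync={"time": {"field": "@timestamp", "delay": "60s"}},
    frequency="5m",
)
es.transform.start_transform(transform_id="example-latest-events")
```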
The First Class Data Streams Elasticsearch Changes document may be relevant.
There may be ways to improve on our current deduplication strategies, but as a first step we can document the existing strategies in the Integrations Developer Guide and recommend when each should be used.