[ML] Result document IDs are not sufficiently unique #50613

Closed
@droberts195

Description

  • To avoid duplicate results when a job does the same work multiple times (for example, after failing over from one node to another), our results documents have IDs generated from their contents. If two identical results are indexed, the second overwrites the first and we do not get a duplicate
  • The document ID cannot literally be the whole result, because document IDs are limited to 512 bytes and a result could contain more data than this. Instead, we create the ID using a formula that we thought was likely to avoid ID collisions between different results
    • In particular, since by/over/partition field values can be long, the document ID includes a combination of the hashes of these values, plus their total length. The assumption was that two different sets of values were unlikely to produce the same hash while also having the same total length
  • Unfortunately, some different sets of by/over/partition field values produce the same hash and also have the same total length. An example is:
    • by field value = L018, over field value = null, partition field value = 128
    • by field value = L017, over field value = null, partition field value = 228
    • Both produce a hash of -2073230751
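
The collision above can be reproduced with a few lines of Java. This is a minimal sketch, not the actual ID-generation code: it assumes the combined hash is computed with `Objects.hash` over the three field values (an assumption that happens to reproduce the hash quoted above), and the class and method names are hypothetical.

```java
import java.util.Objects;

public class ResultIdCollisionDemo {

    // Assumption: the combined hash is Objects.hash(by, over, partition).
    // A null value contributes 0 to the hash.
    static int valuesHash(String by, String over, String partition) {
        return Objects.hash(by, over, partition);
    }

    // Total length of the three values, treating null as length 0.
    static int totalLength(String by, String over, String partition) {
        return length(by) + length(over) + length(partition);
    }

    private static int length(String value) {
        return value == null ? 0 : value.length();
    }

    public static void main(String[] args) {
        // The two sets of field values from the example in this issue.
        System.out.println(valuesHash("L018", null, "128") + " / " + totalLength("L018", null, "128"));
        System.out.println(valuesHash("L017", null, "228") + " / " + totalLength("L017", null, "228"));
        // Both lines print: -2073230751 / 7
    }
}
```

Under this assumption the collision is easy to explain: changing the by field value from `L018` to `L017` lowers its string hash by 1, which lowers the combined hash by 31 × 31 = 961, while changing the partition field value from `128` to `228` raises its string hash by exactly 961.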

More variation is needed in the document IDs. However, before changing the ID format we need to analyze how the change itself could lead to duplicate documents. We need to consider all the places where we assume that duplicate results will overwrite one another, and how likely each case is to occur.
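
As an illustration of what "more variation" could look like, the sketch below derives a 128-bit hex string from the three field values. It is one possible approach under stated assumptions, not the change that was or should be made: the class name, the `wideValuesHash` helper, the choice of SHA-256 and the truncation to 16 bytes are all hypothetical. Each value is length-prefixed so that moving characters between fields changes the input to the digest.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WiderValuesHashDemo {

    // Hypothetical helper: digest the by/over/partition values with SHA-256
    // and return the first 16 bytes as 32 lowercase hex characters.
    static String wideValuesHash(String by, String over, String partition) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String value : new String[] {by, over, partition}) {
                if (value == null) {
                    digest.update((byte) 0); // marker: value is absent
                } else {
                    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
                    digest.update((byte) 1); // marker: value is present
                    // Length prefix, so ("ab", "c") and ("a", "bc") are encoded differently.
                    digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                    digest.update(bytes);
                }
            }
            byte[] hash = digest.digest();
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 16; i++) {
                hex.append(String.format("%02x", hash[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The two sets of values that collide in the example from this issue.
        System.out.println(wideValuesHash("L018", null, "128"));
        System.out.println(wideValuesHash("L017", null, "228"));
    }
}
```

A fixed-length string like this stays well within the 512 byte document ID limit. Note that any change to the formula means a result written under the old scheme and the same result rewritten under the new scheme would get different IDs and so would not overwrite one another, which is part of the analysis called for above.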

The problem affects at least model plot, forecast and anomaly record results. Other types of results, for example, influencer results, contain a hash of just one string. We should consider consistency in the way IDs are created across the different types.

Labels: :ml (Machine learning), >bug
