Alternative schemes for Elasticsearch generated document IDs? #33049

Closed
@cdahlqvist

I have recently run some benchmarks simulating dense nodes, and have noticed that the storage of document IDs is the main driver of heap usage for at least some types of data. To see how different types of document IDs behave, I created a Rally challenge based on the rally-eventdata-track that indexes ~25GB of data into single-shard indices using a few different types of document IDs.

I imported the metrics into Elasticsearch and got the following when visualizing them in Kibana:

[Kibana screenshot, 2018-08-22 08:17:06]

1. Heap usage per 120M documents by type of document ID

[Kibana screenshot, 2018-08-22 08:17:19]

2. Index size on disk for 120M documents by type of document ID

Most of the alternative ID types tested contain a large random portion (UUID4, SHA*). It is interesting to see that the autogenerated IDs seem to take up less space on disk but use more than twice the amount of heap compared to the other types of IDs.
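For readers unfamiliar with the ID families named above, here is a minimal Python sketch of how such IDs could be produced on the client side. It is an illustration only: the exact formats and encodings used in the benchmark are not stated in this issue, so the hex encoding, the function names, and the choice of SHA-1 and SHA-256 as representatives of "SHA*" are all assumptions.

```python
import hashlib
import uuid


def uuid4_id() -> str:
    """Random (version 4) UUID as a 32-character hex string (assumed encoding)."""
    return uuid.uuid4().hex


def sha1_id(payload: bytes) -> str:
    """SHA-1 digest of the document payload as a 40-character hex string."""
    return hashlib.sha1(payload).hexdigest()


def sha256_id(payload: bytes) -> str:
    """SHA-256 digest of the document payload as a 64-character hex string."""
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    # Hypothetical event body, used only to show the shape of each ID type.
    event = b'{"message": "example event"}'
    for name, value in [
        ("uuid4", uuid4_id()),
        ("sha1", sha1_id(event)),
        ("sha256", sha256_id(event)),
    ]:
        print(f"{name:7s} {len(value):3d} chars  {value}")
```

All three produce IDs whose characters are effectively random from the first position onward, which is the property referred to above as "a large random portion".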

As using autogenerated IDs gives a speed advantage at indexing time, I think it would be great to make the format/structure of autogenerated IDs configurable per index so the user can choose between optimizing for heap or disk usage. This does not need to support arbitrary formats, but one option that is more efficient with respect to heap usage would be useful.
