Alternative schemes for Elasticsearch generated document IDs? #33049

I have recently run some benchmarks simulating dense nodes, and have noticed that storage of document IDs is the main driver behind heap usage for at least some types of data. To see how different types of document IDs behave, I created [a Rally challenge](https://github.com/elastic/rally-eventdata-track/blob/master/eventdata/challenges/large-shard-sizing.json#L162) based on the [rally-eventdata-track](https://github.com/elastic/rally-eventdata-track) to index ~25GB of data into single-shard indices using a few different types of document IDs.

I imported the metrics into Elasticsearch and got the following when visualizing them in Kibana:

![screen shot 2018-08-22 at 08 17 06](https://user-images.githubusercontent.com/2804455/44449182-56f83100-a5e5-11e8-8763-74290f6a327a.png)

_1. Heap usage per 120M documents by type of document ID_

![screen shot 2018-08-22 at 08 17 19](https://user-images.githubusercontent.com/2804455/44449198-61b2c600-a5e5-11e8-8e86-4b4f768ade44.png)

_2. Index size on disk for 120M documents by type of document ID_

Most of the alternative ID types tested contain a large random portion (UUID4, SHA*). It is interesting to see that the autogenerated IDs seem to take up less space on disk but use more than twice the amount of heap compared to the other types of IDs.
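For anyone unfamiliar with these ID types, here is a minimal Python sketch of how client-side IDs with a large random portion might be produced. The helper names (`uuid4_hex`, `sha1_hex`, `sha256_base64`) are hypothetical and only illustrate the general idea; this is not the code the Rally track uses, and the exact encodings used in the benchmark may differ.

```python
import base64
import hashlib
import uuid


def uuid4_hex() -> str:
    """Random UUID (version 4) as a 32-character hex string."""
    return uuid.uuid4().hex


def sha1_hex(payload: bytes) -> str:
    """SHA-1 digest of the payload as a 40-character hex string."""
    return hashlib.sha1(payload).hexdigest()


def sha256_base64(payload: bytes) -> str:
    """SHA-256 digest as URL-safe base64 with the padding stripped."""
    digest = hashlib.sha256(payload).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    # Hypothetical event payload, used only to show the resulting ID shapes.
    event = b'{"message": "example log line"}'
    for name, doc_id in [
        ("uuid4", uuid4_hex()),
        ("sha1 (hex)", sha1_hex(event)),
        ("sha256 (base64)", sha256_base64(event)),
    ]:
        print(f"{name}: {doc_id} ({len(doc_id)} chars)")
```

All three produce IDs whose characters are effectively random from the first position onward, so consecutive documents share almost no common prefix.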

As using autogenerated IDs gives a speed advantage at indexing time, I think it would be great to make the format/structure of autogenerated IDs configurable per index so the user can choose between optimizing for heap or disk usage. This does not need to support arbitrary formats, but one option that is more efficient with respect to heap usage would be useful.

