Description
I recently ran some benchmarks simulating dense nodes and noticed that the storage of document IDs is the main driver of heap usage for at least some types of data. To see how different types of document IDs behave, I created a Rally challenge based on the rally-eventdata-track that indexes ~25GB of data into single-shard indices using a few different types of document IDs.
I imported the metrics into Elasticsearch and got the following when visualizing them in Kibana:
1. Heap usage per 120M documents by type of document ID
2. Index size on disk for 120M documents by type of document ID
Most of the alternative ID types tested contain a large random portion (UUID4, SHA*). It is interesting to see that the autogenerated IDs seem to take up less space on disk but use more than twice as much heap as the other types of IDs.
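For anyone who wants to try something similar, here is a minimal Python sketch of how IDs with a large random portion could be generated on the client side. This is not the code from the Rally challenge: the function names and the choice of hex encoding are assumptions made for illustration only.

```python
import hashlib
import uuid


def uuid4_id() -> str:
    """Return a random UUID4 in its canonical 36-character string form."""
    return str(uuid.uuid4())


def sha_id(payload: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of the payload using the given hashlib algorithm.

    Assumption: the ID is derived by hashing the raw document bytes and is
    hex-encoded; the benchmark may have used a different input or encoding.
    """
    return hashlib.new(algorithm, payload).hexdigest()


if __name__ == "__main__":
    doc = b'{"message": "example event"}'
    print("uuid4 :", uuid4_id())
    print("sha1  :", sha_id(doc, "sha1"))
    print("sha256:", sha_id(doc, "sha256"))
```

The string lengths differ considerably between these types (36 characters for UUID4, 40 for hex SHA-1, 64 for hex SHA-256), which is worth keeping in mind when comparing the results.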
As using autogenerated IDs gives a speed advantage at indexing time, I think it would be great to make the format/structure of autogenerated IDs configurable per index so the user can choose between optimizing for heap or disk usage. This does not need to support arbitrary formats, but one option that is more efficient with respect to heap usage would be useful.