Skip to content

Optimize the order of bytes in uuids for better compression. #24615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 11, 2017

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented May 11, 2017

Flake ids organize bytes in such a way that ids are ordered. However, we do not
need that property and could reorganize bytes in an order that would better suit
Lucene's terms dict instead.

Some synthetic tests suggest that this change decreases the disk footprint of
the _id field by about 50% in many cases (see UUIDTests.testCompression).
For instance, when simulating the indexing of 10M docs at a rate of 10k docs
per second, the current uid generator used 20.2 bytes per document on average,
while this new generator which only puts bytes in a different order uses 9.6
bytes per document on average.

We had already explored this idea in #18209 but the attempt to share long common
prefixes had had a bad impact on indexing speed. This time I have been more
careful about putting discriminant bytes early in the _id in a way that
preserves indexing speed on par with today, while still allowing for better
compression.

id(UUIDs.base64UUID());
String uid;
if (indexCreatedVersion == null) {
// we don't know the creation version yet, so default to random uuids, which are the only safe choice
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain in the comment what "safe" means in this context? Like, it is the only choice that is guaranteed not to overlap or it is the only choice guaranteed to be fast-ish?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a part that I wanted to discuss with someone who better understands implicit index creation when the index of an index request does not exist yet. Is there a way to know what its created version will be?

The issue is that I could not change the uid generation algorithm in place as there would be a rare chance that uids would collide. So I'm looking for ways to make things work when the index is not created. Using a random uid should work, but I wonder whether we could do better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased to mention collisions.


/** These are essentially flake ids (http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang) but
* we use 6 (not 8) bytes for timestamp, and use 3 (not 2) bytes for sequence number. */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra space

}
// protected for testing
protected long currentTimeMillis() {
return System.currentTimeMillis();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth looking at nanotime which doesn't go backwards?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would not work as we need some absolute times. For instance you switch to a different JVM that uses a different origin for nanotime, you might reuse existing ids.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about initializing to System.currentTimeMillis() on startup and using that time + System.nanoTime() on generation? Regardless, if it is worth doing at all it can wait for a followup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe we should focus on the order of bytes here and rethink generation separately?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe we should focus on the order of bytes here and rethink generation separately?

++


// protected for testing
protected byte[] macAddress() {
return SECURE_MUNGED_ADDRESS;
}

@Override
public String getBase64UUID() {
final int sequenceId = sequenceNumber.incrementAndGet() & 0xffffff;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you skip the masking given the way you are using the sequence id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because of the if (sequenceId == 0) { check below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now.

}

private static double testCompression(int numDocs, int numDocsPerSecond, int numNodes, Logger logger) throws Exception {
final double intervalBetweenDocs = 1000. / numDocsPerSecond; // milliseconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1000.0 is easier for me to read.

putLong(uuidBytes, sequenceId, 12, 3);

assert 9 + SECURE_MUNGED_ADDRESS.length == uuidBytes.length;
// The order of bytes is important as Lucene uses a block tree (similar to a burst trie)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the explanations.

jpountz added a commit to jpountz/elasticsearch that referenced this pull request Jul 6, 2017
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses eg. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like elastic#24615.

Closes elastic#18154
jpountz added a commit that referenced this pull request Jul 7, 2017
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses eg. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like #24615.

Closes #18154
@jpountz jpountz force-pushed the enhancement/better_uuids branch from 584cc1e to 0d1b899 Compare July 7, 2017 15:52
@jpountz
Copy link
Contributor Author

jpountz commented Jul 7, 2017

I've been working on that PR again now that #25352 is merged. I updated it a bit so that it should improve indexing speed a bit more and help a bit less with compression (but still better than current master). @s1monw would you mind having a look?

@jpountz jpountz requested a review from s1monw July 7, 2017 15:55
@jpountz
Copy link
Contributor Author

jpountz commented Jul 10, 2017

@elasticmachine please test it

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, I also wonder in addition to what we do today should we maybe pass on the timestamp that is used to create the index. this is always consistent even across node restarts and can give us a common base? just an idea.

@@ -290,12 +291,14 @@ protected void doRun() throws Exception {
case INDEX:
IndexRequest indexRequest = (IndexRequest) docWriteRequest;
MappingMetaData mappingMd = null;
Version indexCreated = null;
final IndexMetaData indexMetaData = metaData.index(concreteIndex);
if (indexMetaData != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this can never happen. if you at what we do here we add a failure and skip this entire block when then index metadata is not available. I think this is legacy and can / must be removed. I still think @bleskes should confirm that since he knows this class best

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. addFailureIfIndexIsUnavailable does these checks already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @bleskes

@@ -484,7 +485,7 @@ public VersionType versionType() {
}


public void process(@Nullable MappingMetaData mappingMd, String concreteIndex) {
public void process(@Nullable Version indexCreatedVersion, @Nullable MappingMetaData mappingMd, String concreteIndex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, given my comment about the null index metadata if this should not be nullable?

}
// protected for testing
protected long currentTimeMillis() {
return System.currentTimeMillis();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe we should focus on the order of bytes here and rethink generation separately?

++

final int sequenceId = sequenceNumber.incrementAndGet() & 0xffffff;
long timestamp = System.currentTimeMillis();

synchronized (this) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we don't make lastTimestamp an AtomicLong and use compareAndSet since we don't care if the compare and set was successful as long as we are moving forward? something like this:

long currentTimestamp = System.currentTimeMillis();
long lastUsedTimestamp = lastTimestamp.get();
long timestamp  = Math.max(currentTimestamp, lastUsedTimestamp);
if (sequenceId == 0) {
  timestamp++;
}
lastTimestamp.compareAndSet(timestamp, lastUsedTimestamp);

if that doesn't work and we really need to block or have consistent values aren't we vulnerable in terms of collisions on the same timestamp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on this for a separate PR if you don't mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more, it would ensure that lastUsedTimestamp is greater than it was before, but not necessarily greater than timestamp. So there is a possibility of collision?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but don't we have the sequenc ID for this pupose. there are 2^31 docs to be indexed before you have a collision. I mean if System.currentTimeInMillis() goes backwards we reuse the same timestamp all the time? I am not sure how this would work at all if we can't sustain that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if System.currentTimeInMillis() goes backwards we reuse the same timestamp all the time

no, since we force the clock to tick forward whenever the sequence id is 0?

there are 2^31 docs to be indexed before you have a collision

2^24 actually, we only use the 3 lower bytes of the sequence id

@jpountz jpountz force-pushed the enhancement/better_uuids branch from 4a7b675 to 9fd5ad5 Compare July 10, 2017 09:28
@jpountz
Copy link
Contributor Author

jpountz commented Jul 10, 2017

I pushed more commits.

@jpountz
Copy link
Contributor Author

jpountz commented Jul 10, 2017

@elasticmachine please test it

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM we can fix the sync issue if we want in a followup it's and impl detail

jpountz added 8 commits July 11, 2017 17:26
Flake ids organize bytes in such a way that ids are ordered. However, we do not
need that property and could reorganize bytes in an order that would better suit
Lucene's terms dict instead.

Some synthetic tests suggest that this change decreases the disk footprint of
the `_id` field by about 50% in many cases (see `UUIDTests.testCompression`).
For instance, when simulating the indexing of 10M docs at a rate of 10k docs
per second, the current uid generator used 20.2 bytes per document on average,
while this new generator which only puts bytes in a different order uses 9.6
bytes per document on average.

We had already explored this idea in elastic#18209 but the attempt to share long common
prefixes had had a bad impact on indexing speed. This time I have been more
careful about putting discriminant bytes early in the `_id` in a way that
preserves indexing speed on par with today, while still allowing for better
compression.
This reverts commit 374cc7e898d03ac0db5a06dc4e750ae936014ae9.
@jpountz jpountz force-pushed the enhancement/better_uuids branch from cce0caa to 837d88e Compare July 11, 2017 15:28
@jpountz jpountz merged commit f9fbce8 into elastic:master Jul 11, 2017
@jpountz jpountz deleted the enhancement/better_uuids branch July 11, 2017 15:28
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 12, 2017
* master:
  Add an underscore to flood stage setting
  Avoid failing install if system-sysctl is masked
  Add another parent value option to join documentation (elastic#25609)
  Ensure we rewrite common queries to `match_none` if possible (elastic#25650)
  Remove reference to field-stats docs.
  Optimize the order of bytes in uuids for better compression. (elastic#24615)
  Fix BytesReferenceStreamInput#skip with offset (elastic#25634)
  Limit the number of concurrent shard requests per search request (elastic#25632)
jasontedor added a commit to fred84/elasticsearch that referenced this pull request Jul 12, 2017
* master: (181 commits)
  Use a non default port range in MockTransportService
  Add a shard filter search phase to pre-filter shards based on query rewriting (elastic#25658)
  Prevent excessive disk consumption by log files
  Migrate RestHttpResponseHeadersIT to ESRestTestCase (elastic#25675)
  Use config directory to find jvm.options
  Fix inadvertent rename of systemd tests
  Adding basic search request documentation for high level client (elastic#25651)
  Disallow lang to be used with Stored Scripts (elastic#25610)
  Fix typo in ScriptDocValues deprecation warnings (elastic#25672)
  Changes DocValueFieldsFetchSubPhase to reuse doc values iterators for multiple hits (elastic#25644)
  Query range fields by doc values when they are expected to be more efficient than points.
  Remove SearchHit#internalHits (elastic#25653)
  [DOCS] Reorganized the highlighting topic so it's less confusing.
  Add an underscore to flood stage setting
  Avoid failing install if system-sysctl is masked
  Add another parent value option to join documentation (elastic#25609)
  Ensure we rewrite common queries to `match_none` if possible (elastic#25650)
  Remove reference to field-stats docs.
  Optimize the order of bytes in uuids for better compression. (elastic#24615)
  Fix BytesReferenceStreamInput#skip with offset (elastic#25634)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants