Optimize the order of bytes in uuids for better compression. #24615
```java
id(UUIDs.base64UUID());
String uid;
if (indexCreatedVersion == null) {
    // we don't know the creation version yet, so default to random uuids, which are the only safe choice
```
Can you explain in the comment what "safe" means in this context? Like, it is the only choice that is guaranteed not to overlap or it is the only choice guaranteed to be fast-ish?
This is a part that I wanted to discuss with someone who better understands implicit index creation when the index of an index request does not exist yet. Is there a way to know what its created version will be?
The issue is that I could not change the uid generation algorithm in place as there would be a rare chance that uids would collide. So I'm looking for ways to make things work when the index is not created. Using a random uid should work, but I wonder whether we could do better.
I rephrased to mention collisions.
```java
/** These are essentially flake ids (http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang) but
 *  we use 6 (not 8) bytes for timestamp, and use 3 (not 2) bytes for sequence number. */
```
Extra space
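For readers following along, the layout that Javadoc describes can be sketched as follows. This is a hypothetical standalone class, not the actual Elasticsearch generator; names and the big-endian byte order are assumptions:

```java
import java.util.Base64;

// Hypothetical sketch of a flake-like id as described in the comment above:
// 6 bytes of timestamp, 6 bytes of (munged) MAC address, and 3 bytes of
// sequence number, for 15 bytes total, which is exactly 20 chars in base64.
public class FlakeLayoutSketch {

    static byte[] layout(long timestampMillis, byte[] macAddress, int sequenceId) {
        byte[] uuidBytes = new byte[15];
        // 6-byte big-endian timestamp: enough to not wrap for thousands of years
        for (int i = 0; i < 6; i++) {
            uuidBytes[i] = (byte) (timestampMillis >>> (8 * (5 - i)));
        }
        // 6-byte node discriminator
        System.arraycopy(macAddress, 0, uuidBytes, 6, 6);
        // 3-byte big-endian sequence number
        uuidBytes[12] = (byte) (sequenceId >>> 16);
        uuidBytes[13] = (byte) (sequenceId >>> 8);
        uuidBytes[14] = (byte) sequenceId;
        return uuidBytes;
    }

    static String base64(byte[] uuidBytes) {
        // 15 bytes is a multiple of 3, so url-safe base64 needs no padding
        return Base64.getUrlEncoder().withoutPadding().encodeToString(uuidBytes);
    }
}
```

Note that 15 bytes encode to exactly 20 base64 characters with no padding, which is why the id length is stable.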
```java
}

// protected for testing
protected long currentTimeMillis() {
    return System.currentTimeMillis();
```
Is it worth looking at `nanoTime`, which doesn't go backwards?
That would not work, as we need some absolute times. For instance, if you switch to a different JVM that uses a different origin for nanoTime, you might reuse existing ids.
What about initializing to System.currentTimeMillis() on startup and using that time + System.nanoTime() on generation? Regardless, if it is worth doing at all it can wait for a followup.
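That suggestion can be sketched roughly like this (assumed class and field names; this is not code from the PR):

```java
// Sketch of the idea floated above: capture System.currentTimeMillis() once at
// startup, then derive later timestamps from the elapsed System.nanoTime(),
// which is monotonic within a single JVM.
public class HybridClockSketch {
    private final long startMillis = System.currentTimeMillis();
    private final long startNanos = System.nanoTime();

    public long currentTimeMillis() {
        // Monotonic within this JVM, but the absolute base is only as good as
        // the wall clock read at startup; it does not solve the cross-JVM
        // origin problem mentioned above.
        return startMillis + (System.nanoTime() - startNanos) / 1_000_000;
    }
}
```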
yeah, maybe we should focus on the order of bytes here and rethink generation separately?
> yeah, maybe we should focus on the order of bytes here and rethink generation separately?

++
```java
// protected for testing
protected byte[] macAddress() {
    return SECURE_MUNGED_ADDRESS;
}
```
```java
@Override
public String getBase64UUID() {
    final int sequenceId = sequenceNumber.incrementAndGet() & 0xffffff;
```
Can you skip the masking given the way you are using the sequence id?
no, because of the `if (sequenceId == 0)` check below
I see now.
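To make the interaction explicit: only the low 3 bytes of the sequence end up in the id, so the wraparound that the zero check needs to detect happens every 2^24 increments, not every 2^32, and that is exactly what the mask provides. A minimal sketch (hypothetical class name, simplified from the discussion above):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Without the mask, incrementAndGet() would only return 0 once every 2^32
// increments; with it, sequenceId == 0 fires exactly once per 2^24 increments,
// matching the 3 bytes of sequence that are actually stored in the id.
public class SequenceMaskDemo {
    static int nextSequence(AtomicInteger sequenceNumber) {
        return sequenceNumber.incrementAndGet() & 0xffffff;
    }
}
```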
```java
}

private static double testCompression(int numDocs, int numDocsPerSecond, int numNodes, Logger logger) throws Exception {
    final double intervalBetweenDocs = 1000. / numDocsPerSecond; // milliseconds
```
`1000.0` is easier for me to read.
```java
putLong(uuidBytes, sequenceId, 12, 3);

assert 9 + SECURE_MUNGED_ADDRESS.length == uuidBytes.length;
// The order of bytes is important as Lucene uses a block tree (similar to a burst trie)
```
I love the explanations.
Indexing ids in binary form should help with indexing speed, since we would have to compare fewer bytes upon sorting; should help with memory usage of the live version map, since keys will be shorter; and might help with disk usage, depending on how efficient the terms dictionary is at compressing terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries to use an encoding that makes the binary id equal to the base64-decoded id in the majority of cases (253 out of 256). It also specializes numeric ids, since this seems to be common when content that is stored in Elasticsearch comes from another database that uses e.g. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things simpler, but I'm not sure users would welcome this requirement. This PR should bring some benefits, but I expect it to be mostly useful when coupled with something like #24615. Closes #18154
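The encoding strategy described above could look roughly like this (a hypothetical sketch with assumed helper names, not the actual implementation; the 253/256 figure concerns the real encoding's edge cases and is not reproduced here):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Rough sketch: ids that look like url-safe base64 are stored as their decoded
// bytes, purely numeric ids as a compact big-endian integer, and everything
// else falls back to plain UTF-8 bytes.
public class IdEncodingSketch {
    static byte[] encode(String id) {
        if (id.matches("[0-9]+") && id.length() <= 18) {
            // numeric ids, e.g. auto-increment keys from another database
            return new BigInteger(id).toByteArray();
        }
        try {
            return Base64.getUrlDecoder().decode(id);
        } catch (IllegalArgumentException e) {
            // not valid base64: store the raw UTF-8 bytes
            return id.getBytes(StandardCharsets.UTF_8);
        }
    }
}
```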
Force-pushed from 584cc1e to 0d1b899
@elasticmachine please test it
I left some comments. I also wonder whether, in addition to what we do today, we should maybe pass on the timestamp that is used to create the index. This is always consistent, even across node restarts, and can give us a common base? Just an idea.
```diff
@@ -290,12 +291,14 @@ protected void doRun() throws Exception {
                     case INDEX:
                         IndexRequest indexRequest = (IndexRequest) docWriteRequest;
                         MappingMetaData mappingMd = null;
+                        Version indexCreated = null;
                         final IndexMetaData indexMetaData = metaData.index(concreteIndex);
                         if (indexMetaData != null) {
```
+1. addFailureIfIndexIsUnavailable does these checks already.
thanks @bleskes
```diff
@@ -484,7 +485,7 @@ public VersionType versionType() {
     }

-    public void process(@Nullable MappingMetaData mappingMd, String concreteIndex) {
+    public void process(@Nullable Version indexCreatedVersion, @Nullable MappingMetaData mappingMd, String concreteIndex) {
```
I wonder, given my comment about the null index metadata, if this should not be nullable?
```java
final int sequenceId = sequenceNumber.incrementAndGet() & 0xffffff;
long timestamp = System.currentTimeMillis();

synchronized (this) {
```
I wonder why we don't make `lastTimestamp` an AtomicLong and use `compareAndSet`, since we don't care if the compare-and-set was successful as long as we are moving forward? Something like this:

```java
long currentTimestamp = System.currentTimeMillis();
long lastUsedTimestamp = lastTimestamp.get();
long timestamp = Math.max(currentTimestamp, lastUsedTimestamp);
if (sequenceId == 0) {
    timestamp++;
}
lastTimestamp.compareAndSet(lastUsedTimestamp, timestamp);
```
If that doesn't work and we really need to block or have consistent values, aren't we vulnerable to collisions on the same timestamp?
I'm working on this for a separate PR if you don't mind.
Thinking about it more, it would ensure that `lastUsedTimestamp` is greater than it was before, but not necessarily greater than `timestamp`. So there is a possibility of collision?
But don't we have the sequence ID for this purpose? There are 2^31 docs to be indexed before you have a collision. I mean, if System.currentTimeMillis() goes backwards, do we reuse the same timestamp all the time? I am not sure how this would work at all if we can't sustain that?
> if System.currentTimeMillis() goes backwards we reuse the same timestamp all the time

no, since we force the clock to tick forward whenever the sequence id is 0?

> there are 2^31 docs to be indexed before you have a collision

2^24 actually, we only use the 3 lower bytes of the sequence id
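For reference, the blocking variant the discussion converges on can be sketched like this (assumed names, simplified from the synchronized block quoted earlier): the generator never lets its timestamp go backwards, and when the 3-byte sequence wraps to 0 it forces a tick forward, so a (timestamp, sequence) pair can never repeat on one node.

```java
// Sketch, not the actual Elasticsearch class: a monotonic timestamp source
// that tolerates a wall clock which stalls or goes backwards.
public class MonotonicTimestampSketch {
    private long lastTimestamp;

    public synchronized long nextTimestamp(long wallClockMillis, int sequenceId) {
        // never go backwards, even if the wall clock does
        long timestamp = Math.max(lastTimestamp, wallClockMillis);
        if (sequenceId == 0) {
            // the 2^24 sequence wrapped around: force a tick so this
            // timestamp is never paired with a reused sequence value
            timestamp++;
        }
        lastTimestamp = timestamp;
        return timestamp;
    }
}
```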
Force-pushed from 4a7b675 to 9fd5ad5
I pushed more commits.

@elasticmachine please test it
LGTM. We can fix the sync issue in a followup if we want, it's an impl detail.
This reverts commit 374cc7e898d03ac0db5a06dc4e750ae936014ae9.
Force-pushed from cce0caa to 837d88e
* master:
  - Add an underscore to flood stage setting
  - Avoid failing install if system-sysctl is masked
  - Add another parent value option to join documentation (elastic#25609)
  - Ensure we rewrite common queries to `match_none` if possible (elastic#25650)
  - Remove reference to field-stats docs.
  - Optimize the order of bytes in uuids for better compression. (elastic#24615)
  - Fix BytesReferenceStreamInput#skip with offset (elastic#25634)
  - Limit the number of concurrent shard requests per search request (elastic#25632)
* master: (181 commits)
  - Use a non default port range in MockTransportService
  - Add a shard filter search phase to pre-filter shards based on query rewriting (elastic#25658)
  - Prevent excessive disk consumption by log files
  - Migrate RestHttpResponseHeadersIT to ESRestTestCase (elastic#25675)
  - Use config directory to find jvm.options
  - Fix inadvertent rename of systemd tests
  - Adding basic search request documentation for high level client (elastic#25651)
  - Disallow lang to be used with Stored Scripts (elastic#25610)
  - Fix typo in ScriptDocValues deprecation warnings (elastic#25672)
  - Changes DocValueFieldsFetchSubPhase to reuse doc values iterators for multiple hits (elastic#25644)
  - Query range fields by doc values when they are expected to be more efficient than points.
  - Remove SearchHit#internalHits (elastic#25653)
  - [DOCS] Reorganized the highlighting topic so it's less confusing.
  - Add an underscore to flood stage setting
  - Avoid failing install if system-sysctl is masked
  - Add another parent value option to join documentation (elastic#25609)
  - Ensure we rewrite common queries to `match_none` if possible (elastic#25650)
  - Remove reference to field-stats docs.
  - Optimize the order of bytes in uuids for better compression. (elastic#24615)
  - Fix BytesReferenceStreamInput#skip with offset (elastic#25634)
  - ...
Flake ids organize bytes in such a way that ids are ordered. However, we do not need that property and could reorganize bytes in an order that would better suit Lucene's terms dict instead.

Some synthetic tests suggest that this change decreases the disk footprint of the `_id` field by about 50% in many cases (see `UUIDTests.testCompression`). For instance, when simulating the indexing of 10M docs at a rate of 10k docs per second, the current uid generator used 20.2 bytes per document on average, while this new generator, which only puts bytes in a different order, uses 9.6 bytes per document on average.

We had already explored this idea in #18209 but the attempt to share long common prefixes had had a bad impact on indexing speed. This time I have been more careful about putting discriminant bytes early in the `_id` in a way that preserves indexing speed on par with today, while still allowing for better compression.
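A simplified sketch of the reordering idea (the exact byte order the PR settled on may differ; class and method names are assumptions): the id keeps the same 15 bytes of content, but leads with bytes that change on every id so terms-dict lookups can discriminate early, while the slowly-changing timestamp and MAC bytes are grouped together so consecutive ids share long common runs that compress well.

```java
// Hypothetical sketch of byte reordering for a 15-byte flake-style id:
// same content as the ordered layout, different byte order.
public class ReorderSketch {
    static byte[] reorder(long timestampMillis, byte[] macAddress, int sequenceId) {
        byte[] uuidBytes = new byte[15];
        int i = 0;
        uuidBytes[i++] = (byte) sequenceId;          // changes on every id
        uuidBytes[i++] = (byte) (sequenceId >>> 8);  // changes every 256 ids
        // slowly-changing bytes next: shared across many consecutive ids,
        // which is what the terms dictionary compresses well
        for (int shift = 40; shift >= 0; shift -= 8) {
            uuidBytes[i++] = (byte) (timestampMillis >>> shift);
        }
        System.arraycopy(macAddress, 0, uuidBytes, i, 6);
        i += 6;
        uuidBytes[i++] = (byte) (sequenceId >>> 16); // changes every 65536 ids
        return uuidBytes;
    }
}
```

With this shape, two ids generated in the same millisecond on the same node differ in their very first byte but agree on almost everything else, which is the property the PR is after.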