[FLINK-14346] [serialization] faster implementation of StringValue writeString and readString #10358
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community with automated checks. Last check on commit 4396fe5 (Wed Dec 04 16:10:33 UTC 2019).
Mention the bot in a comment to re-run the automated checks.
Please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks review progress through labels, applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required. The @flinkbot bot supports a set of review commands.
Yet another set of benchmarks, out of the flink-benchmarks PR (dataArtisans/flink-benchmarks#36). The only thing I notice here is a slight performance degradation on small strings (<=4 symbols), which I'll address in the next patch.
Yep, there is a regression for strings shorter than 6 characters, see this more granular benchmark:
For deserialization, allocating a small buffer is not that significant an overhead, so the fallback is not really needed:
AHeise
left a comment
Very exciting stuff! I tried to first understand what's going on and then also to find some more performance improvements. That's why there are quite a few comments, although the code is already in very good shape.
My biggest wish would be to get rid of the fallback. I hope that you can spare some more time to check whether a more accurate buflen while writing would make it any quicker. You could even mock that by setting it to 1 for ascii strings, although you could probably just plug in the small expression that I provided.
Of course, if that wish isn't going to happen one way (no time) or the other (no success), I'd be happy to approve it anyhow.
Resolved review threads (9, outdated) on flink-core/src/main/java/org/apache/flink/types/StringValue.java and one on flink-core/src/test/java/org/apache/flink/types/StringSerializationTest.java.
This is nice work, thanks a lot :-) Could you add a test case that ensures the encoding is still the same? Maybe copy the old String read/write logic and compare it with the new one for some random Strings?
@AHeise thanks for all the ideas, I've updated the PR with all the proposals applied. The main idea of this PR is to leverage CPU-level parallelism, helping the CPU to process multiple characters at once. But the problem with short strings is that there is nothing to parallelize, so the double-scanning overhead starts to kill the performance. The proposed fix is to over-allocate the buffer for short strings, skipping the exact buffer size computation. I've found a tipping point for this approach lying somewhere between 6 and 8 characters:
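The over-allocation idea for short strings can be sketched as follows. This is a hypothetical, simplified illustration rather than the PR's actual code: the class and constant names (StringEncodeSketch, SHORT_STRING_MAX) are invented, and the 8-character cutoff is only an assumption based on the 6-8 character tipping point mentioned above. The variable-length byte layout mirrors the 7-bits-per-byte scheme visible in the review diff later in this thread.

```java
import java.util.Arrays;

// Hypothetical sketch of the buffered write path with over-allocation for
// short strings. Not the PR's actual implementation.
public final class StringEncodeSketch {

    private static final int HIGH_BIT = 0x80;      // continuation marker
    private static final int HIGH_BIT14 = 0x4000;  // 2-byte limit (14 bits)
    private static final int SHORT_STRING_MAX = 8; // assumed tipping point (6-8 chars)

    /** Encodes the chars of s into a freshly allocated buffer (length prefix omitted). */
    public static byte[] encodeChars(String s) {
        int len = s.length();
        int bufLen;
        if (len <= SHORT_STRING_MAX) {
            // Short string: skip the exact-size scan, over-allocate the worst case.
            bufLen = len * 3;
        } else {
            // Long string: one scan to compute the exact encoded size.
            bufLen = 0;
            for (int i = 0; i < len; i++) {
                char c = s.charAt(i);
                bufLen += (c < HIGH_BIT) ? 1 : (c < HIGH_BIT14 ? 2 : 3);
            }
        }
        byte[] buffer = new byte[bufLen];
        int position = 0;
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c < HIGH_BIT) {                      // 1 byte: 7 bits, high bit clear
                buffer[position++] = (byte) c;
            } else if (c < HIGH_BIT14) {             // 2 bytes: 7 + 7 bits
                buffer[position++] = (byte) (c | HIGH_BIT);
                buffer[position++] = (byte) (c >>> 7);
            } else {                                 // 3 bytes: 7 + 7 + 2 bits
                buffer[position++] = (byte) (c | HIGH_BIT);
                buffer[position++] = (byte) ((c >>> 7) | HIGH_BIT);
                buffer[position++] = (byte) (c >>> 14);
            }
        }
        // Trim only if the short-string path over-allocated.
        return position == bufLen ? buffer : Arrays.copyOf(buffer, position);
    }
}
```

The point of the sketch is that the second (encoding) loop writes into a plain local byte[], so the CPU can pipeline several characters per iteration instead of waiting on one DataOutput call per byte.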
The current round of benchmarks: so for large strings the new implementation is much faster, and for short ones it's not regressing (and is even slightly faster).
@pnowojski as for large strings, this implementation is also looking quite nice:
Some thoughts for follow-up:
@StephanEwen yes, most of the difference comes from multiple single-byte read operations. The CPU cannot parallelize them, as there is a data dependency between characters. By buffering the whole string we improve the situation, so the CPU can process multiple characters at once. I've considered breaking the serialization format for strings (and even did an experiment with this approach), but there are a ton of side-effects for end-users (like the keys in RocksDB you've mentioned), and the only positive result was a slight improvement for long non-ascii strings compared to the implementation in this PR. I guess it's not worth it :) I've also added a test for validating binary compatibility of this change.
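The buffered read path can be sketched like this. Again a hedged illustration, not the PR's code: the class name is invented, and the format follows the 7-bits-per-byte encoding visible in the review diff elsewhere in this thread (the high bit of a byte marks a continuation). The caller is assumed to have already bulk-read the encoded bytes into the array, which is exactly what removes the per-byte DataInput call from the hot loop.

```java
// Hypothetical sketch of decoding from a pre-read byte[] instead of issuing
// one DataInput read per byte. Not the PR's actual implementation.
public final class StringDecodeSketch {

    /** Decodes charLen characters from the given encoded byte array. */
    public static String decodeChars(byte[] buf, int charLen) {
        char[] out = new char[charLen];
        int pos = 0;
        for (int i = 0; i < charLen; i++) {
            int c = buf[pos++] & 0xFF;
            if (c >= 0x80) {                  // high bit set: at least 2 bytes
                int c2 = buf[pos++] & 0xFF;
                if (c2 >= 0x80) {             // 3-byte character: 7 + 7 + 2 bits
                    int c3 = buf[pos++] & 0xFF;
                    c = (c & 0x7F) | ((c2 & 0x7F) << 7) | (c3 << 14);
                } else {                      // 2-byte character: 7 + 7 bits
                    c = (c & 0x7F) | (c2 << 7);
                }
            }
            out[i] = (char) c;
        }
        return new String(out);
    }
}
```

Because the loop only touches a local array, the decoding of character i+1 no longer has to wait on an I/O call issued for character i, which is the parallelism win described above.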
@shuttie Got it, makes a lot of sense to me. I was wondering at some point if it would be worthwhile to just "unsafe arraycopy" the char[] from the string to the byte[] in the memory segments or stream buffers. So basically no byte-wise logic at all in the serialization. That would increase the state size (all chars would take two bytes), but might save CPU resources. Have you ever experimented with something like that?
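For reference, the idea floated here can be sketched without Unsafe by dumping the UTF-16 code units verbatim through a ByteBuffer view: every char becomes exactly two bytes (big-endian in this sketch), trading state size for the absence of per-character branching. All names below are hypothetical.

```java
import java.nio.ByteBuffer;

// Minimal sketch of "copy the UTF-16 code units as-is", using ByteBuffer
// instead of Unsafe for portability.
public final class RawUtf16Sketch {

    /** Serializes the string as raw UTF-16 code units, 2 bytes per char. */
    public static byte[] toRawUtf16(String s) {
        byte[] out = new byte[s.length() * 2];
        ByteBuffer.wrap(out).asCharBuffer().put(s);  // bulk copy, no encoding logic
        return out;
    }

    /** Restores the string from the raw UTF-16 bytes. */
    public static String fromRawUtf16(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }
}
```

Unlike an Unsafe copy of the String's internal storage, this view-based variant is independent of the JVM's internal representation, which matters for the compact-strings point raised in the next comment.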
@StephanEwen the only issue with this approach is that the internal binary representation of Strings differs between jvm8 and jvm9+, see JEP-254 for the details (TL;DR: starting from jvm9, ascii strings have a compact 1-byte internal representation). You would then lose compatibility between JVM versions; that's why it's called Unsafe :)
(For any committer that will be merging this: please wait until the benchmarks are merged first.) @shuttie I do not understand the results that you posted here. It's in throughput mode (ops/ms, more is better), and it looks like master is outperforming the improved version by quite a lot. Did you mislabel something?
@pnowojski oops, you're right, I accidentally swapped the labels. I've updated the huge string benchmark results in the post you linked to.
…iteString and readString
…ion for short strings
…tringValue.writeString, as there is a better way of dealing with these.
AHeise
left a comment
LGTM. I have one question regarding wider unicode chars and correctness (no new measurements needed). The answer could be one additional test or just a reply.
Like I said on the benchmark PR, I will not merge this PR today, so that we can have one benchmark run on old master.
} else if (c < HIGH_BIT14) {
	buffer[position++] = (byte)(c | HIGH_BIT);
	buffer[position++] = (byte)((c >>> 7));
} else {
So I stumbled upon this piece. Since we are encoding a Java char as bytes, we will never exceed 3 bytes, right?
I think, I would be more assured if we can have a test with a 4 byte unicode. https://www.fileformat.info/info/unicode/char/1f4a3/index.htm
As per my understanding that unicode char would result in two java chars, which are treated individually in this loop, such that they result in 2 bytes each or a total of 4 bytes. Did I get it correctly?
Java characters (the char and Character types) use UTF-16 internally, meaning that they are always encoded as 16-bit numbers. But some symbols outside the Basic Multilingual Plane require a surrogate pair, i.e. two 16-bit chars, to be encoded.
So from the wire-format point of view, when you do "🌉".length() you get 2 (as there are two underlying 16-bit characters encoding this symbol), but with "🌉".codePointCount(0, "🌉".length()) you get 1 (as there is only one symbol). The string serializer does not operate on symbols and UTF code points; it operates on the underlying 16-bit numbers, so it handles this properly out of the box.
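The length() vs codePointCount() distinction can be shown directly with the U+1F4A3 character linked in the earlier comment (written as escapes so the source-file encoding does not matter):

```java
// Demonstrates that one supplementary-plane symbol is two UTF-16 code units
// but a single code point.
public final class SurrogatePairDemo {
    public static void main(String[] args) {
        String bomb = "\uD83D\uDCA3";  // U+1F4A3 PILE OF POO... no: BOMB emoji, one symbol
        System.out.println(bomb.length());                         // prints 2
        System.out.println(bomb.codePointCount(0, bomb.length())); // prints 1
    }
}
```

The serializer sees the two 16-bit code units individually, each below 0x8000, so each is encoded in at most 3 bytes, confirming the 4-bytes-total reading above.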
@AHeise I've added a trivial test for surrogate pairs serialization, validating that these cases are still handled properly.
…ng in string serializer
Force-pushed 4396fe5 to e0a4f51
Thanks for all of the work @shuttie. I will try to merge before the feature freeze for the 1.10 release, but I have to wait for a green travis build, and there is currently quite a long build queue.
…rt in StringSerializationTest
Looks like the build passed :-) BTW, I circumvent the Apache Flink Travis queue by pushing interesting PRs as branches to my personal repository and free Travis account. That works quite alright.
Thanks for the very detailed analysis and nice contribution @shuttie!
What is the purpose of the change
This PR implements a set of performance optimizations for String serialization and de-serialization. While running a set of state-heavy streaming jobs, we noticed that Flink spends quite a lot of CPU time (~30-40%) doing String encoding and decoding in two places: while transferring messages between the nodes, and while loading and writing objects into the state store.
We did a simple benchmark of String read/write operations compared to the JDK's default DataOutput.writeUTF and noted a significant performance difference between the Flink implementation and the JDK one.
The performance difference was 4x on decoding and 2x on encoding for 16-symbol ascii strings.
For larger strings performance degraded even more significantly, 7x and 4x respectively:
But JDK DataOutput.writeUTF cannot be directly used as a serializer replacement because it is incompatible with the current Flink binary format for strings. It is also unable to write long strings, as it uses a 2-byte length prefix (at most 65535 encoded bytes).
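The length limit is easy to demonstrate: writeUTF stores the modified-UTF-8 payload length as an unsigned 2-byte value, so anything over 65535 encoded bytes throws UTFDataFormatException.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Shows DataOutput.writeUTF rejecting a string whose encoding exceeds
// the 2-byte length prefix.
public final class WriteUtfLimitDemo {
    public static void main(String[] args) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        char[] big = new char[70_000];
        Arrays.fill(big, 'a');  // 70000 ascii chars -> 70000 encoded bytes > 65535
        try {
            out.writeUTF(new String(big));
            System.out.println("written");
        } catch (UTFDataFormatException e) {
            System.out.println("too long for writeUTF");  // this branch is taken
        }
    }
}
```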
But if you compare the implementations of these two algorithms, the key difference is the intermediate buffering in the JDK, compared to the iterative approach in Flink. This buffering allows HotSpot to do two important optimizations:
We made a simple POC which alters the implementation of StringValue.writeString and StringValue.readString to introduce additional buffering, which significantly improves both encoding and decoding throughput on almost all the workloads. There is also a property-based validation test suite which ensures that the old and new serialization code produce exactly the same byte sequences, and that round-trips from old to new, new to old, and new to new produce the same results. We ran this suite with random data for ~1h and found no differences.
Benchmark results
Full raw benchmark results are located here. We did a series of benchmarks of three string encoding/decoding implementations:
- the current StringValue.writeString and StringValue.readString
- the StringValue.writeString and StringValue.readString implementations from this PR
- DataInput.readUTF and DataOutput.writeUTF
For the workload generator we used these parameters:
Ascii strings
So for ascii strings:
- on 1-char strings the new implementation is a bit slower than the old one. There was a performance regression, which is fixed in the next commit.
- for short strings performance is almost the same.
Non-ascii strings
So for non-ascii chinese character strings:
Brief change log
- Replaced the StringValue.writeString and StringValue.readString methods with improved ones.
Verifying this change
This change is already covered by existing tests, such as StringSerializationTest.
Also this change added additional test cases and can be verified as follows:
Does this pull request potentially affect one of the following parts:
- The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
Documentation