[SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 #7582
zhichao-li wants to merge 1 commit into apache:master from
|
LGTM |
|
cc @davies |
|
Test build #38026 has finished for PR 7582 at commit
|
|
@zhichao-li Two entries are enough for correctness. 254 and 255 are invalid, using |
|
@davies, currently if the first byte is 254 or 255, |
|
I think it's better to raise an exception than to silently parse it the wrong way. If we want better behavior, it should be done case by case for every function, which is not trivial to me. So I'd like to pick 3), and not have these two additional entries. |
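To make the "raise an exception" option concrete, here is a minimal sketch, not the actual Spark code. It assumes the lookup indexes a 62-entry table by `firstByte - 192`; the class name `LeadByteLookup` and the method `numBytes` are hypothetical names used only for illustration. Under that assumption, the invalid lead bytes 254 and 255 fall outside the table and fail loudly instead of being treated as the start of a 6-byte sequence.

```java
import java.util.Arrays;

// Hypothetical sketch: a lead-byte lookup table without entries for 0xFE/0xFF.
class LeadByteLookup {
    static final int OFFSET = 192; // 0xC0, the first 2-byte lead byte
    static final int[] TABLE = buildTable();

    static int[] buildTable() {
        int[] t = new int[62];       // lead bytes 192..253 only
        Arrays.fill(t, 0, 32, 2);    // 110xxxxx: 192..223
        Arrays.fill(t, 32, 48, 3);   // 1110xxxx: 224..239
        Arrays.fill(t, 48, 56, 4);   // 11110xxx: 240..247
        Arrays.fill(t, 56, 60, 5);   // 111110xx: 248..251
        Arrays.fill(t, 60, 62, 6);   // 1111110x: 252..253
        return t;
    }

    // Bytes below 192 are treated as single-byte here (an assumption of this sketch).
    static int numBytes(byte first) {
        int idx = (first & 0xFF) - OFFSET;
        return idx >= 0 ? TABLE[idx] : 1;
    }

    public static void main(String[] args) {
        System.out.println(numBytes((byte) 0xFD)); // prints 6
        try {
            numBytes((byte) 0xFE); // index 62 is out of range for a 62-entry table
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("0xFE rejected");
        }
    }
}
```

With two extra `6` entries appended to the table, the same call for `0xFE` would instead return 6 without any error.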
|
Yeah, that's what this PR targets. I guess it's ready to be merged? |
|
LGTM, merging this into master and 1.4! |
…intInUTF8

Only a trial thing, not sure if I understand correctly or not but I guess only 2 entries in `bytesOfCodePointInUTF8` for the case of 6 bytes codepoint (1111110x) is enough.
Details can be found from https://en.wikipedia.org/wiki/UTF-8 in "Description" section.

Author: zhichao.li <zhichao.li@intel.com>

Closes #7582 from zhichao-li/utf8 and squashes the following commits:

8bddd01 [zhichao.li] two extra entries

(cherry picked from commit 846cf46)
Signed-off-by: Davies Liu <davies.liu@gmail.com>

Conflicts:
unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Only a trial thing, not sure if I understand correctly or not but I guess only 2 entries in `bytesOfCodePointInUTF8` for the case of 6 bytes codepoint (1111110x) is enough. Details can be found from https://en.wikipedia.org/wiki/UTF-8 in "Description" section.
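The claim that two entries suffice for the `1111110x` pattern can be checked by counting lead bytes per pattern. The following is a rough sketch under the original (up to 6-byte) UTF-8 layout; the class `LeadByteCounts` and its methods are hypothetical and do not come from the Spark source.

```java
// Hypothetical sketch: count how many byte values share each leading-ones prefix.
class LeadByteCounts {
    // Number of consecutive 1 bits at the top of the byte's 8-bit pattern.
    static int leadingOnes(int b) {
        // Invert the low 8 bits, move them to the top of the int, count leading zeros.
        int n = Integer.numberOfLeadingZeros((~b & 0xFF) << 24);
        return Math.min(8, n); // 0xFF inverts to 0, which would otherwise report 32
    }

    // counts[k] = number of byte values whose pattern starts with exactly k ones.
    static int[] counts() {
        int[] counts = new int[9];
        for (int b = 0; b < 256; b++) {
            counts[leadingOnes(b)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = counts();
        // Lead bytes of 2- to 6-byte sequences: 32, 16, 8, 4 and 2 values respectively.
        for (int k = 2; k <= 6; k++) {
            System.out.println(k + "-byte lead bytes: " + c[k]);
        }
    }
}
```

The 6-byte row comes out as 2 (values 252 and 253), and the 2- to 6-byte rows sum to 62. The values 254 and 255 start with seven and eight ones, so they match none of the lead-byte patterns.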