Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSONFlattenerMaker: Speed up charsetFix. #16212

Merged
merged 1 commit into from
Apr 26, 2024
Merged

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Mar 27, 2024

JSON parsing has this function charsetFix that fixes up strings so they can round-trip through UTF-8 encoding without loss of fidelity. It was originally introduced to fix a bug where strings could be sorted, encoded, then decoded, and the resulting decoded strings could end up no longer in sorted order (due to character swaps during the encode operation).

The code has been in place for some time, and only applies to JSON. I am not sure if it needs to apply to other formats; it's certainly more difficult to get broken strings from other formats. It's easy in JSON because you can write a JSON string like "foo\uD900".

At any rate, this patch does not revisit whether charsetFix should be applied to all formats. It merely optimizes it for the JSON case. The function works by using CharsetEncoder.canEncode, which is a relatively slow method (just as expensive as actually encoding). This patch adds a short-circuit to skip canEncode if all chars in a string are in the basic multilingual plane (i.e. if no chars are surrogates).

Benchmarks:

master

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2645.716 ± 24.261  ns/op

patch

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2307.164 ± 36.656  ns/op

JSON parsing has this function "charsetFix" that fixes up strings
so they can round-trip through UTF-8 encoding without loss of
fidelity. It was originally introduced to fix a bug where strings
could be sorted, encoded, then decoded, and the resulting decoded
strings could end up no longer in sorted order (due to character
swaps during the encode operation).

The code has been in place for some time, and only applies to JSON.
I am not sure if it needs to apply to other formats; it's certainly
more difficult to get broken strings from other formats. It's easy
in JSON because you can write a JSON string like "foo\uD900".

At any rate, this patch does not revisit whether charsetFix should
be applied to all formats. It merely optimizes it for the JSON case.
The function works by using CharsetEncoder.canEncode, which is
a relatively slow method (just as expensive as actually encoding).
This patch adds a short-circuit to skip canEncode if all chars in
a string are in the basic multilingual plane (i.e. if no chars are
surrogates).
@gianm gianm force-pushed the jfm-charset-fix branch from c6a4c6a to e2a28a8 Compare March 27, 2024 18:09
Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the JMH benchmarks. I was worried for another O(N) iteration but I guess the overhead would be worthwhile since roundTrip is skipped.

@cryptoe cryptoe merged commit 64a6fc8 into apache:master Apr 26, 2024
85 checks passed
@cryptoe cryptoe added this to the 30.0.0 milestone Apr 26, 2024
@gianm gianm deleted the jfm-charset-fix branch May 1, 2024 08:10
adarshsanjeev pushed a commit to adarshsanjeev/druid that referenced this pull request May 6, 2024
JSON parsing has this function "charsetFix" that fixes up strings
so they can round-trip through UTF-8 encoding without loss of
fidelity. It was originally introduced to fix a bug where strings
could be sorted, encoded, then decoded, and the resulting decoded
strings could end up no longer in sorted order (due to character
swaps during the encode operation).

The code has been in place for some time, and only applies to JSON.
I am not sure if it needs to apply to other formats; it's certainly
more difficult to get broken strings from other formats. It's easy
in JSON because you can write a JSON string like "foo\uD900".

At any rate, this patch does not revisit whether charsetFix should
be applied to all formats. It merely optimizes it for the JSON case.
The function works by using CharsetEncoder.canEncode, which is
a relatively slow method (just as expensive as actually encoding).
This patch adds a short-circuit to skip canEncode if all chars in
a string are in the basic multilingual plane (i.e. if no chars are
surrogates).
LakshSingla pushed a commit that referenced this pull request May 7, 2024
JSON parsing has this function "charsetFix" that fixes up strings
so they can round-trip through UTF-8 encoding without loss of
fidelity. It was originally introduced to fix a bug where strings
could be sorted, encoded, then decoded, and the resulting decoded
strings could end up no longer in sorted order (due to character
swaps during the encode operation).

The code has been in place for some time, and only applies to JSON.
I am not sure if it needs to apply to other formats; it's certainly
more difficult to get broken strings from other formats. It's easy
in JSON because you can write a JSON string like "foo\uD900".

At any rate, this patch does not revisit whether charsetFix should
be applied to all formats. It merely optimizes it for the JSON case.
The function works by using CharsetEncoder.canEncode, which is
a relatively slow method (just as expensive as actually encoding).
This patch adds a short-circuit to skip canEncode if all chars in
a string are in the basic multilingual plane (i.e. if no chars are
surrogates).

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants