JSONFlattenerMaker: Speed up charsetFix. #16212

gianm · 2024-03-27T18:08:49Z

JSON parsing has this function charsetFix that fixes up strings so they can round-trip through UTF-8 encoding without loss of fidelity. It was originally introduced to fix a bug where strings could be sorted, encoded, then decoded, and the resulting decoded strings could end up no longer in sorted order (due to character swaps during the encode operation).

The code has been in place for some time, and only applies to JSON. I am not sure if it needs to apply to other formats; it's certainly more difficult to get broken strings from other formats. It's easy in JSON because you can write a JSON string like "foo\uD900".
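To make the failure mode concrete, here is a minimal sketch, not taken from the patch. The class `RoundTripDemo` and its helpers are hypothetical names, and plain `String.compareTo` stands in for whatever ordering the real code uses. It shows a string with a lone surrogate changing when it goes through UTF-8 and back, which breaks an order that held before encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these names come from the patch.
final class RoundTripDemo {
    // Encode to UTF-8 bytes and decode again. String.getBytes(Charset)
    // replaces input it cannot encode (such as a lone surrogate) with the
    // charset's replacement bytes, so the result may differ from the input.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    static List<String> roundTripAll(List<String> xs) {
        List<String> out = new ArrayList<>();
        for (String x : xs) {
            out.add(roundTrip(x));
        }
        return out;
    }

    // True if the list is in non-decreasing String.compareTo order.
    static boolean isSorted(List<String> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i - 1).compareTo(xs.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The second string ends in an unpaired high surrogate.
        List<String> sorted = List.of("fooa", "foo\uD900");
        System.out.println("sorted before: " + isSorted(sorted));
        System.out.println("sorted after:  " + isSorted(roundTripAll(sorted)));
    }
}
```

The lone surrogate sorts after `a` as a UTF-16 code unit, but its replacement after the round trip sorts before `a`, so the list is no longer in order.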

At any rate, this patch does not revisit whether charsetFix should be applied to all formats. It merely optimizes it for the JSON case. The function works by using CharsetEncoder.canEncode, which is a relatively slow method (just as expensive as actually encoding). This patch adds a short-circuit that skips canEncode when a string contains no surrogate chars (i.e. every char is an ordinary basic multilingual plane character), since such strings can always be encoded as UTF-8.
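The short-circuit can be sketched roughly as follows. This is an illustration under assumptions, not the code in the patch: the class `CharsetFixSketch`, the helper `containsSurrogate`, and the choice to create a fresh encoder per call are all hypothetical. Only `CharsetEncoder.canEncode` and the surrogate check come from the description above.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and structure are assumptions, not the patch.
final class CharsetFixSketch {
    // Cheap O(N) scan: true if any char is a UTF-16 surrogate.
    static boolean containsSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    static String charsetFix(String s) {
        // Fast path: a string without surrogates is always encodable as
        // UTF-8, so the expensive canEncode call can be skipped.
        if (s == null || !containsSurrogate(s)) {
            return s;
        }
        // Slow path: surrogates are present, so ask the encoder whether they
        // are properly paired. A new encoder is created per call here because
        // CharsetEncoder instances are not safe for concurrent use.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        if (encoder.canEncode(s)) {
            return s;
        }
        // Round-trip through UTF-8 so the string matches what a later
        // encode and decode would produce.
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }
}
```

Strings with valid surrogate pairs still take the slow path but come back unchanged; only strings with unpaired surrogates are rewritten.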

Benchmarks:

master

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2645.716 ± 24.261  ns/op

patch

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2307.164 ± 36.656  ns/op


cryptoe

Thanks for the JMH benchmarks. I was worried about adding another O(N) iteration over each string, but I guess the overhead is worthwhile since roundTrip is skipped.


Co-authored-by: Gian Merlino <gianmerlino@gmail.com>

gianm force-pushed the jfm-charset-fix branch from c6a4c6a to e2a28a8 Compare March 27, 2024 18:09

pranavbhole approved these changes Apr 25, 2024


cryptoe approved these changes Apr 26, 2024


cryptoe merged commit 64a6fc8 into apache:master Apr 26, 2024
85 checks passed

cryptoe added this to the 30.0.0 milestone Apr 26, 2024

gianm deleted the jfm-charset-fix branch May 1, 2024 08:10

adarshsanjeev mentioned this pull request May 6, 2024

[Backport] JSONFlattenerMaker: Speed up charsetFix. #16400

Merged

adarshsanjeev mentioned this pull request May 28, 2024

[DRAFT] 30.0.0 release notes #16505

Closed
