[HUDI-6825] Use UTF_8 to encode String to byte array in all places #9634
Change Logs
This PR unifies the encoding of Java String to byte array in Hudi, especially for writing bytes to the storage, by using UTF_8 encoding only. There are places calling String.getBytes() which are fixed by this PR. String.getBytes() uses the platform's default charset and encoding scheme. Note that the default character encoding scheme on Windows is typically an ANSI code page, while the default character encoding scheme on Linux is typically UTF-8. The PR has no impact on Linux systems writing and reading Hudi tables.
All places that previously called String.getBytes() are changed to use UTF_8 encoding.
We do not intend to keep these backwards compatible on Windows, where existing behavior can break because the default charset differs.
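To illustrate the difference described above, here is a minimal sketch that compares String.getBytes() with an explicit UTF-8 encoding. The class Utf8EncodingSketch and its helper methods are hypothetical names used only for illustration; they are not taken from the Hudi code base. The only JDK calls used are String.getBytes(Charset), new String(byte[], Charset), and StandardCharsets.UTF_8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of Hudi.
class Utf8EncodingSketch {

  // Always encodes with UTF-8, independent of the platform default charset.
  static byte[] toUtf8Bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Decodes bytes that were produced with UTF-8.
  static String fromUtf8Bytes(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "café", written with a Unicode escape so the source file encoding does not matter.
    String value = "caf\u00e9";

    // Depends on the JVM's default charset, so the result can differ across platforms.
    byte[] platformBytes = value.getBytes();

    // Always five bytes: [99, 97, 102, -61, -87].
    byte[] utf8Bytes = toUtf8Bytes(value);

    System.out.println("default charset: " + Charset.defaultCharset());
    System.out.println("default bytes:   " + Arrays.toString(platformBytes));
    System.out.println("UTF-8 bytes:     " + Arrays.toString(utf8Bytes));
    System.out.println("round trip ok:   " + value.equals(fromUtf8Bytes(utf8Bytes)));
  }
}
```

On a JVM whose default charset is a single-byte code page, the "default bytes" line would show four bytes instead of five, which is the platform dependence this PR removes.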
Impact
Ensures that the encoding between Java String values and storage bytes in a Hudi table does not depend on the platform.
Risk level
low
Documentation Update
N/A
Contributor's checklist