Encode non-utf8 chars as bytes in analyze_commit #3196
Merged
Description
Use errors="backslashreplace" in decode to avoid non-UTF-8 characters in commit messages.
The git log does not store character encoding metadata afaik, and there is no way to programmatically determine which encoding was used with 100% accuracy. In lieu of that, we can encode the raw bytes directly into the decoded string as escaped byte sequences.
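A minimal sketch of the idea, not the actual code in analyze_commit: the helper name `decode_commit_message` and the sample bytes are hypothetical, and only the standard `bytes.decode` call with `errors="backslashreplace"` is taken from the description above.

```python
def decode_commit_message(raw: bytes) -> str:
    """Decode raw commit message bytes as UTF-8 (hypothetical helper).

    Bytes that are not valid UTF-8 are kept as literal backslash escapes
    such as \\xe9 instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="backslashreplace")


# Hypothetical message containing one byte (0xE9) that is not valid UTF-8 here.
legacy = b"caf\xe9 au lait"
print(decode_commit_message(legacy))

# Valid UTF-8 input is decoded normally and left unchanged.
print(decode_commit_message("café au lait".encode("utf-8")))
```

With the default errors="strict", the first call would raise UnicodeDecodeError instead of returning a string.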
As an example:
Commit 25120e32fd761df284df417b7ebfa1cb8560fba7 was encoded with windows-1252, and should read:
However, this exact byte sequence can also be decoded with windows-1250 or windows-1251, or any number of other compatible encodings, and each one would lead to a different UTF-8 encoding, respectively:
This change decodes the above commit message to:
Which leaves the original information intact, without the need to guess at which encoding to use.
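The ambiguity can be reproduced with a short sketch. The byte sequence below is a made-up example, not the contents of the commit referenced above; `cp1250`, `cp1251`, and `cp1252` are Python's codec names for the corresponding Windows code pages.

```python
# Hypothetical raw bytes; 0xF1 maps to a different character in each code page.
raw = b"ma\xf1ana"

# Every one of these decodes succeeds, so none of them can be ruled out
# by inspecting the bytes alone.
decodings = {codec: raw.decode(codec) for codec in ("cp1250", "cp1251", "cp1252")}
for codec, text in decodings.items():
    print(codec, text)

# Escaping the undecodable byte avoids choosing between the candidates.
escaped = raw.decode("utf-8", errors="backslashreplace")
print(escaped)
```

Each code page yields a different string for the same bytes, while the escaped form keeps the byte value visible without committing to any of them.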
This PR fixes #3165
Signed commits