@Ulincsys
Contributor

Description

  • Use errors = "backslashreplace" in decode to avoid non-utf8 characters in commit messages

The git log does not store character encoding metadata as far as I know, and there is no way to programmatically determine which encoding was used with 100% accuracy. In lieu of that, we can decode as UTF-8 and write any bytes that fail to decode directly into the string as backslash escape sequences (for example, \xf8).

As an example:

Commit 25120e32fd761df284df417b7ebfa1cb8560fba7 was encoded with windows-1252, and should read:

Added a link to Thorbjørn's article on SLF4J...

However, this exact byte sequence can also be decoded with windows-1250, windows-1251, or any number of other compatible encodings, and each one would produce a different string. For windows-1250 and windows-1251, respectively:

Added a link to Thorbjřrn's article on SLF4J...

Added a link to Thorbjшrn's article on SLF4J...

This change decodes the above commit message to:

Added a link to Thorbj\xf8rn's article on SLF4J...

This leaves the original information intact, without the need to guess which encoding to use.
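
To make the behaviour above concrete, here is a minimal sketch in Python. The helper name `decode_commit_message` is hypothetical and is not taken from the codebase; the raw bytes are reconstructed by encoding the expected message with windows-1252 rather than read from the actual commit.

```python
# Hypothetical helper for illustration; not the function changed in this PR.
def decode_commit_message(raw: bytes) -> str:
    # Decode as UTF-8; any byte that is not valid UTF-8 is kept
    # as a literal \xNN escape instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="backslashreplace")


# Assumption: rebuild the raw bytes by encoding the intended text as windows-1252.
raw = "Added a link to Thorbjørn's article on SLF4J...".encode("windows-1252")

# A strict UTF-8 decode fails on the 0xF8 byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The same bytes decode "successfully" under several single-byte encodings,
# each giving a different character for 0xF8.
as_1252 = raw.decode("windows-1252")  # Thorbjørn
as_1250 = raw.decode("windows-1250")  # Thorbjřrn
as_1251 = raw.decode("windows-1251")  # Thorbjшrn

# With backslashreplace, the byte is preserved as the four characters \xf8.
safe = decode_commit_message(raw)
print(safe)  # Added a link to Thorbj\xf8rn's article on SLF4J...
```

Valid UTF-8 input passes through unchanged, so only messages containing bytes that are not valid UTF-8 end up with escape sequences.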

This PR fixes #3165

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Ulincsys <ulincsys@gmail.com>
Member

@sgoggins sgoggins left a comment

LGTM

@sgoggins sgoggins merged commit 47ceea4 into main Jun 25, 2025
11 checks passed
@MoralCode MoralCode deleted the fix-windows-1252-decode branch July 10, 2025 23:51
Development

Successfully merging this pull request may close these issues.

analyze_commits_in_parallel error: UnicodeDecodeError

2 participants