Decode multibyte UTF-8 paths in quoted diff headers (_unquote_c_path mojibakes per-byte octal escapes) #70

Description

@wkentaro

This was generated by AI during PR processing.

Context

Surfaced while finalizing PR #68 (parse git's quoted, C-escaped diff headers, issue #30). PR #68 is correct for its scope — the ASCII chars git quotes regardless of core.quotePath (tab, newline, backslash, double-quote). This is a separate, out-of-scope gap on the decode path.

Problem

_unquote_c_path in git_hunk/_hunk.py decodes each octal escape one byte at a time:

chars.append(chr(int(path[i + 1 : i + 4], 8)))

With the default core.quotePath=true, git octal-escapes every byte >= 0x80, so a multibyte UTF-8 path is emitted as a sequence of per-byte octal escapes. Example: café.txt → header "a/caf\303\251.txt". The current code produces chr(0o303) + chr(0o251) = Ã© (U+00C3 U+00A9), i.e. the two UTF-8 bytes of é (U+00E9) mis-decoded as Latin-1. The result is mojibake that does not match the real filename. The correct decode collects the escaped bytes and decodes them as a unit (UTF-8, ideally with the same surrogateescape strategy used elsewhere per #11/#32).
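
A minimal sketch of the bytes-first decode described above, not a drop-in patch. It assumes the function receives the text between the surrounding double quotes. The name unquote_c_path_bytewise, the escape table, and the regex are illustrative and not taken from git_hunk/_hunk.py. Every escape is accumulated into a bytearray, and the whole buffer is decoded once at the end with UTF-8 and surrogateescape:

```python
import re

# Single-character escapes that git's C-style quoting emits (assumed table).
_SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
    "t": 0x09, "v": 0x0B, '"': 0x22, "\\": 0x5C,
}
# Three octal digits encoding one byte (\000 .. \377).
_OCTAL = re.compile(r"[0-3][0-7]{2}")


def unquote_c_path_bytewise(path: str) -> str:
    """Hypothetical helper: decode the inside of a git C-quoted path.

    Escapes are collected as raw bytes and decoded together, so a
    multibyte UTF-8 character split across several octal escapes is
    reassembled instead of being mapped byte-by-byte to code points.
    """
    buf = bytearray()
    i = 0
    while i < len(path):
        ch = path[i]
        if ch != "\\":
            # Unescaped text: re-encode, restoring any surrogate-escaped bytes.
            buf += ch.encode("utf-8", "surrogateescape")
            i += 1
            continue
        nxt = path[i + 1 : i + 2]
        octal = _OCTAL.match(path, i + 1)
        if nxt in _SIMPLE_ESCAPES:
            buf.append(_SIMPLE_ESCAPES[nxt])
            i += 2
        elif octal:
            buf.append(int(octal.group(), 8))
            i += 4
        else:
            # Unknown or truncated escape: keep the backslash literally.
            buf.append(0x5C)
            i += 1
    # Decode once; bytes that are not valid UTF-8 become lone surrogates.
    return buf.decode("utf-8", "surrogateescape")


print(unquote_c_path_bytewise(r"a/caf\303\251.txt"))
```

With this shape the example header decodes to a/café.txt, and a byte sequence that is not valid UTF-8 survives as surrogate escapes instead of raising, which should let it round-trip back to the original bytes when the path is passed to git again.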

Repro (default git config):

$ git init && printf 'a\nb\n' > café.txt && git add . && git commit -m init
$ printf 'a\nB\n' > café.txt
$ git-hunk list --unstaged --json   # "file" comes back as "cafÃ©.txt", not "café.txt"

Impact

list reports the wrong path for any non-ASCII filename, and subsequent stage/unstage/discard keyed off that path target the wrong (nonexistent) file.

Scope notes
