Context
Surfaced during review of #11 / PR #32 (decode git output with surrogateescape). That fix is correct and tested; this is a separate downstream gap in the --json output, not a regression in #32.
Problem
With surrogateescape decoding, a non-UTF-8 byte in a path or diff line becomes a lone surrogate in the Python str (e.g. 0xe9 -> \udce9). The --json commands serialize these with json.dumps(..., ensure_ascii=True) (the default), which emits the lone surrogate as a \udcXX escape:
$ git-hunk list --json
[{"file": "pass\udce9.txt", ...}]
This is a lone-surrogate JSON string, which downstream consumers handle inconsistently:
- Python's own json.loads round-trips it.
- jq (1.8.x) parses it but replaces it with U+FFFD; the original byte is silently lost (lossy).
- Strict parsers (Go encoding/json, Rust serde_json, older jq) reject the document outright.
So list --json on a repo containing a non-UTF-8 filename or diff content either corrupts the value or breaks the consumer. Affected sinks: the three json.dumps(...) calls in git_hunk/_cli.py (list --json, and the per-command JSON outputs).
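A minimal reproduction of the mechanism, independent of git-hunk itself. The filename bytes are an illustrative assumption (a Latin-1 "é"); only standard-library calls are used.

```python
import json

# A Latin-1 filename as git might report it: 0xE9 is not valid UTF-8 here.
raw = b"pass\xe9.txt"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
path = raw.decode("utf-8", errors="surrogateescape")

# json.dumps (ensure_ascii=True by default) writes it as a \udce9 escape.
doc = json.dumps([{"file": path}])

# Python's json.loads accepts the escape and returns the same str ...
round_tripped = json.loads(doc)[0]["file"]

# ... but the value is not valid Unicode text: strict UTF-8 encoding fails.
try:
    round_tripped.encode("utf-8")
    is_valid_text = True
except UnicodeEncodeError:
    is_valid_text = False

print(doc)
print(round_tripped == path, is_valid_text)
```

The round trip succeeds only because Python tolerates lone surrogates in str; any consumer that requires valid Unicode hits the same failure as the strict encode above.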
Proposal
Pick a byte-safe representation for non-UTF-8 text in JSON output (maintainer's call):
- Replace undecodable bytes with U+FFFD before serializing (valid, but lossy).
- Emit the raw bytes in a separate, explicitly-encoded field (e.g. base64 / percent-escaped), keeping the human field best-effort.
- Document that --json assumes UTF-8 paths and fail loudly with a clear error otherwise.
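To make the trade-offs concrete, here is a rough sketch of the three options as standalone helpers. None of these names exist in git_hunk/_cli.py: has_lone_surrogate, to_lossy, to_json_entry, require_utf8, and the file_b64 field are all hypothetical, and the real field layout is the maintainer's call.

```python
import base64
import json


def has_lone_surrogate(text: str) -> bool:
    """True if text contains any surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def to_lossy(text: str) -> str:
    """Option 1: replace each surrogate-escaped byte with U+FFFD (lossy)."""
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


def to_json_entry(text: str) -> dict:
    """Option 2: best-effort display field, plus exact bytes when needed.

    "file_b64" is a hypothetical field name used only for illustration.
    """
    entry = {"file": to_lossy(text)}
    if has_lone_surrogate(text):
        # Re-encoding with surrogateescape restores the original bytes.
        raw = text.encode("utf-8", errors="surrogateescape")
        entry["file_b64"] = base64.b64encode(raw).decode("ascii")
    return entry


def require_utf8(text: str) -> str:
    """Option 3: refuse to emit JSON for text that is not valid UTF-8."""
    if has_lone_surrogate(text):
        raise ValueError(f"--json requires UTF-8 text, got {text!r}")
    return text


# Usage with an illustrative Latin-1 filename.
path = b"pass\xe9.txt".decode("utf-8", errors="surrogateescape")
print(json.dumps(to_json_entry(path)))
```

Option 2 is the only one that lets a consumer recover the original path, at the cost of a second field that every consumer has to know about.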
Acceptance criteria
- list --json on a repo with a non-UTF-8 path or diff line produces output that strict JSON parsers (jq, Go, Rust) accept.
Scope notes
- Separate from the --json schema issue: that is about the schema's shape and versioning; this is about byte-safety of the encoding. The two can be addressed together but are not the same gap.
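One possible way to check the acceptance criterion from a Python test, sketched under the assumption that "no lone surrogates anywhere in the parsed document" is a good proxy for strict-parser acceptance. It does not replace running the output through jq, Go, or Rust; find_lone_surrogates is a hypothetical helper, not existing project code.

```python
import json


def find_lone_surrogates(value, where="$"):
    """Yield locations of strings that contain surrogate code points.

    After json.loads, correctly paired surrogate escapes have already been
    combined into one code point, so any surrogate left over is a lone one.
    """
    if isinstance(value, str):
        if any(0xD800 <= ord(ch) <= 0xDFFF for ch in value):
            yield where
    elif isinstance(value, list):
        for i, item in enumerate(value):
            yield from find_lone_surrogates(item, f"{where}[{i}]")
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_lone_surrogates(key, f"{where}.<key>")
            yield from find_lone_surrogates(item, f"{where}.{key}")


# Illustrative documents: the current output shape and a U+FFFD variant.
bad = '[{"file": "pass\\udce9.txt"}]'
good = '[{"file": "pass\\ufffd.txt"}]'

bad_hits = list(find_lone_surrogates(json.loads(bad)))
good_hits = list(find_lone_surrogates(json.loads(good)))
print(bad_hits, good_hits)
```

A test could run list --json against a fixture repo with a non-UTF-8 filename and assert that the resulting list of hits is empty.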