Emit byte-safe JSON for non-UTF-8 paths and content (lone surrogates break strict parsers) #44

Description

@wkentaro

Context

Surfaced during review of #11 / PR #32 (decode git output with surrogateescape). That fix is correct and tested; this is a separate downstream gap in the --json output, not a regression in #32.

Problem

With surrogateescape decoding, a non-UTF-8 byte in a path or diff line becomes a lone surrogate in the Python str (e.g. 0xe9 -> \udce9). The --json commands serialize these with json.dumps(..., ensure_ascii=True) (the default), which emits the lone surrogate as a \udcXX escape:

$ git-hunk list --json
[{"file": "pass\udce9.txt", ...}]

This is a lone-surrogate JSON string, which downstream consumers handle inconsistently:

  • Python's own json.loads round-trips it.
  • jq (1.8.x) parses it but replaces the surrogate with U+FFFD, so the original byte is silently lost.
  • Strict parsers (Rust serde_json, older jq) reject the document outright; Go's encoding/json accepts it but also substitutes U+FFFD for the surrogate.

So list --json on a repo containing a non-UTF-8 filename or diff content either corrupts the value or breaks the consumer. Affected sinks: the three json.dumps(...) calls in git_hunk/_cli.py (list --json, and the per-command JSON outputs).
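
A minimal reproduction of the behaviour described above, using only the standard library. It does not call git-hunk; the byte string `raw` stands in for a filename as git would emit it.

```python
import json

# A Latin-1 "é" (0xE9) on its own is not valid UTF-8.
raw = b"pass\xe9.txt"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
path = raw.decode("utf-8", errors="surrogateescape")

# ensure_ascii=True (the default) writes the surrogate as a \uXXXX escape.
doc = json.dumps([{"file": path}])
print(doc)  # [{"file": "pass\udce9.txt"}]

# Python round-trips the value and can still recover the original bytes.
restored = json.loads(doc)[0]["file"]
recovered = restored.encode("utf-8", errors="surrogateescape")

# The string is not valid Unicode text, though: strict UTF-8 encoding fails.
try:
    path.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The last step is the reason consumers outside Python disagree: the escape is syntactically fine, but it names a code point that cannot be represented in UTF-8.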

Proposal

Pick a byte-safe representation for non-UTF-8 text in JSON output (maintainer's call):

  1. Replace undecodable bytes with U+FFFD before serializing (valid, but lossy).
  2. Emit the raw bytes in a separate, explicitly-encoded field (e.g. base64 / percent-escaped), keeping the human field best-effort.
  3. Document that --json assumes UTF-8 paths and fail loudly with a clear error otherwise.
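
As a rough sketch of how options 1 and 2 could look, assuming the input strings come from surrogateescape decoding (so any surrogate lies in U+DC80..U+DCFF). The helper names and the `file_b64` field are hypothetical and not part of git-hunk.

```python
import base64
import json


def has_lone_surrogate(text: str) -> bool:
    """True if text contains any surrogate code point."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def to_lossy(text: str) -> str:
    """Option 1: swap every surrogate for U+FFFD (valid, but lossy)."""
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


def to_entry(path: str) -> dict:
    """Option 2: best-effort display field plus the exact bytes in base64.

    The hypothetical "file_b64" key is only added when the path is not
    valid UTF-8, so ordinary entries keep their current shape.
    """
    entry = {"file": to_lossy(path)}
    if has_lone_surrogate(path):
        # Assumes the surrogates were produced by surrogateescape decoding;
        # other surrogates would make this encode call raise.
        raw = path.encode("utf-8", errors="surrogateescape")
        entry["file_b64"] = base64.b64encode(raw).decode("ascii")
    return entry


entry = to_entry("pass\udce9.txt")
doc = json.dumps([entry])
```

A consumer that needs the exact filename would base64-decode `file_b64`; one that only displays paths can read `file` and ignore the extra key.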

Acceptance criteria

  • list --json on a repo with a non-UTF-8 path or diff line produces output that strict JSON parsers (jq, Go, Rust) accept.
  • The chosen representation is covered by a test that asserts the output parses cleanly.
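
One possible shape for the test's assertion, sketched without knowledge of the project's test harness. `is_strict_json` is a hypothetical helper that approximates a strict parser in pure Python: it rejects any document whose decoded strings contain a lone surrogate. A real test would feed it the captured output of list --json, or additionally shell out to jq.

```python
import json


def iter_strings(value):
    """Yield every string (keys and values) in a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)
    elif isinstance(value, dict):
        for key, item in value.items():
            yield key
            yield from iter_strings(item)


def is_strict_json(doc: str) -> bool:
    """True if doc parses and no decoded string holds a lone surrogate."""
    try:
        decoded = json.loads(doc)
        for text in iter_strings(decoded):
            # Strict UTF-8 encoding raises UnicodeEncodeError on surrogates.
            text.encode("utf-8")
    except ValueError:
        # Covers both json.JSONDecodeError and UnicodeEncodeError.
        return False
    return True


current_output = '[{"file": "pass\\udce9.txt"}]'  # today's behaviour
lossy_output = '[{"file": "pass\\ufffd.txt"}]'    # option 1 style output
```

Properly paired surrogate escapes (for example an emoji written as two \u escapes) still pass, because json.loads combines them into a single code point before the check runs.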

Scope notes

Metadata

Assignees

No one assigned

    Labels

    ready-for-human (issue: Requires human implementation)
    type: bug (issue: Reporting a defect to fix)
