Patch margin index splits Unicode surrogate pairs unexpectedly #149
Description
When the function diff_match_patch.prototype.patch_addContext_ adds context to a patch, it increases or decreases the index by a constant, Patch_Margin = 4. However, since JavaScript's substring function indexes by UTF-16 code unit, an index shifted by Patch_Margin can land between the two halves of a Unicode surrogate pair and split it.
Consider the following example:
import diff_match_patch from "diff-match-patch";
console.log(
JSON.stringify(
new diff_match_patch().patch_make("🧮 **a", "🧮 **")[0].diffs[0][1],
)
);
The output is "\uddee **" (🧮 corresponds to "\ud83e\uddee").
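The arithmetic behind that output can be reproduced with plain string operations. This is a minimal sketch that does not call the library; it assumes the context window simply starts Patch_Margin code units before the deleted "a", and the names patchMargin and deletionStart are illustrative, not taken from the library source.

```javascript
// Illustrative only: mimics the index arithmetic, not the library's code.
const text = "🧮 **a";

// 🧮 occupies two UTF-16 code units, so the string has 6 units, not 5.
console.log(text.length); // 6

const patchMargin = 4; // mirrors the default Patch_Margin (assumed name)
const deletionStart = text.length - 1; // code-unit index of "a", i.e. 5

// Stepping back 4 code units from index 5 lands on index 1,
// which is the low surrogate of 🧮.
const context = text.substring(deletionStart - patchMargin, deletionStart);

console.log(JSON.stringify(context)); // "\uddee **"
console.log(context.charCodeAt(0).toString(16)); // ddee
```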
If you attempt to use diff_match_patch.patch_obj.prototype.toString on this patch, it crashes, because encodeURI throws a URIError if the URI contains a lone surrogate.
import diff_match_patch from "diff-match-patch";
const diff = new diff_match_patch();
console.log(
JSON.stringify(
diff.patch_toText(diff.patch_make("🧮 **a", "🧮 **")) // URIError: URI malformed
)
);
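The encodeURI behavior can be checked in isolation, without the library. The sketch below uses a hypothetical helper, tryEncode, that returns either the encoded string or the name of the error thrown.

```javascript
// Hypothetical helper: returns the encoded string, or the error name on failure.
function tryEncode(s) {
  try {
    return encodeURI(s);
  } catch (e) {
    return e.name;
  }
}

// A complete surrogate pair encodes as four UTF-8 bytes.
const whole = tryEncode("\ud83e\uddee **");
console.log(whole); // %F0%9F%A7%AE%20**

// The lone low surrogate left behind by the split cannot be encoded.
const broken = tryEncode("\uddee **");
console.log(broken); // URIError
```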
A straightforward solution might be to add a verification step after applying Patch_Margin to ensure that no index falls inside a surrogate pair. I can start a PR, but I've noticed that Patch_Margin is used in many places, and I'm unsure about the best way to make the changes.
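One possible shape for that verification step, as a rough sketch only: the helpers splitsSurrogatePair, safeStart, and safeEnd are hypothetical and do not exist in the library. They widen the context by one code unit when an index would otherwise cut a pair in half. Whether widening is compatible with every place Patch_Margin is used is an open question that this sketch does not answer.

```javascript
// Hypothetical helpers; none of these exist in diff-match-patch.
function isHighSurrogate(code) {
  return code >= 0xd800 && code <= 0xdbff;
}

function isLowSurrogate(code) {
  return code >= 0xdc00 && code <= 0xdfff;
}

// True if cutting `text` at `index` would separate a high/low surrogate pair.
function splitsSurrogatePair(text, index) {
  return (
    index > 0 &&
    index < text.length &&
    isHighSurrogate(text.charCodeAt(index - 1)) &&
    isLowSurrogate(text.charCodeAt(index))
  );
}

// Move a context start index back by one unit if it sits mid-pair.
function safeStart(text, index) {
  return splitsSurrogatePair(text, index) ? index - 1 : index;
}

// Move a context end index forward by one unit if it sits mid-pair.
function safeEnd(text, index) {
  return splitsSurrogatePair(text, index) ? index + 1 : index;
}

const text = "🧮 **a";
const start = safeStart(text, 5 - 4); // 1 is mid-pair, so this becomes 0
const prefix = text.substring(start, 5);

console.log(JSON.stringify(prefix)); // "🧮 **"
console.log(encodeURI(prefix)); // %F0%9F%A7%AE%20**
```

Widening rather than shrinking keeps at least Patch_Margin code units of context, at the cost of the context occasionally being one unit longer than the margin.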