This repository was archived by the owner on Aug 5, 2024. It is now read-only.

Patch margin index splits Unicode surrogate pairs unexpectedly #149

Open
@kkshinkai

When diff_match_patch.prototype.patch_addContext_ adds context to a patch, it shifts the start and end indices by a constant, Patch_Margin = 4. Because JavaScript's substring method indexes strings by UTF-16 code unit, an index shifted by Patch_Margin can land in the middle of a Unicode surrogate pair and split it.

Consider the following example:

import diff_match_patch from "diff-match-patch";

console.log(
  JSON.stringify(
    new diff_match_patch().patch_make("🧮 **a", "🧮 **")[0].diffs[0][1],
  )
);

The output is "\uddee **", which begins with a lone low surrogate (🧮 is encoded as "\ud83e\uddee").
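
The split can be reproduced without the library. The following is a minimal sketch, assuming the context prefix is taken with substring from Patch_Margin code units before the edit position; the constant PATCH_MARGIN and the variable names are illustrative, not the library's own code.

```javascript
// Illustrative stand-in for the library's Patch_Margin default.
const PATCH_MARGIN = 4;

// "🧮 **a" is 6 UTF-16 code units: \ud83e \uddee " " "*" "*" "a".
const sourceText = "🧮 **a";

// Position of the deleted "a", counted in code units.
const editStart = 5;

// Taking 4 code units of context starts at index 1, which is the
// low surrogate of 🧮, so the high surrogate is left behind.
const contextPrefix = sourceText.substring(editStart - PATCH_MARGIN, editStart);

console.log(JSON.stringify(contextPrefix));
```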

Calling diff_match_patch.patch_obj.prototype.toString on this patch crashes, because encodeURI throws a URIError when its argument contains a lone surrogate.

import diff_match_patch from "diff-match-patch";

const diff = new diff_match_patch();

console.log(
  JSON.stringify(
    diff.patch_toText(diff.patch_make("🧮 **a", "🧮 **")) // URIError: URI malformed
  )
);
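
The encodeURI behavior can also be checked in isolation. This small sketch uses only the built-in encodeURI and the two strings from this report; the variable names are illustrative.

```javascript
// Context string as produced above: starts with a lone low surrogate.
const brokenContext = "\uddee **";

// The same text with the surrogate pair intact.
const intactContext = "\ud83e\uddee **";

let caughtError = null;
try {
  encodeURI(brokenContext);
} catch (err) {
  caughtError = err;
}

// The lone surrogate cannot be UTF-8 encoded, so a URIError is thrown.
console.log(caughtError instanceof URIError);

// The intact pair encodes without error.
console.log(encodeURI(intactContext));
```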

A straightforward fix might be to add a check after applying Patch_Margin that ensures the resulting indices do not fall inside a surrogate pair. I can open a PR, but Patch_Margin is used in many places, and I'm unsure about the best way to make the change.
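
As a rough sketch of what such a check could look like, the hypothetical helpers below (splitsSurrogatePair, safeContextStart, and safeContextEnd are not part of diff-match-patch) widen the context by one code unit whenever the computed index would fall between the two halves of a pair. Where and how this would be wired into patch_addContext_ and the other users of Patch_Margin is left open.

```javascript
function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

// True if cutting `text` at `index` would separate a high surrogate
// from the low surrogate that follows it.
function splitsSurrogatePair(text, index) {
  return index > 0 && index < text.length &&
    isHighSurrogate(text.charCodeAt(index - 1)) &&
    isLowSurrogate(text.charCodeAt(index));
}

// Hypothetical helper: start index for the context prefix, moved one
// code unit earlier if it would otherwise split a pair.
function safeContextStart(text, start, padding) {
  let index = Math.max(0, start - padding);
  if (splitsSurrogatePair(text, index)) index -= 1;
  return index;
}

// Hypothetical helper: end index for the context suffix, moved one
// code unit later if it would otherwise split a pair.
function safeContextEnd(text, end, padding) {
  let index = Math.min(text.length, end + padding);
  if (splitsSurrogatePair(text, index)) index += 1;
  return index;
}

const sample = "🧮 **a";
const safePrefix = sample.substring(safeContextStart(sample, 5, 4), 5);
console.log(JSON.stringify(safePrefix)); // "🧮 **"
```

Widening rather than shrinking keeps at least Patch_Margin code units of context, at the cost of the context length occasionally being one larger than the padding; any code that assumes the prefix or suffix length equals the padding exactly would need to use the actual string length instead.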
