[The suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode) for doing line-level diffing is the following set of steps:
1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)`
2. `diffs = DiffMain(ti1, ti2)`
3. `DiffCharsToLines(diffs, linesIdx)`
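
To make steps 1 and 3 concrete, here is a minimal, self-contained sketch of the idea behind them. `linesToRunes` and `runesToLines` are hypothetical helpers written for illustration only; they are not the library's `DiffLinesToChars` and `DiffCharsToLines`, and the sketch assumes few enough distinct lines that no index reaches `0xD800`.

```go
package main

import (
	"fmt"
	"strings"
)

// linesToRunes is a hypothetical stand-in for step 1. It replaces every
// distinct line of a and b with a single rune and returns both encoded
// texts plus the table needed to decode them.
// Assumption: fewer than 0xD800 distinct lines, so every index is a valid rune.
func linesToRunes(a, b string) (string, string, []string) {
	table := []string{""} // index 0 is reserved so no line maps to U+0000
	seen := map[string]rune{}
	encode := func(text string) string {
		var out []rune
		for _, line := range strings.SplitAfter(text, "\n") {
			if line == "" {
				continue
			}
			r, ok := seen[line]
			if !ok {
				r = rune(len(table))
				seen[line] = r
				table = append(table, line)
			}
			out = append(out, r)
		}
		return string(out)
	}
	ea := encode(a)
	eb := encode(b)
	return ea, eb, table
}

// runesToLines is a hypothetical stand-in for step 3, applied to one text.
func runesToLines(encoded string, table []string) string {
	var sb strings.Builder
	for _, r := range encoded {
		sb.WriteString(table[r])
	}
	return sb.String()
}

func main() {
	t1 := "alpha\nbeta\ngamma\n"
	t2 := "alpha\ngamma\ndelta\n"
	e1, e2, table := linesToRunes(t1, t2)
	fmt.Printf("%q %q\n", e1, e2)       // one rune per line
	fmt.Print(runesToLines(e2, table)) // decodes back to t2
}
```

A character-level diff run on the two encoded strings (step 2) then effectively compares whole lines, because each line occupies exactly one symbol.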
The original implementation in `google/diff-match-patch` stores each line
index in `ti1` and `ti2` as a Unicode codepoint, joined by an empty string.
The current implementation in this repo stores them as integers joined by a
comma. While this makes `ti1` and `ti2` more readable, it introduces bugs
when the result is passed to `DiffMain` for line-level diffing. The root
cause is that an integer line index can span more than one character/rune,
so `DiffMain` can decide that two different line indices sharing a prefix
match partially. For example, indices 123 and 129 have the partial match
`12`. In that case the diff will show lines 3 and 9, which is not correct.
A simple failing test case demonstrating this issue is available at
`TestDiffPartialLineIndex`.
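
The partial match can be reproduced without the diff library. The sketch below uses hypothetical helpers (`joinWithCommas`, `joinAsRunes`, `commonPrefixLen`) that are not part of this repo; it only compares how far a character-level common prefix reaches under the two encodings.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// commonPrefixLen returns the number of leading runes shared by a and b.
func commonPrefixLen(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return n
}

// joinWithCommas encodes line indices as decimal numbers separated by commas.
func joinWithCommas(indices []int) string {
	parts := make([]string, len(indices))
	for i, idx := range indices {
		parts[i] = strconv.Itoa(idx)
	}
	return strings.Join(parts, ",")
}

// joinAsRunes encodes each line index as a single rune.
// Assumption: every index is a valid codepoint outside U+D800..U+DFFF.
func joinAsRunes(indices []int) string {
	runes := make([]rune, len(indices))
	for i, idx := range indices {
		runes[i] = rune(idx)
	}
	return string(runes)
}

func main() {
	a, b := []int{7, 123}, []int{7, 129}

	// "7,123" and "7,129" share "7,12": the prefix cuts through an index.
	fmt.Println(commonPrefixLen(joinWithCommas(a), joinWithCommas(b))) // 4

	// With one rune per index, only the first (identical) index matches.
	fmt.Println(commonPrefixLen(joinAsRunes(a), joinAsRunes(b))) // 1
}
```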
In this PR I am adjusting the algorithm to use the same approach as in
[diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510) by storing each line index as a rune.
While a rune in Go is a type alias for `int32`, not every `int32` value
is a valid rune. During string to rune slice conversion, invalid runes are
replaced with `utf8.RuneError`.
The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding
The first 127 lines are the fastest to handle because each is represented
as a single byte. Higher indices are represented as 2-4 bytes.
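
The byte widths can be checked with `utf8.RuneLen` from the standard library. `encodedLen` below is a hypothetical helper used only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes a line index occupies once it is
// stored as a rune and encoded as UTF-8 (-1 if the value is not a valid rune).
func encodedLen(index int) int {
	return utf8.RuneLen(rune(index))
}

func main() {
	for _, idx := range []int{127, 128, 2047, 2048, 65535, 65536} {
		fmt.Printf("index %d -> %d byte(s)\n", idx, encodedLen(idx))
	}
}
```

The boundaries fall at 128, 2048 and 65536, matching the ranges in the UTF-8 encoding table.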
In addition, the range `U+D800 - U+DFFF` contains
[invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling),
so all codepoints greater than or equal to `0xD800` are incremented by
`0xDFFF - 0xD800`.
The maximum representable integer using this approach is 1'112'060.
This improves on the JavaScript implementation, which currently
[bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505) when files have more than 65535 lines.