Description
"Surrogate pairs are strings that contain a supplemental code point (especially emojis) that cause diff indices to be offset. It can either mess up the text or cause DMP to error (within toDelta/fromDelta)." See
- JavaScript implementation crashes on Unicode code points google/diff-match-patch#10
- Diff breaks unicode characters for emojis google/diff-match-patch#59
- patch_obj.toString blows up with emojis google/diff-match-patch#68
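
For context, here is a minimal sketch of the underlying problem. It does not call DMP itself. The helper names `commonPrefixLength` and `canEncode` are made up for illustration. It only shows that a diff working on UTF-16 code units can split an emoji in the middle, and that `encodeURI` (which, as far as I can tell, is what the JavaScript `diff_toDelta` uses) throws on the resulting lone surrogate.

```typescript
// U+1F600 and U+1F601 both lie outside the Basic Multilingual Plane,
// so JavaScript stores each as two UTF-16 code units (a surrogate pair).
const emoji = "\uD83D\uDE00"; // U+1F600
const before = "a\uD83D\uDE00"; // "a" + U+1F600
const after = "a\uD83D\uDE01"; // "a" + U+1F601, same high surrogate

// Hypothetical helper: common prefix length counted in code units,
// which is how a code-unit-based diff would see the two strings.
function commonPrefixLength(x: string, y: string): number {
  let i = 0;
  while (i < x.length && i < y.length && x.charCodeAt(i) === y.charCodeAt(i)) {
    i++;
  }
  return i;
}

// Hypothetical helper: true if encodeURI accepts the string,
// false if it throws (it throws URIError on a lone surrogate).
function canEncode(s: string): boolean {
  try {
    encodeURI(s);
    return true;
  } catch (err) {
    return false;
  }
}

// The shared prefix is "a" plus the high surrogate 0xD83D,
// so the boundary falls in the middle of the emoji.
const prefixLen = commonPrefixLength(before, after);
const danglingPrefix = before.slice(0, prefixLen);

console.log(emoji.length); // number of code units in one emoji
console.log(prefixLen);
console.log(canEncode(before), canEncode(danglingPrefix));
```

Any fix has to make sure diff boundaries never fall between the two halves of a surrogate pair, or has to avoid encoding such fragments on their own.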
I don't know whether, or to what extent, this impacts CoCalc. So far I've never been aware of such an issue. Maybe (?) when CoCalc hits it, an error is thrown and our diff algorithm generates a very large diff that is just "replace the entire document by this other one", so for us things are inefficient but not broken. I don't know. It might also be very unlikely to hit in the context of Jupyter notebooks, where most text is ASCII, and in markdown, where we usually write emojis as :thing: instead of Unicode.
In any case, I'll look into this and report back here. Since the original author of DMP no longer maintains it, it could also make sense to modernize the library and publish a new, independently supported version that contains fixes for the above issue. As some motivation, the @cocalc/util package has a copy of DMP with at least one important bugfix (from my point of view). That's in https://github.com/sagemathinc/cocalc/blob/master/src/packages/util/dmp.js
NOTE: the file dmp.js had a GPL header applied to it by an automated script that @haraldschilly wrote. However, I just fixed that and reverted the license header to the original Apache 2.0 license.