Is our realtime sync implementation for strings affected by the surrogate-pair issue in diff-match-patch? Yes, but not much. #6327

Open
@williamstein

"Surrogate pairs are strings that contain a supplemental code point (especially emojis) that cause diff indices to be offset. It can either mess up the text or cause DMP to error (within toDelta/fromDelta)." See

I don't know whether, or to what extent, this affects CoCalc; so far I've never been aware of such an issue. Maybe (?) when CoCalc hits it, an error is thrown and our diff algorithm generates a very large diff that is just "replace the entire document by this other one", so for us things would be inefficient, but not broken. I don't know. It might also be very unlikely to hit in the context of Jupyter notebooks, where most text is ASCII, and in Markdown, where we usually write emojis as :thing: instead of as Unicode characters.
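To make the quoted description concrete, here is a minimal sketch of how a diff computed over UTF-16 code units can split a surrogate pair. It does not use DMP itself: `commonPrefixUnits` and `encodes` are hypothetical helpers written only for this illustration, and the assumption (from my reading of the upstream JavaScript source) is that `diff_toDelta` passes inserted text through `encodeURI`, which throws on a lone surrogate.

```typescript
// JavaScript strings are sequences of UTF-16 code units, so a code point
// above U+FFFF (most emojis) occupies two units: a surrogate pair.
const a = "ok \uD83D\uDE00"; // "ok " + U+1F600 (grinning face)
const b = "ok \uD83D\uDE01"; // "ok " + U+1F601 (beaming face)

// Hypothetical helper (NOT DMP's code): length of the common prefix,
// compared one UTF-16 code unit at a time.
function commonPrefixUnits(x: string, y: string): number {
  let i = 0;
  while (i < x.length && i < y.length && x.charCodeAt(i) === y.charCodeAt(i)) {
    i++;
  }
  return i;
}

// Both emojis share the high surrogate \uD83D, so the "common" prefix
// ends in the middle of the pair.
const prefixLength = commonPrefixUnits(a, b);

// What a code-unit-level diff would treat as the inserted text:
// only the low surrogate \uDE01, which is not a valid character by itself.
const inserted = b.slice(prefixLength);

// Hypothetical helper: reports whether encodeURI accepts the string.
// encodeURI throws a URIError when it meets an unpaired surrogate.
function encodes(s: string): boolean {
  try {
    encodeURI(s);
    return true;
  } catch {
    return false;
  }
}

console.log(prefixLength, inserted.length, encodes(b), encodes(inserted));
```

The whole string `b` encodes fine, while the fragment produced by splitting the pair does not, which matches the "error within toDelta/fromDelta" half of the description. The other half (garbled text) would correspond to a patch being applied at an index that lands between the two halves of a pair.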

In any case, I'll look into this and report back here. Since the original author of DMP no longer maintains it, it could also make sense to modernize the library and publish a new, independently supported version that contains fixes for the above issue. As some motivation, the @cocalc/util package already has a copy of DMP with at least one important bugfix (from my point of view). That's in https://github.com/sagemathinc/cocalc/blob/master/src/packages/util/dmp.js

NOTE: the file dmp.js had a GPL header applied to it by some automated script that @haraldschilly wrote. However, I just fixed that and reverted the license header back to the original Apache V2 license.
