This repository has been archived by the owner on Sep 11, 2024. It is now read-only.

replace graphemer by unicode-segmenter #12617

Closed
wants to merge 1 commit into from

Conversation

@cometkim cometkim commented Jun 13, 2024

I made a Unicode library that is much smaller and faster than graphemer. Check it out: https://github.com/cometkim/unicode-segmenter?tab=readme-ov-file#unicode-segmentergrapheme-vs

  • 2x smaller
  • 6~9x faster
  • ESM/CJS support
  • Latest Unicode data

It ensures compliance with the latest Unicode data by performing tests and fuzzing against the Intl.Segmenter API.
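As a rough illustration of what testing against the Intl.Segmenter API can look like, here is a minimal differential-test sketch. It is not the library's actual test suite: `diffAgainstIntl`, `byCodePoint`, and `randomText` are hypothetical names made up for this example, and the input pool is hand-picked rather than a real fuzzing corpus.

```javascript
// Reference segmentation from the built-in Intl.Segmenter.
const reference = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

function referenceSegments(text) {
  return Array.from(reference.segment(text), (s) => s.segment);
}

// Hypothetical helper: returns null when the candidate agrees with
// Intl.Segmenter, otherwise an object describing the mismatch.
function diffAgainstIntl(candidateSegments, text) {
  const expected = referenceSegments(text);
  const actual = Array.from(candidateSegments(text));
  const same =
    expected.length === actual.length &&
    expected.every((seg, i) => seg === actual[i]);
  return same ? null : { text, expected, actual };
}

// A deliberately naive candidate that splits by code point.
const byCodePoint = (text) => Array.from(text);

// Random inputs drawn from a small pool of tricky code points
// (combining mark, ZWJ, variation selector, emoji, regional indicators, CR/LF).
const pool = [
  'a', '\u0301', '\u200D', '\uFE0F',
  '\u{1F468}', '\u{1F469}', '\u{1F1F0}', '\u{1F1F7}', '\r', '\n',
];

function randomText(length) {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += pool[Math.floor(Math.random() * pool.length)];
  }
  return out;
}

// The naive candidate splits "e" + combining acute into two segments,
// so this reports a mismatch.
console.log(diffAgainstIntl(byCodePoint, 'e\u0301'));
```

In a real setup the candidate would be the library's own segmenter, and `randomText` would be replaced by a proper fuzzer.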

graphemer is still in the bundle as a transitive dependency of the @vector-im/compound-web package, so I opened a PR there as well: element-hq/compound-web#181

Note: The library may possibly replace emojibase-regex too.

@cometkim cometkim requested review from a team as code owners June 13, 2024 20:41
@github-actions github-actions bot added the Z-Community-PR Issue is solved by a community member's PR label Jun 13, 2024
Signed-off-by: Hyeseong Kim <hey@hyeseong.kim>
@t3chguy
Member

t3chguy commented Jun 13, 2024

Based on the comparison in the link, why wouldn't we just switch to Intl.Segmenter altogether? We only support 2 major versions of Chrome & Firefox + 2 minor versions of Safari, so they all support it. Any wins against Segmenter directly?

@t3chguy t3chguy added the T-Task Refactoring, enabling or disabling functionality, other engineering tasks label Jun 13, 2024
@cometkim
Author

cometkim commented Jun 13, 2024

Any wins against Segmenter directly?

Yes, for runtime performance and compatibility (if that matters). Those are the same reasons graphemer was originally used.

Intl.Segmenter is still quite new (especially in Firefox) and underperforms userland implementations. The overhead comes from inefficient bindings to icu4c in both Chrome and Safari. (To be fair, they have improved a lot in very recent versions.)

I'd recommend using Intl.Segmenter where it's acceptable. However, unicode-segmenter has some additional advantages, like the _catBefore field, which helps minimize duplicate emoji matching in user code.
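For reference, a minimal sketch of the "use Intl.Segmenter where it's available" approach. `countGraphemes` is a hypothetical helper name, and the code-point fallback is only an assumption for illustration: it over-counts combining sequences and ZWJ emoji, so it is not a substitute for a real segmenter.

```javascript
// Returns the number of user-perceived characters (grapheme clusters) in `text`.
function countGraphemes(text) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    let count = 0;
    // Each iteration yields one grapheme cluster.
    for (const _segment of segmenter.segment(text)) count++;
    return count;
  }
  // Fallback (assumption for this sketch): count code points instead.
  return Array.from(text).length;
}

console.log(countGraphemes('e\u0301')); // "e" + combining acute accent
```

Where runtime performance or older browsers matter, the fallback branch is the place a userland segmenter would be plugged in.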

@cometkim
Author

compound-web now uses Intl.Segmenter, so I assume Matrix has the same support range.

@t3chguy If you’d like to use Intl.Segmenter here too, I can update the PR content accordingly.

@t3chguy
Member

t3chguy commented Jun 17, 2024

@cometkim the main reason for moving to Segmenter in Compound was bundle size, not for Element but for projects like https://github.com/matrix-org/matrix-authentication-service where it was ~1/3rd of the bundle.

Note: The library may possibly replace emojibase-regex too.

This seems quite interesting in the context of #12582, if it has a way to detect strings that consist entirely of emoji, excluding textual emoji.

cc @robintown

@cometkim
Author

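You can use /\p{Emoji_Presentation}/u instead of emojibase. If Intl.Segmenter is acceptable, Unicode property escapes in RegExp are too.

// `text` is the input string to scan.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

for (const { segment } of segmenter.segment(text)) {
  // True when the grapheme contains a code point with default emoji presentation.
  if (/\p{Emoji_Presentation}/u.test(segment)) {
    const emoji = segment;
  }
}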
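Building on the loop above, a check for "the whole string is emoji" might be sketched as follows. `isEmojiOnly` is a hypothetical helper, and skipping whitespace between emoji is an assumption of this example rather than anything the PR specifies. It relies only on Emoji_Presentation, so it does not account for variation selectors.

```javascript
const graphemes = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const emojiPresentation = /\p{Emoji_Presentation}/u;

// Hypothetical helper: true when `text` contains at least one grapheme and
// every non-whitespace grapheme contains an Emoji_Presentation code point.
function isEmojiOnly(text) {
  let seenEmoji = false;
  for (const { segment } of graphemes.segment(text)) {
    // Assumption for this sketch: whitespace between emoji is allowed.
    if (/^\s+$/u.test(segment)) continue;
    if (!emojiPresentation.test(segment)) return false;
    seenEmoji = true;
  }
  return seenEmoji;
}

console.log(isEmojiOnly('\u{1F600}\u{1F389}')); // two emoji, nothing else
console.log(isEmojiOnly('hi \u{1F600}'));       // mixed text and emoji
```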

However, using unicode-segmenter here gives a small performance gain, as the Emoji_Presentation property is a subset of the Extended_Pictographic property.

// This adds about 1KB gzipped; alternatively, you can use a Unicode RegExp.
import { isEmojiPresentation } from 'unicode-segmenter/emoji';

import { GraphemeCategory, graphemeSegments } from 'unicode-segmenter/grapheme';

for (const { segment, _cat } of graphemeSegments(text)) {
  if (
    // Check the category first to avoid unnecessary lookups on non-emoji characters
    _cat === GraphemeCategory.Extended_Pictographic &&
    isEmojiPresentation(segment.codePointAt(0))
  ) {
    const emoji = segment;
  }
}

@richvdh richvdh requested review from t3chguy and removed request for richvdh and MidhunSureshR June 25, 2024 11:18
@cometkim cometkim deleted the unicode-segmenter branch June 29, 2024 05:47
@robintown
Member

robintown commented Jul 4, 2024

A note: /\p{Emoji_Presentation}/u isn't a full replacement for an emoji regex as it will produce a false negative for ↔️, and a false positive for ✨︎, for instance.
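The two cases can be reproduced with escaped code points, along with one possible workaround. This is only a heuristic sketch, not a full emoji matcher: `looksLikeEmoji` is a hypothetical name, and it simply lets the variation selectors U+FE0E (text) and U+FE0F (emoji) override the default presentation.

```javascript
const emojiPresentation = /\p{Emoji_Presentation}/u;

// U+2194 (left right arrow) defaults to text presentation; U+FE0F requests emoji.
const arrowAsEmoji = '\u2194\uFE0F';
// U+2728 (sparkles) defaults to emoji presentation; U+FE0E requests text.
const sparklesAsText = '\u2728\uFE0E';

console.log(emojiPresentation.test(arrowAsEmoji));   // false: the false negative
console.log(emojiPresentation.test(sparklesAsText)); // true: the false positive

// Heuristic (assumption for this sketch): honour variation selectors first,
// then fall back to the default presentation of the segment.
function looksLikeEmoji(segment) {
  if (segment.includes('\uFE0E')) return false;
  if (segment.includes('\uFE0F')) return /\p{Emoji}/u.test(segment);
  return emojiPresentation.test(segment);
}

console.log(looksLikeEmoji(arrowAsEmoji));   // true
console.log(looksLikeEmoji(sparklesAsText)); // false
```

This still treats sequences such as a digit followed by U+FE0F as emoji, so a complete solution would need the emoji sets defined by Unicode rather than per-code-point properties.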

@cometkim
Author

cometkim commented Jul 4, 2024

I see. I think adding support for Emoji Sets to unicode-segmenter/emoji is a good idea. Thanks.
