replace graphemer by unicode-segmenter #12617

cometkim · 2024-06-13T20:41:08Z

I made a Unicode library that is much smaller and faster than graphemer. Check it out: https://github.com/cometkim/unicode-segmenter?tab=readme-ov-file#unicode-segmentergrapheme-vs

2x smaller
6~9x faster
ESM/CJS support
Latest Unicode data

It ensures compliance with the latest Unicode data by performing tests and fuzzing against the Intl.Segmenter API.
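
For illustration, differential testing of this kind might look roughly like the sketch below. This is not the library's actual test suite: `referenceSplit`, `findMismatches`, `naiveSplit`, and the sample strings are hypothetical names and data chosen for the example, and only `Intl.Segmenter` itself is a real API here.

```javascript
// Reference implementation: the built-in grapheme segmenter.
const reference = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

function referenceSplit(text) {
  return Array.from(reference.segment(text), (s) => s.segment);
}

// Hypothetical helper: run a candidate splitter over sample strings and
// collect every input where its output differs from Intl.Segmenter.
function findMismatches(candidateSplit, samples) {
  const mismatches = [];
  for (const text of samples) {
    const expected = referenceSplit(text);
    const actual = candidateSplit(text);
    const same =
      expected.length === actual.length &&
      expected.every((segment, i) => segment === actual[i]);
    if (!same) mismatches.push({ text, expected, actual });
  }
  return mismatches;
}

// A deliberately naive candidate that splits by code point, not by grapheme.
const naiveSplit = (text) => [...text];

const samples = [
  'abc',                 // plain ASCII
  'e\u0301',             // "e" + combining acute accent
  '\u{1F44D}\u{1F3FD}',  // thumbs up + skin tone modifier
];

console.log(findMismatches(naiveSplit, samples).map((m) => m.text));
```

A real fuzzer would generate random code point sequences instead of a fixed sample list; the comparison step stays the same.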

graphemer is still in the bundle as a transitive dependency of the @vector-im/compound-web package, so I opened a PR there as well: element-hq/compound-web#181

Note: The library may possibly replace emojibase-regex too.

Signed-off-by: Hyeseong Kim <hey@hyeseong.kim>

t3chguy · 2024-06-13T22:21:32Z

Based on the comparison in the link, why wouldn't we just switch to Intl.Segmenter altogether? We only support 2 major versions of Chrome & Firefox + 2 minor versions of Safari, so they all support it. Any wins against Segmenter directly?
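
For context, using Intl.Segmenter directly for grapheme splitting could look like the minimal sketch below. The helper name `splitGraphemes` and the sample string are made up for the example, and the locale is left as `undefined` on the assumption that the runtime default is acceptable.

```javascript
// One shared segmenter instance; constructing it is the expensive part.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: return the user-perceived characters of a string.
function splitGraphemes(text) {
  return Array.from(segmenter.segment(text), ({ segment }) => segment);
}

// "e" + combining acute accent, then thumbs up + skin tone modifier.
const sample = 'e\u0301\u{1F44D}\u{1F3FD}';

console.log(sample.length);                  // UTF-16 code units: 6
console.log(splitGraphemes(sample).length);  // graphemes: 2
```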

cometkim · 2024-06-13T23:08:23Z

> Any wins against Segmenter directly?

Yes, for runtime performance and compatibility (if that matters). Those are the same reasons graphemer was originally used.

Intl.Segmenter is still too new (especially in Firefox) and underperforms userland implementations. This comes from inefficient bindings to icu4c, in both Chrome and Safari. (To be fair, they have improved a lot in very recent versions.)

I'd recommend using Intl.Segmenter where it's ok. However, using unicode-segmenter has some additional advantages, such as the _catBefore field, which can minimize duplicate emoji matching in user code.

cometkim · 2024-06-16T18:19:20Z

compound-web now uses Intl.Segmenter, so I assume Matrix has the same support range.

@t3chguy If you’d like to use Intl.Segmenter here too, I can update the PR content accordingly.

t3chguy · 2024-06-17T09:09:29Z

@cometkim the main reason for moving to Segmenter in Compound was bundle size, not for Element but for projects like https://github.com/matrix-org/matrix-authentication-service where it was ~1/3rd of the bundle.

> Note: The library may possibly replace emojibase-regex too.

This seems quite interesting in the context of #12582, if it has a way to detect strings which are entirely emoji, excluding textual emoji.
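
As a rough idea of what such a check involves, here is a heuristic sketch built on Intl.Segmenter and Unicode property escapes. It is not taken from unicode-segmenter or emojibase, the function names are hypothetical, and it does not implement the full Unicode emoji sets; it only treats U+FE0E as "text presentation" and U+FE0F as "emoji presentation".

```javascript
const graphemes = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Heuristic: does this single grapheme render as an emoji?
function isEmojiGrapheme(grapheme) {
  // U+FE0E (variation selector 15) explicitly requests text presentation.
  if (grapheme.includes('\uFE0E')) return false;
  // Code points that default to emoji presentation.
  if (/\p{Emoji_Presentation}/u.test(grapheme)) return true;
  // Text-default emoji promoted by U+FE0F (variation selector 16).
  return /\p{Emoji}\uFE0F/u.test(grapheme);
}

// Hypothetical helper: true only if every grapheme passes the check above.
function isEntirelyEmoji(text) {
  const segments = Array.from(graphemes.segment(text), (s) => s.segment);
  return segments.length > 0 && segments.every(isEmojiGrapheme);
}

console.log(isEntirelyEmoji('\u{1F389}\u2728')); // party popper + sparkles
console.log(isEntirelyEmoji('hi \u{1F389}'));    // mixed text and emoji
```

Whitespace between emoji counts as non-emoji here; whether that is the desired behaviour depends on the use case.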

cc @robintown

cometkim · 2024-06-17T11:06:52Z

You can use /\p{Emoji_Presentation}/u instead of emojibase. If Intl.Segmenter is ok, Unicode RegExp is ok too.

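// `text` is the input string to scan.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

for (const { segment } of segmenter.segment(text)) {
  if (/\p{Emoji_Presentation}/u.test(segment)) {
    const emoji = segment; // grapheme containing an Emoji_Presentation code point
  }
}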

However, using unicode-segmenter here gives a small performance gain, as the Emoji_Presentation property is a subset of the Extended_Pictographic property.

// This adds 1KB gzipped size; alternatively, you can use a Unicode RegExp.
import { isEmojiPresentation } from 'unicode-segmenter/emoji';

import { GraphemeCategory, graphemeSegments } from 'unicode-segmenter/grapheme';

for (const { segment, _cat } of graphemeSegments(text)) {
  if (
    // Check the category first to avoid unnecessary lookups on non-emoji characters
    _cat === GraphemeCategory.Extended_Pictographic &&
    isEmojiPresentation(segment.codePointAt(0))
  ) {
    const emoji = segment;
  }
}

robintown · 2024-07-04T17:08:31Z

A note: /\p{Emoji_Presentation}/u isn't a full replacement for an emoji regex as it will produce a false negative for ↔️, and a false positive for ✨︎, for instance.
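
The two cases can be reproduced with the short sketch below. It assumes the characters in the comment above are U+2194 followed by U+FE0F (emoji variation selector) and U+2728 followed by U+FE0E (text variation selector); the variable names are only for illustration.

```javascript
const emojiPresentation = /\p{Emoji_Presentation}/u;

// LEFT RIGHT ARROW is text-default; U+FE0F requests emoji presentation.
const arrowAsEmoji = '\u2194\uFE0F';
// SPARKLES is emoji-default; U+FE0E requests text presentation.
const sparklesAsText = '\u2728\uFE0E';

// The property only describes the base code point's default presentation,
// so the variation selectors are ignored in both cases.
console.log(emojiPresentation.test(arrowAsEmoji));   // false (false negative)
console.log(emojiPresentation.test(sparklesAsText)); // true  (false positive)
```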

cometkim · 2024-07-04T17:16:36Z

I see. I think adding support for Emoji Sets to unicode-segmenter/emoji is a good idea. Thanks.

cometkim requested review from a team as code owners June 13, 2024 20:41

cometkim requested review from MidhunSureshR and richvdh June 13, 2024 20:41

github-actions bot added the Z-Community-PR Issue is solved by a community member's PR label Jun 13, 2024

replace graphemer by unicode-segmenter

69df1c5

Signed-off-by: Hyeseong Kim <hey@hyeseong.kim>

t3chguy added the T-Task Refactoring, enabling or disabling functionality, other engineering tasks label Jun 13, 2024

github-actions bot deployed to EndToEndTests June 13, 2024 22:43

github-actions bot deployed to Netlify June 13, 2024 22:43

cometkim mentioned this pull request Jun 14, 2024

replace graphemer by unicode-segmenter element-hq/compound-web#181

Closed

richvdh requested review from t3chguy and removed request for richvdh and MidhunSureshR June 25, 2024 11:18

t3chguy mentioned this pull request Jun 25, 2024

Switch from graphemer to Intl.Segmenter #12697

Merged

t3chguy closed this in #12697 Jun 26, 2024

cometkim deleted the unicode-segmenter branch June 29, 2024 05:47

wodny mentioned this pull request Jul 8, 2024

Intl.Segmenter is not a constructor - unable to use Element Web on Firefox ESR 115.12.0esr (64-bit) element-hq/element-web#27682

Closed

theres-waldo mentioned this pull request Jul 16, 2024

Make sure Element Web works with Firefox ESR element-hq/element-web#27684

Closed

MarcWadai mentioned this pull request Jul 17, 2024

Upgrade element 1.11.72 tchapgouv/tchap-web-v4#1059

Closed
