Add support for sentence break suppression (-u-ss) #3927

Open
sffc opened this issue Aug 23, 2023 · 3 comments
Labels: C-segmentation (Component: Segmentation), S-small (Size: One afternoon, small bug fix or enhancement), U-ecma402 (User: ECMA-402 compatibility)

Comments

@sffc
Member

sffc commented Aug 23, 2023

Sentence break suppression (the `-u-ss` locale extension) is specified in Unicode UTS 35, and there is a proposal to add it to ECMA-402. We should support it in ICU4X.

sffc added the C-segmentation (Component: Segmentation) and S-small (Size: One afternoon, small bug fix or enhancement) labels Aug 23, 2023
sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Sep 21, 2023
@sffc
Member Author

sffc commented Sep 21, 2023

Assigning to @eggrobin since you're already in the thick of sentence segmentation.

sffc added the U-ecma402 (User: ECMA-402 compatibility) label Sep 21, 2023
@sffc
Member Author

sffc commented Sep 18, 2024

To be clear, we're talking about this data:

https://github.com/unicode-org/cldr/blob/main/common/segments/en.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ldml SYSTEM "../../common/dtd/ldml.dtd">
<ldml>
  <identity>
    <version number="$Revision$"/>
    <language type="en"/>
  </identity>
  <segmentations>
    <segmentation type="SentenceBreak">
      <!--From ULI data, http://uli.unicode.org-->
      <suppressions type="standard">
        <suppression>L.P.</suppression>
        <suppression>Alt.</suppression>
        <suppression>Approx.</suppression>
        <suppression>E.G.</suppression>
        <suppression>O.</suppression>
        <suppression>Maj.</suppression>
        <suppression>Misc.</suppression>

@sffc
Member Author

sffc commented Sep 18, 2024

CC @makotokato

If ICU4C already has a trie for this data, you could re-use it. Otherwise, it's perfectly fine to build a trie in ICU4X datagen. You can use zerotrie::ZeroTriePerfectHash, for example.
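
To make the trie idea concrete, here is a minimal sketch of how a suppression lookup could work. It does not use zerotrie or any ICU4X API: `SuppressionTrie`, `from_suppressions`, and `ends_with_suppression` are hypothetical names, and the trie is a plain std-only stand-in for whatever datagen would actually emit. The sketch assumes exact, case-sensitive matching and adds a word-boundary check before the matched abbreviation. Both are assumptions of this example, not something the spec text quoted above confirms.

```rust
use std::collections::HashMap;

/// One node of a simple character trie (hypothetical, std-only).
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    /// True if the path from the root to this node spells a full suppression.
    terminal: bool,
}

/// Suppressions are stored reversed, so a lookup can walk backward
/// from a candidate sentence break.
#[derive(Default)]
struct SuppressionTrie {
    root: Node,
}

impl SuppressionTrie {
    /// Builds the trie from suppression strings such as "Maj." or "Approx.".
    fn from_suppressions<'a>(items: impl IntoIterator<Item = &'a str>) -> Self {
        let mut trie = Self::default();
        for item in items {
            let mut node = &mut trie.root;
            for ch in item.chars().rev() {
                node = node.children.entry(ch).or_default();
            }
            node.terminal = true;
        }
        trie
    }

    /// Returns true if the text before a candidate break ends with a
    /// suppression, in which case the break would be skipped.
    /// Trailing whitespace is ignored, and the match must start at the
    /// beginning of the text or after a non-alphanumeric character
    /// (an assumption of this sketch).
    fn ends_with_suppression(&self, preceding: &str) -> bool {
        let mut node = &self.root;
        let mut chars = preceding.trim_end().chars().rev().peekable();
        while let Some(ch) = chars.next() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => return false,
            }
            if node.terminal && chars.peek().map_or(true, |c| !c.is_alphanumeric()) {
                return true;
            }
        }
        false
    }
}
```

A segmenter would call `ends_with_suppression` with the text up to each candidate sentence break and drop the break when it returns true. With the English entries shown earlier in the thread, the break after "Promoted to Maj. " would be suppressed, while the one after "It ended. " would be kept.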
