TextPositionSelector, thoughts about Unicode code *point* vs. UTF16 code *unit* #350

Closed
danielweck opened this issue Aug 22, 2016 · 10 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@danielweck
Member

danielweck commented Aug 22, 2016

Hello all,
CC @iherman @azaroth42

In the EPUB3 CFI (Canonical Fragment Identifier) specification, which has a possible use in "Open Annotation in EPUB" ( http://www.idpf.org/epub/oa/ ), character-level offsets are defined as UTF16 code units, not Unicode code points.

Current implementations of CFI (parsing / processing libraries, and text highlighting / rendering tools) that are written in Javascript benefit from direct code unit support (i.e. no handling / translation of Unicode surrogate pairs, etc.) in the DOM Range API and in the ECMAScript string API. See my comment here: w3c/epub-specs#555 (comment)

So, although this design approach seems to work pretty well in EPUB3 / XHTML5, I wonder whether this is also relevant in the broader Open Web Platform context. For example, would a Javascript implementation of TextPositionSelector ( http://w3c.github.io/web-annotation/selector-note/#TextPositionSelector_def ) need to translate back and forth between Unicode code points and UTF16 code units, in order for the data to flow between the serialization format and the consuming web APIs?
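To make the "translate back and forth" question concrete, here is a minimal sketch of what such a conversion could look like in JavaScript. The function names `codePointToCodeUnitOffset` and `codeUnitToCodePointOffset` are hypothetical (they are not part of any specification or library mentioned in this thread), and the sketch assumes the text is already available as a single JavaScript string rather than spread across DOM text nodes.

```javascript
// Hypothetical helpers, for illustration only.
// Both walk the string once, so each call is O(n) in the offset.

// Convert an offset counted in Unicode code points into an offset
// counted in UTF-16 code units (what String and DOM Range use).
function codePointToCodeUnitOffset(text, codePointOffset) {
  let units = 0;
  let points = 0;
  while (points < codePointOffset && units < text.length) {
    const cp = text.codePointAt(units);
    // Supplementary code points (above U+FFFF) occupy two code units.
    units += cp > 0xffff ? 2 : 1;
    points += 1;
  }
  return units;
}

// Convert an offset counted in UTF-16 code units into an offset
// counted in Unicode code points. An offset that falls in the middle
// of a surrogate pair is counted as if it were after the pair.
function codeUnitToCodePointOffset(text, codeUnitOffset) {
  let units = 0;
  let points = 0;
  while (units < codeUnitOffset && units < text.length) {
    const cp = text.codePointAt(units);
    units += cp > 0xffff ? 2 : 1;
    points += 1;
  }
  return points;
}

// "a", U+1F600 (one code point, two code units), "b"
const sample = "a\u{1F600}b";
console.log(codePointToCodeUnitOffset(sample, 2)); // 3
console.log(codeUnitToCodePointOffset(sample, 3)); // 2
```

The linear scan is the cost being asked about: without an index or cache, every conversion has to walk the text from the start.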

Any other thoughts?

PS, I am "cross-posting" here too w3c/epub-specs#555 (comment)

@iherman
Member

iherman commented Aug 22, 2016

@r12a @aphillips, any comment on this; see also the separate, but related issue linked from the note...

Cc @azaroth42

@azaroth42
Collaborator

We knew about this at the time. The problem is that other fragment identifiers go the opposite way (e.g. plain text RFC says code points) and hence there's no way to have both be possible at once. So, we went with the results of the i18n discussion, that code points (while more difficult to implement in current javascript frameworks) were more correct and more useful.

@tilgovi
Contributor

tilgovi commented Aug 22, 2016

I've thought about this a little. I'm not sure just how bad the performance would be handling code units, but I opened an issue a while back on my text seeking utility library: tilgovi/dom-seek#1

If someone wants to implement that we can benchmark.

@danielweck
Member Author

@tkanai
Contributor

tkanai commented Aug 23, 2016

@danielweck Strictly speaking, the EPUB CFI spec does not support any codes in the surrogate blocks, so it would be natural for EPUB CFI to use UTF-16 code units. I think a TextPositionSelector based on Unicode code points could be a good option for those who would like to apply a code-point-based selector to EPUB.

@danielweck
Member Author

Thanks @tkanai I agree that higher-level processing on Unicode 'code points' basis has its benefits (notably, text selections / character ranges are functionally closer to how human-readable languages / scripts are structured), but I was wondering about implementation feasibility and costs (in particular: performance).

The use of UTF16 'code units' in EPUB3 CFI is consistent with the overall "low level" design (e.g. canonical syntax for XML element path based on numbered node references). So yes, CFI character ranges are totally unaware of Unicode "subtleties" such as grapheme clusters and surrogate pairs, which means that a CFI-authoring user interface must capture and constrain/adjust text selections in such a way that they make logical sense from the user's perspective (whilst the underlying CFI processor itself does not need to be "Unicode aware" to that degree). Web browsers implement high-level text selection pretty well already, so the responsibility of a typical CFI processing library basically boils down to handling the low-level UTF16-aware (UCS2) output from DOM Ranges or the JavaScript string API (no need for sophisticated Punycode-like Unicode utilities).
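As a rough illustration of the "constrain/adjust" step, an authoring layer could snap a code unit offset so that it never lands between the two halves of a surrogate pair. This is only a sketch: `snapToCodePointBoundary` and the two surrogate checks are hypothetical names, not taken from the CFI specification or any existing CFI library, and the sketch does not attempt grapheme-cluster boundaries.

```javascript
// Hypothetical helpers, for illustration only.
function isHighSurrogate(unit) {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isLowSurrogate(unit) {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Return a code unit offset that does not split a surrogate pair.
// If `offset` sits between a high and a low surrogate, move it back
// by one so it points at the start of the pair.
function snapToCodePointBoundary(text, offset) {
  // Clamp to the valid range first; the ends are always boundaries.
  const clamped = Math.max(0, Math.min(offset, text.length));
  if (clamped === 0 || clamped === text.length) {
    return clamped;
  }
  const prev = text.charCodeAt(clamped - 1);
  const next = text.charCodeAt(clamped);
  return isHighSurrogate(prev) && isLowSurrogate(next) ? clamped - 1 : clamped;
}

// "a", U+1F600 (code units 1 and 2), "b"
const cfiSample = "a\u{1F600}b";
console.log(snapToCodePointBoundary(cfiSample, 2)); // 1
console.log(snapToCodePointBoundary(cfiSample, 3)); // 3
```

Snapping backwards is an arbitrary choice here; for the end of a range, an implementation might prefer to snap forwards so that the whole character is included.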

So, I am by no means claiming that the CFI model is applicable / superior to TextPositionSelector, I am just wondering about the pros and cons :)

@azaroth42
Collaborator

I think we can close the issue as a duplicate of all the previous times we've gone around on code point vs code unit? :)

@danielweck
Member Author

sure :)

@aphillips

@azaroth42: +1

While working in code points is awesome, the reality of the Web is often that of UTF-16 code units because of DOM String. While the APIs and data structures based on UTF-16 code units do not directly insulate users from problems with surrogate pairs (and neither surrogate handling nor code point counting deals at all with grapheme clustering), proper character handling can and should still be provided by higher-level implementations and protocols.

No process needs to deal with surrogate code points (that is, character values in the range U+D800 to U+DFFF). There is no reason to state, however, that just because offsets are defined in UTF-16 code units, a process cannot handle supplementary characters (i.e. characters represented by a surrogate pair of code units).
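A small sketch may help show both points: offsets expressed in UTF-16 code units can still address a supplementary character correctly, and neither code unit nor code point counts match what a user perceives as one character. The variable and function names below are made up for this example.

```javascript
// Illustrative values only; names are hypothetical.
const emoji = "\u{1F600}";   // one supplementary code point, two code units
const accented = "e\u0301";  // "e" + combining acute accent: one perceived character

// Count code points by iterating the string (string iteration is by code point).
function countCodePoints(text) {
  return Array.from(text).length;
}

// Code unit offsets 1..3 select the whole surrogate pair, so the
// supplementary character comes out intact.
const line = "x" + emoji + "y";
const selected = line.slice(1, 3);

console.log(emoji.length, countCodePoints(emoji));       // 2 1
console.log(accented.length, countCodePoints(accented)); // 2 2
console.log(selected === emoji);                         // true
```

The accented example is the grapheme-cluster case: counting code points instead of code units does not help there, so that concern has to be handled at a higher level either way.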

I18N WG commented about an identical issue at TPAC, but I'm at a loss to put my finger on it just now.

@azaroth42
Collaborator

Thanks both!

@plehegar plehegar added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. and removed i18n-review labels Mar 11, 2021

7 participants