TextPositionSelector, thoughts about Unicode code point vs. UTF16 code unit #350

danielweck · 2016-08-22T13:24:37Z

In the EPUB3 CFI (Canonical Fragment Identifier) specification, which has a possible use in "Open Annotation in EPUB" ( http://www.idpf.org/epub/oa/ ), character-level offsets are defined as UTF16 code units, not Unicode code points.

Current implementations of CFI (parsing / processing libraries, and text highlighting / rendering tools) that are written in JavaScript benefit from direct code unit support (i.e. no handling / translation of Unicode surrogate pairs, etc.) in the DOM Range API and in the ECMAScript string API. See my comment here: w3c/epub-specs#555 (comment)

So, although this design approach seems to work pretty well in EPUB3 / XHTML5, I wonder whether it is also relevant in the broader Open Web Platform context. For example, would a JavaScript implementation of TextPositionSelector ( http://w3c.github.io/web-annotation/selector-note/#TextPositionSelector_def ) need to translate back and forth between Unicode code points and UTF16 code units, in order for the data to flow between the serialization format and the consuming web APIs?
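To make the translation in question concrete, here is a minimal sketch of what converting between the two offset units could look like in JavaScript. The function names `codePointToCodeUnitOffset` and `codeUnitToCodePointOffset` are hypothetical and do not come from any specification or library; only the standard `String.prototype.codePointAt` method is used.

```javascript
// Convert an offset counted in Unicode code points into an offset counted in
// UTF-16 code units (the unit used by JavaScript strings and DOM Ranges).
function codePointToCodeUnitOffset(text, codePointOffset) {
  let units = 0;
  let points = 0;
  while (points < codePointOffset && units < text.length) {
    const cp = text.codePointAt(units);
    // Code points above U+FFFF occupy two code units (a surrogate pair).
    units += cp > 0xffff ? 2 : 1;
    points += 1;
  }
  return units;
}

// Convert an offset counted in UTF-16 code units into an offset counted in
// code points. An offset that falls inside a surrogate pair is counted as if
// it pointed just past the pair.
function codeUnitToCodePointOffset(text, codeUnitOffset) {
  let units = 0;
  let points = 0;
  while (units < codeUnitOffset && units < text.length) {
    const cp = text.codePointAt(units);
    units += cp > 0xffff ? 2 : 1;
    points += 1;
  }
  return points;
}
```

For the string `"a\u{1F600}b"` (three code points, four code units), `codePointToCodeUnitOffset(text, 2)` returns 3 and `codeUnitToCodePointOffset(text, 3)` returns 2. Both functions walk the string from the start, so the cost grows linearly with the offset, which is where the performance concern comes from.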

Any other thoughts?

PS, I am "cross-posting" here too w3c/epub-specs#555 (comment)

iherman · 2016-08-22T13:40:32Z

@r12a @aphillips, any comment on this; see also the separate, but related issue linked from the note...

Cc @azaroth42

azaroth42 · 2016-08-22T15:19:10Z

We knew about this at the time. The problem is that other fragment identifiers go the opposite way (e.g. plain text RFC says code points) and hence there's no way to have both be possible at once. So, we went with the results of the i18n discussion, that code points (while more difficult to implement in current javascript frameworks) were more correct and more useful.

tilgovi · 2016-08-22T15:21:03Z

I've thought about this a little. I'm not sure just how bad the performance would be handling code units, but I opened an issue a while back on my text seeking utility library: tilgovi/dom-seek#1

If someone wants to implement that we can benchmark.

danielweck · 2016-08-22T15:35:02Z

oh, nice one :)
https://github.com/tilgovi/dom-anchor-text-position/blob/master/src/index.js

tkanai · 2016-08-23T01:24:54Z

@danielweck Strictly speaking, the EPUB CFI spec does not support any codes in the surrogate blocks, and it would be natural for EPUB CFI to use UTF-16 code units. I think a TextPositionSelector based on Unicode code points could be a good option for those who would like to apply a code-point-based selector to EPUB.

danielweck · 2016-08-23T10:33:32Z

Thanks @tkanai I agree that higher-level processing on Unicode 'code points' basis has its benefits (notably, text selections / character ranges are functionally closer to how human-readable languages / scripts are structured), but I was wondering about implementation feasibility and costs (in particular: performance).

The use of UTF16 'code units' in EPUB3 CFI is consistent with the overall "low level" design (e.g. canonical syntax for XML element path based on numbered node references). So yes, CFI character ranges are totally unaware of Unicode "subtleties" such as grapheme clusters and surrogate pairs, which means that a CFI-authoring user interface must capture and constrain/adjust text selections in such a way that they make logical sense from the user's perspective (whilst the underlying CFI processor itself does not need to be "Unicode aware" to that degree). Web browsers implement high-level text selection pretty well already, so the responsibility of a typical CFI processing library basically boils down to handling the low-level UTF16-aware (UCS2) output from DOM Ranges or the JavaScript string API (no need for sophisticated Punycode-like Unicode utilities).
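As an illustration of the "constrain/adjust" step mentioned above, the following sketch checks whether a code unit offset lands between the two halves of a surrogate pair and, if so, moves it back to the start of the pair. The helper names `splitsSurrogatePair` and `snapToCodePointBoundary` are hypothetical, and the sketch deliberately ignores grapheme clusters, which would need additional handling.

```javascript
// True when `offset` falls between the high and low halves of a surrogate pair.
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) return false;
  const prev = text.charCodeAt(offset - 1);
  const next = text.charCodeAt(offset);
  const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff;
  const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
  return prevIsHigh && nextIsLow;
}

// Move an offset that splits a pair back to the start of that pair;
// any other offset is returned unchanged.
function snapToCodePointBoundary(text, offset) {
  return splitsSurrogatePair(text, offset) ? offset - 1 : offset;
}
```

With `"a\u{1F600}b"`, offset 2 sits inside the pair and is snapped to 1, while offsets 1 and 3 are already on code point boundaries and are left alone.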

So, I am by no means claiming that the CFI model is applicable / superior to TextPositionSelector, I am just wondering about the pros and cons :)

azaroth42 · 2016-09-30T17:17:26Z

I think we can close the issue as a duplicate of all the previous times we've gone around on code point vs code unit? :)

danielweck · 2016-09-30T19:07:30Z

sure :)

aphillips · 2016-09-30T19:20:55Z

@azaroth42: +1

While working in code points is awesome, the reality of the Web is often that of UTF-16 code units because of DOM String. While the APIs and data structures based on UTF-16 code units do not directly insulate users from problems with surrogate pairs (and neither surrogate handling nor code point counting deals at all with grapheme clustering), proper character handling can and should still be provided by higher-level implementations and protocols.

No process needs to deal with surrogate code points (that is, character values in the range U+D800 to U+DFFF). There is no reason to state, however, that just because offsets are defined in UTF-16 code units, a process cannot handle supplementary characters (i.e. characters represented by a surrogate pair of code units).
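A small sketch of this point, using only standard JavaScript string operations: the offsets below are counted in UTF-16 code units, yet the selected range still yields an intact supplementary character. The variable names are illustrative only.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so it is stored as a
// surrogate pair (two UTF-16 code units) in a JavaScript string.
const sample = "a\u{1F600}b";

// String length counts code units; iterating the string yields code points.
const codeUnitCount = sample.length;
const codePointCount = Array.from(sample).length;

// A range expressed in code units, [1, 3), covers both halves of the pair,
// so the slice is the complete supplementary character.
const selected = sample.slice(1, 3);
const selectedCodePoint = selected.codePointAt(0);
```

Here `codeUnitCount` is 4, `codePointCount` is 3, and `selectedCodePoint` is `0x1F600`. The range only goes wrong if one of its endpoints falls between the two halves of the pair.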

I18N WG commented about an identical issue at TPAC, but I'm at a loss to put my finger on it just now.

azaroth42 · 2016-09-30T19:22:25Z

Thanks both!

danielweck mentioned this issue Aug 22, 2016

Ambiguity in interpretation of definition for epubcfi character offset w3c/epub-specs#555

Closed

azaroth42 added the i18n-review label Sep 30, 2016

azaroth42 closed this as completed Sep 30, 2016

danielweck mentioned this issue Oct 11, 2016

What is the navigator pointer/location approach in Readium-2 readium/architecture#9

Closed

plehegar added the i18n-tracker label and removed the i18n-review label Mar 11, 2021
