
Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes. #537

Open · Tavmjong opened this issue Aug 22, 2018 · 62 comments
Labels: i18n-needs-resolution, Needs editing, Needs tests, Text chapter

@Tavmjong
Contributor

In its descriptions of these attributes, SVG 1.1 dictates that the 'x', 'y', etc. values apply to characters (as defined in XML). See:

https://www.w3.org/TR/SVG11/text.html#TextElement

https://www.w3.org/TR/SVG11/text.html#TSpanElement

SVG 1.1 also dictates that the number returned by getNumberOfChars() should be a count according to DOM 3. Non-rendered characters are to be included in the count.

https://www.w3.org/TR/SVG11/text.html#InterfaceSVGTextContentElement

DOM 3 defines strings in terms of 16-bit units; Unicode code points outside the Basic Multilingual Plane consist of two such units. See:

https://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#DOMString
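
For illustration, here is a minimal JavaScript sketch of the difference (the emoji is just an arbitrary non-BMP example):

```js
// '😀' (U+1F600) is outside the Basic Multilingual Plane, so a DOMString /
// JavaScript string represents it as a surrogate pair of two 16-bit units.
const s = "a😀b";

console.log(s.length);                       // 4  -- UTF-16 code units (DOM 3 counting)
console.log([...s].length);                  // 3  -- Unicode code points
console.log(s.charCodeAt(1).toString(16));   // "d83d"  -- the high surrogate only
console.log(s.codePointAt(1).toString(16));  // "1f600" -- the full code point
```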

The normative text in the SVGTextContentElement section appears to contradict that in the section discussing attribute-value mapping for <tspan>, in that the former uses UTF-16 units while the latter uses XML characters.

SVG 2 attempts to make some clarifications:

An 'addressable character' is defined, which applies both to mapping attribute values to characters and to character counting by getNumberOfChars():

https://www.w3.org/TR/SVG2/text.html#Definitions

Addressable characters are counted in UTF-16 code units, after white-space collapsing. They do not include characters in elements whose 'display' value is 'none'.

https://www.w3.org/TR/SVG2/text.html#Definitions

Both these conditions seem to be a change from SVG 1.1 with respect to attribute value mapping.

https://www.w3.org/TR/SVG2/text.html#TSpanNotes

https://www.w3.org/TR/SVG2/text.html#InterfaceSVGTextContentElement

Tests

Test of UTF-16 code point counting:

http://tavmjong.free.fr/SVG/positioning-001.svg

Firefox 61 passes this test; Edge 15, Chrome 50/68, Android 5, and iOS 8.3 fail. The latter do the mapping by Unicode code points. Firefox and Chrome (the only ones tested) report the correct number of characters from getNumberOfChars() (see the JavaScript console output).

I think most developers would expect mapping to be by Unicode code points and propose that we ask Firefox to switch and change the spec.
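
For reference, the kind of check these tests rely on looks roughly like this (the element id is made up for the sketch; the real tests live at the URL above):

```js
// Compare getNumberOfChars() with UTF-16 and code-point counts for a <text>
// element. Per SVG 1.1 / DOM 3 the method counts UTF-16 code units; most
// engines nevertheless map x/y/dx/dy/rotate values by Unicode code points.
const text = document.getElementById("test-text");   // hypothetical <text> id
const content = text.textContent;

console.log(text.getNumberOfChars());   // spec'd as UTF-16 code units
console.log(content.length);            // UTF-16 code units
console.log([...content].length);       // Unicode code points
```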

Test of effect of 'display:none':

http://tavmjong.free.fr/SVG/positioning-002.svg

Firefox 61 fails this test (as does Inkscape, which doesn't handle 'display:none'). Edge 15, Chrome 50/68, Android 5, and iOS 8.3 pass. Firefox includes characters in an element with 'display:none' in the count.

I believe Firefox does the right thing here. The mapping of values in an attribute shouldn't depend on a CSS value that can be changed on a whim. However, as Firefox is the odd one out, it might be more prudent to follow the behavior of the other browsers and the spec as currently written.

@css-meeting-bot
Member

The SVG Working Group just discussed Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes, and agreed to the following:

  • RESOLUTION: Assignment of multi-value text layout attributes (x, y, dx, dy, rotate) should be according to Unicode codepoint characters, not UTF-16 blocks.
  • RESOLUTION: Do not change the spec for character values with regards to display: none.
The full IRC log of that discussion <krit> topic: Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes
<krit> GitHub: https://github.com//issues/537
<krit> Tav: did more investigation. SVG 1.1 aside, what browsers other than Firefox do is count by Unicode code points and not by UTF-16
<krit> AmeliaBR: so an emotion just gets one value of the attribute?
<krit> Tav: look down in the issue and there are a couple of tests. The 1st example demonstrates that.
<krit> Tav: see the 5 chars at the bottom and they get positioned by unicode points.
<krit> http://tavmjong.free.fr/SVG/positioning-001.svg
<AmeliaBR> s/emotion/emoji/
<krit> Tav: so the empty boxes mean you don't have the necessary font installed
<krit> AmeliaBR: in chrome the red and black versions don't line up. What does it mean?
<krit> Tav: if they line up then the chars get positioned following UTF16
<krit> AmeliaBR: seems to be the useful thing to do especially since most implementations do this
<krit> krit: doesn't work in Ai yet because of UTF16 issues on import
<krit> AmeliaBR: you can try to use entities so that you can comment on the issue what Ai is doing
<krit> krit: will do
<krit> AmeliaBR: what about the DOM methods
<krit> Tav: if you open the console for the tests and look at the output... the DOM methods DO use UTF16.
<krit> AmeliaBR: which seems less useful
<krit> AmeliaBR: there are other DOM methods which read back the actual position and show how the actual layout happens. Those would still use UTF16 but would match up what actually is going to get used.
<krit> AmeliaBR: we need to clearly test what browsers are doing but if browsers use actual unicode characters then...
<krit> krit: To clarify: Browsers use Unicode code points for actual layout but UTF16 for DOM methods?
<krit> Tav: sounds correct.
<krit> AmeliaBR: If you say "give me character 2" it should return which character it is part of taking glyphs and everything into account already.
<krit> AmeliaBR: including UTF16 encoding
<krit> Tav: I think SVG 1.1 actually specs as browsers but Firefox implement it.
<krit> Tav: I think Cameron added a clarification.
<krit> krit: What testing is missing? Tav seemed to have some part of it.
<krit> AmeliaBR: I'd like to see the other DOM methods that read back position cross-browsers.
<krit> AmeliaBR: especially if there are compatibility issues for files that were exported to SVG and now would get positioned incorrectly on reading back. Mostly the non-web use cases.
<krit> krit: Ai would not use any of the DOM methods but I can provide feedback to the visual output.
<krit> AmeliaBR: the description should match up with SVG 1 and most implementations.
<krit> krit: we could reconsider later and resolve now.
<AmeliaBR> Proposed: Assignment of multi-value text layout attributes (x, y, dx, dy, rotate) should be according to Unicode codepoint characters, not UTF-16 blocks.
<krit> RESOLUTION: Assignment of multi-value text layout attributes (x, y, dx, dy, rotate) should be according to Unicode codepoint characters, not UTF-16 blocks.
<krit> AmeliaBR: there was another part of the issue how to collapse whitespaces. Any proposal on that one?
<krit> Tav: from a user perspective: if you change a CSS property the characters move around.
<krit> AmeliaBR: that is consistent how display: none works in CSS layout. In comparison visibility: hidden.
<krit> Tav: the hidden one would use a gap in the text
<krit> AmeliaBR: so you think there should be a way where automatic layout adjusts but per character markup still applies regardless of the overall layout
<krit> Tav: yes
<krit> AmeliaBR: especially on manual kerning you wouldn't want to match the characters to other characters.
<krit> Tav: right. This is unpredictable in some cases.
<krit> Tav: markup values should be interpreted differently from CSS layout ideally. From a practical use case it might not be relevant.
<krit> Tav: the fact that every one does it as speced except Firefox it might not make it worth changing anyway.
<krit> krit: in the future there are alternatives to kerning with CSS but positioning characters individually is still popular like for iWorks on the cloud.
<krit> AmeliaBR: The workaround would be to put the char positioning on the individual span elements directly rather than the top text element. Would help on display none.
<krit> AmeliaBR: I agree with your conclusion that we should follow the majority of implementations.
<krit> Tav: that is how it is speced in SVG2.
<krit> AmeliaBR: ...and follows previous resolutions.
<krit> proposed RESOLUTION: Do not change a previous resolution for character values with regards to display: none.
<krit> AmeliaBR: could you check if there might be issues on Firefox?
<krit> RESOLUTION: Do not change the spec for character values with regards to display: none.
<AmeliaBR> Here's a Firefox issue re display: none https://bugzilla.mozilla.org/show_bug.cgi?id=1141224
<AmeliaBR> Will need a new issue once the spec for unicode vs UTF-16 is ready.

@AmeliaBR
Contributor

AmeliaBR commented Sep 1, 2018

Hi Tav. I'm reviewing the issues related to your open PR, and I noticed this relevant comment you made back in 2016:

The definition of "character" is from SVG 1.1. I believe it is meant to correspond to a Unicode point. In terms of input, a 'u' with a combining '`' would be two points while using the preformed 'ù' is one point. This has mostly to do with how the 'x', 'y', ... attributes are matched to the input.

Did you do any tests about whether browsers normalize these types of strings before assigning layout attribute values?
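
For anyone following along, a quick illustrative JavaScript check of the two encodings and NFC normalization (the strings are purely illustrative):

```js
const precomposed = "\u00F9";    // 'ù' as a single code point (U+00F9)
const combining   = "u\u0300";   // 'u' followed by U+0300 COMBINING GRAVE ACCENT

console.log([...precomposed].length);                      // 1
console.log([...combining].length);                        // 2
console.log(combining.normalize("NFC") === precomposed);   // true
```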

@AmeliaBR
Contributor

AmeliaBR commented Sep 1, 2018

Ok, to answer my own question, here's a test: https://codepen.io/AmeliaBR/pen/72dceba63f82c5433ee6ca6f8be5304a/

The first two accented characters are single codepoint characters, the second two use combining characters, and the final one is just me stacking a whole bunch of combining characters together.

Results:

  • Firefox (v63) treats the combined accent+base character in the way the spec defines for ligatures: it gets laid out as a single character, but a dy value is consumed by the accent and accumulates in the position of the following character.
  • Chrome (v68) and Safari (v11) lay out the combining accent as its own character, offset by the dy value assigned to it.
  • Edge (v17) lays out the combined accent+base character the exact same as the single-codepoint accented characters (one unit, one dy value), even for accent combinations that can never be normalized to a single codepoint.

I'm going to make a firm argument that the Chrome/Safari behavior (shifting the combining accent relative to its base character) is wrong. But I'm not sure whether there is any interest in adopting the Edge behavior, which is probably more intuitive for authors, at least for the cases where the combining accent looks identical to a single codepoint version.

@fsoder

fsoder commented Sep 1, 2018

From Blink's PoV I'd agree that the current behavior is incorrect, and I think we'd be happy to adopt the Edge behavior (assuming that is "per grapheme cluster".) I agree with the statement that that behavior is likely the most intuitive for authors.

@AmeliaBR
Contributor

AmeliaBR commented Sep 3, 2018

In order to spec the Edge behavior, we'd need a clear way to define how characters (codepoints) get grouped into layout units. And it would need to be something that can be un-ambiguously defined solely by the character content, not based on a font. (Because we wouldn't want the assignment of layout attributes to characters to vary according to which font is used.)

Does the "typographic character" term as defined in CSS 3 & referenced in SVG 2 meet that requirement? Or does it also include font-specific groupings such as ligatures? @Tavmjong @fantasai

Maybe it would be better to directly reference Unicode character classes, e.g. to specify that combining/modifier codepoints get skipped over when assigning layout characters. The result might not be as smart about language-sensitive groupings, but it would be more clearly testable for consistent results between user agents.

@AmeliaBR
Contributor

AmeliaBR commented Sep 3, 2018

Not an expert on Unicode, but I think what we'd want is the definition of a "base character":

Base Character. Any graphic character except for those with the General Category of Combining Mark (M). (See definition D51 in Section 3.6, Combination. [PDF]) In a combining character sequence, the base character is the initial character, which the combining marks are applied to.

So then the SVG rules would assign attribute values to the Unicode base characters in the text. Actual layout would still need special rules for ligatures and other clusters which are laid out as a whole based on font-specific rules.

Upside: This is a good balance of intuitive and unambiguous.
Downside: There's no easy JS way (as far as I know) to identify how many "base characters" there are in a string.
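
With Unicode property escapes in regular expressions (where available), something along these lines could approximate that count. It is only a sketch: it treats every non-mark code point as a base character, which is looser than the Unicode definition quoted above.

```js
// Rough count of Unicode "base characters": code points that are not
// combining marks (General Category M). Control characters etc. are not
// filtered out, so this approximates rather than matches the exact definition.
function countBaseCharacters(str) {
  let count = 0;
  for (const cp of str) {               // iterates by code point
    if (!/\p{M}/u.test(cp)) count++;
  }
  return count;
}

console.log(countBaseCharacters("\u00F9"));   // 1 (precomposed 'ù')
console.log(countBaseCharacters("u\u0300"));  // 1 (base + combining accent)
```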

@fsoder

fsoder commented Sep 3, 2018

I think using (extended) grapheme clusters (UAX#29) would be better in that case. They can be determined from code points. The "determine from JS" bit isn't solved there yet (I think...), but there are proposals [1] (and probably polyfills) to make that functionality available.

[1] https://tc39.github.io/proposal-intl-segmenter/
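
Assuming the Segmenter proposal (or a polyfill) is available, counting extended grapheme clusters would look roughly like this:

```js
// Count extended grapheme clusters via the proposed Intl.Segmenter,
// falling back to code-point counting where it isn't implemented.
function countGraphemes(str) {
  if (typeof Intl !== "undefined" && Intl.Segmenter) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return [...segmenter.segment(str)].length;
  }
  return [...str].length;   // fallback: Unicode code points
}

console.log(countGraphemes("e\u0301"));   // 1 grapheme cluster
console.log([..."e\u0301"].length);       // 2 code points
```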

@svgeesus
Contributor

typographic character (from CSS Text 3 and SVG2) is identical to UAX29 extended grapheme cluster:

Unicode Standard Annex #29: Text Segmentation defines a unit called the grapheme cluster which approximates the typographic character. A UA must use the extended grapheme cluster (not legacy grapheme cluster), as defined in [UAX29], as the basis for its typographic character unit.

@css-meeting-bot
Member

The SVG Working Group just discussed Character counting on dx/dy properties, and agreed to the following:

  • RESOLUTION: Complex script should be rotated and moved together
The full IRC log of that discussion <krit> topic: Character counting on dx/dy properties
<krit> GitHub: https://github.com//issues/537
<krit> chris: Correct behavior is what Edge does I think
<krit> chris: (describes Edge behavior as written in the issue)
<krit> chris: This is using CSS3 Text typographic characters.
<krit> chris: are all implementations agreeing to do what edge does.
<krit> Tavmjong: that requires that you have a library or some way of knowing what clusters go together
<krit> Tavmjong: So ppl with different libraries should have the same behavior.
<krit> Tavmjong: for predictability, using unicode characters might be more predictable.
<krit> chris: kind of
<krit> AmeliaBR: it is more author friendly to use typographic character but adds more complication to implementation
<krit> Tavmjong: I don't know of a library that would be able to do this right now.
<krit> AmeliaBR: many rendering implementations that need 3rd-party libraries are in the same position.
<krit> Tavmjong: we use Tango.
<krit> chris: I think Tango supports it. At least Freetype does that.
<AmeliaBR> s/Tango/Pango/
<chris> https://mail.gnome.org/archives/gtk-app-devel-list/2008-May/msg00083.html
<krit> Tavmjong: my guess is what Pango does is relying on the information in the font
<krit> chris: hm, not so sure if that is the case.
<chris> suggest asking Behdad
<krit> Tavmjong: I am not convinced of the Edge behavior right now. I don't think we can implement it right now.
<krit> Tavmjong: My testing showed that everyone was using unicode points.
<krit> AmeliaBR: Based on my browser testing, Edge is the only one using glyph clusters, FF uses unicode character but lays them out by glyphs, Blink and WebKit separate accents from their base characters.
<krit> RESOLUTION: Complex script should be rotated and moved together
<krit> Tavmjong: the question is about counting now.
<krit> krit: do you think you can check and get back to the WG with your implementation results?
<krit> Tavmjong: I think I can get some data.
<chris> I think Harfbuzz does the UAX29 segmentation https://lists.freedesktop.org/archives/harfbuzz/2015-September/005083.html
<krit> AmeliaBR: maybe ping a few other ppl on the issue for feasibility. I think FF uses Pango on some platforms too.
<krit> Tavmjong: that would be a changed behavior to SVG 1.1
<krit> chris: yes it would
<krit> AmeliaBR: we have inconsistency anyway.
<krit> Tavmjong: I'll look into it by next week
<krit> AmeliaBR: we already have resolutions on the simpler cases

@dirkschulze
Contributor

From Adobe's perspective, we would prefer a definition that works across specifications. CSS3 typographic characters seem to make the most sense.

@Tavmjong
Contributor Author

@r12a Could you comment on this issue?
We are debating between two ways of counting characters:

  1. Using Unicode code points (as per SVG 1.1).
  2. Using Extended Grapheme Clusters (EGC) per UAX#29.

I'm concerned that (2) requires SVG renderers to be able to determine EGCs for all scripts in order to reliably apply the attributes.

@Tavmjong
Contributor Author

CSS 3 Typographic characters: https://www.w3.org/TR/css-text-3/#characters
Note that Example 1, second point, gives a case where the Typographic Character is different depending on the operation (spacing vs. line-breaking). Which definition would apply to mapping attribute values?

@css-meeting-bot
Member

The SVG Working Group just discussed Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes.

The full IRC log of that discussion <AmeliaBR> Topic: Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes
<AmeliaBR> github: https://github.com//issues/537
<AmeliaBR> Tav: After investigation, I'm even less comfortable switching to "extended grapheme cluster" or "typographic character" for counting, because it's just not clearly defined.
<AmeliaBR> ... Definitions can vary according to the particular use case, e.g. line breaking vs layout.
<AmeliaBR> ... Nice in principle, but needs a really expert approach.
<AmeliaBR> Chris: So how would authors handle, e.g., combining accent characters? Would they need to insert, e.g., an extra 0 value in dx?
<AmeliaBR> Tav: Yes, as defined in SVG 1.
<AmeliaBR> Chris: It's a bit of a pain for authors, who often don't have transparency about whether the platform is using precomposed accents or not.
<AmeliaBR> ... Have you got a response back from Behdad?
<AmeliaBR> Tav: Not yet. I also pinged @r12a (Richard Ishida) on the issue.
<AmeliaBR> Amelia: Goal should be to balance best for linguistics with something that can be reliably implemented.

@css-meeting-bot
Member

The SVG Working Group just discussed Update on character counting for layout attributes on text/tspan elements, and agreed to the following:

  • RESOLUTION: Keep unicode code point for now until we get feedback from implementers. Keep previous resolution.
The full IRC log of that discussion <krit> topic: Update on character counting for layout attributes on text/tspan elements
<krit> krit: Tav, saw you discussed on the mailing list?
<krit> Tavmjong: got a response. He said you can get the breaking points from pango. Still not sure if that is the easiest way to do. He wants to avoid the CSS wording because that depends on context.
<krit> Tavmjong: The spec says that breaking may depend on context. So you'd break at different points. Unicode might work but I'd not say this is the way to go.
<krit> krit: would you approve if we use CSS and ask to clarify what context awareness means?
<krit> Tavmjong: if we have a set of numbers that need to apply to the same groups of chars. And in CSS3 it depends on the context.
<krit> krit: it'd be great to understand what the context is
<krit> AmeliaBR: we had this issue with white space collapsing.
<krit> Tavmjong: In the email he says that he likes the Edge behavior better with the cluster selection. Pango returns an array of well-defined clusters in Unicode.
<AmeliaBR> Behdad's reply: https://lists.w3.org/Archives/Public/www-svg/2018Sep/0018.html
<krit> Tavmjong: if we are to switch from unicode code points, it would be the thing to switch to.
<krit> krit: could it happen that :first-letter selector can have a different meaning on layout and rendering?
<krit> AmeliaBR: :first-letter has a different set of settings.
<krit> AmeliaBR: for layout it is a different and predictability is more important.
<krit> Tavmjong: my suggestion would be to leave unicode code points and add a note with a request of comments from implementers.
<krit> AmeliaBR: the limitation of leaving it would be the inconsistencies, and we cannot file bugs on browsers until we decide how to go forward.
<krit> krit: The CSS has more text experts... is that something we should bring it up there or is it completely independent of CSS and its definition of typographic characters?
<krit> Tavmjong: I think it is independent. We can not use typographic chars from CSS since you might break at different positions dependent on the context. That would be unpredictable and not consistent.
<krit> Tavmjong: so either use code points or Edge's behavior of clusters. (which is not known detaul)
<krit> s/detaul/detail.
<krit> chris: surrogates are in UTF16 and 2 sets allow defining one character and older implementations do not understand this
<krit> AmeliaBR: this is how we even got into it
<krit> Tavmjong: only FF supports this but no one else.
<krit> AmeliaBR: we are going to file issues against specs. We need to decide on it to fix other issues.
<AmeliaBR> github: https://github.com//issues/537
<chris> https://en.wikipedia.org/wiki/UTF-16#U+10000_to_U+10FFFF
<krit> krit: Maybe going with Tavs proposal and ask for browser input would unblock us for now.
<krit> krit: How can we bring this to their attention?
<krit> AmeliaBR: Tav, could you go through the text that may need changes and show how it would affect output if we are going to change?
<krit> Tavmjong: I can create a PR with the changes.
<krit> RESOLVE: Keep unicode code point for now until we get feedback from implementers. Keep previous resolution.
<krit> RESOLUTION: Keep unicode code point for now until we get feedback from implementers. Keep previous resolution.
<krit> AmeliaBR: Can it handle multi Byte characters and can it handle the 2nd issue?
<krit> chris: both do not affect western text
<krit> Tavmjong: emoji is a good example
<chris> s/do not/*do*
<krit> Tavmjong: some emojis use colors?
<krit> chris: exactly, you may need to combine characters.
<krit> Tavmjong: maybe w good way to test
<krit> s/maybe w/maybe a/
<krit> Tavmjong: chris, could you send me an example with emojis? Then I'd create a test out of it.
<krit> chris: yes

@r12a added the i18n-tracker label Oct 2, 2018
@r12a

r12a commented Oct 2, 2018

Sorry to be late to the party. (Btw, if you add an i18n-tracking label to an issue, it should pop up in our WG daily notifications, so others may have seen while i was travelling. That may help next time.)

This is an interesting discussion. I don't think i have a clear answer for you, but i may be able to help a little. You may have found it useful to refer to some new material, added recently to one of our articles, that describes code points vs grapheme clusters vs typographic character units, however i think you probably understand most of that stuff now. Note in particular that i believe you have correctly identified that the CSS typographic character unit is very contextually dependent. Here are a few other thoughts from me, off the top of my head...

First, i think it was always a BAD MISTAKE to ever define strings in terms of UTF-16 code units. In order to apply offsets per Amelia's test you'd have to be aware of which characters were supplementary chars and which weren't in order to create a step of characters. The same applies for counting things, since you never want to separate the two UTF-16 code units that make up a single character.

If you go with grapheme clusters, users may still get some odd effects unexpectedly. Take the following example in Bangla: kshī (ক্ষি) is made up of two grapheme clusters. If you were creating Amelia's stepped character display, you'd end up with

[screenshot: ক্ষি rendered as two separately stepped grapheme clusters]

rather than all grouped together like

[screenshot: ক্ষি kept together as a single stepped unit]

The reason this isn't taken care of by Unicode grapheme cluster rules is that it's tricky. What constitutes a user-perceived character in this case depends on which script is being used, and to an extent on what the font does too, since it's only a single user-perceived character if the sequence forms a conjunct (ie. the glyphs are combined into a unit).

Apart from that, I'd certainly like to be able to highlight code points sometimes rather than grapheme clusters - eg. when colouring diacritics or other combining characters in educational material, or even sometimes when explaining grapheme clusters to be able to colour each component part differently!

Of course, one encounters similar problems with code points. The stepped character display would look even worse if it showed up as

[screenshot: ক্ষি broken apart at each code point]

On the other hand, if you wanted to explain to someone what characters make up that conjunct (perhaps with horizontal movement rather than vertical) this could be quite useful.

It seems to me that perhaps a stepped character display like Amelia's test would probably always need to be hand crafted, so that the right things stick together(?)

However, counting characters is perhaps something else. As i said before, i wouldn't want to use UTF-16 code units for counting, any more than i'd use bytes. I also think that grapheme cluster counts don't give enough precision for some use cases, and it's possible that the rules for what constitute a grapheme cluster may be extended too in the future. I think that code points are probably the best way to go.

As far as emoji go, here we are entering a world where the question of what constitutes a unit becomes even further complicated. This is because an emoji picture can be made up of many component parts. Perhaps a useful example can be found in the slides i just put together for Paris Web – see the juggling girl and family emojis at https://www.w3.org/International/talks/1810-paris/index.html#truncation.

[screenshot: the juggling-girl and family emoji examples from the linked slides]

I don't know how helpful all that is, but hopefully a little.

@dirkschulze
Contributor

I am including @litherum in this discussion. He participates in the process around the https://drafts.css-houdini.org/font-metrics-api Houdini specification. Maybe he has some additional feedback.

@dirkschulze
Contributor

Including the spec authors of the Font Metrics API, @eaenet and @kojiishi, in this discussion as well.

@litherum
Contributor

We should standardize on either caret positions or grapheme clusters. Code units or code points are almost certainly wrong. There should be no difference between é and e + combining acute accent

@r12a

r12a commented Nov 13, 2018

As far as I understood, @r12a, @Tavmjong and @fantasai support unicode code points because they are more robust and stable.

For general counting of characters, counting code points is often the best, since it's more reliable.

For automatically segmenting text in order to display it, i think i prefer Edge's behaviour, which seems to be using grapheme clusters augmented with tailoring rules to capture full indic conjunct-based orthographic syllables in devanagari. (I don't know what secret sauce they are using, but i added an extra test, and note that it also avoids treating Tamil consonant clusters as a unit (which is good, see w3c/iip#18 for details, if you need them)).

@Tavmjong
Contributor Author

@litherum Inkscape makes heavy use of multiple values.

@dirkschulze
Contributor

If we should not get to an agreement, we can keep the exact character counting algorithm unspecified explicitly. In this case we would (ideally) hint which output is preferred.

However, I really hope we get to a resolution that eventually gets implemented interoperably.

@r12a

r12a commented Nov 29, 2018

Thinking more about this, i came across another potential issue. Complex scripts use things like RLI (specifies base direction for bidi text), ZWNJ (stops cursive joining in scripts like Arabic & breaks conjuncts in scripts like Devanagari), FVS (applies a specific variant shape to a Mongolian letter). These are all invisible characters in Unicode, and i believe that none of them are combined with other characters when grapheme cluster segmentation takes place.

If we are creating offsets by automatically counting grapheme clusters or code points, the result would presumably be a gap where one of these characters appears. Perhaps one way to deal with that is to establish an exclusion list for this type of character, however i don't know whether or not that would come with its own problems.

Whatever is done wrt spatial placement of glyphs, the effects of those characters need to be applied to the appropriate adjacent character.

Here's another test that looks at what browsers do with the invisible characters above.

Firefox leaves a gap for ZWNJ and FVS, but not for LRE/PDF, but does apply the expected effects, except for bidi reordering (the AB is in the wrong place).
Chrome leaves gaps for all, and doesn't apply expected effects for ZWNJ and FVS, but does for bidi.
Edge leaves a gap for ZWNJ only, and applies expected effects for ZWNJ and FVS, but not bidi.

@r12a

r12a commented Nov 29, 2018

Looking at the same test apart from the invisible characters, this test shows some interesting bidi behaviours.

  1. The right-to-left cascading seen in Edge and Firefox looks wrong to me. Even though these are RTL characters, i would have expected the offsets to all be from left to right. The changes in direction for "ab" seem odd and unwarranted.
  2. The order of characters and the placement of the 'ab' relative to the rest of the Hebrew text is different between Firefox/Edge and Chrome.

@r12a

r12a commented Nov 29, 2018

I don't know what to suggest. This segmentation issue is more of a problem for this feature than for some other contexts, due to the fact that visual placement and separation is involved. Some options that come to mind are:

  1. Add a note to say that automatically positioning parts of a text string in this way is likely to be problematic for complex scripts, and authors may not be able to use this feature effectively much of the time for those scripts. That doesn't seem a very satisfactory solution for users.
  2. Change the syntax, so that the string in the text element becomes a list, where the content author groups things together as they want, rather than trying to figure things out automatically. That's probably not a welcome change to the spec.
  3. Do some more in-depth research around the segmentation process across various scripts to understand whether there is a standard set of rules that can be applied here so that things just work (Edge seems to get close already, but we need to do further testing, and find out what their secret sauce is.) This could take a while.

@BigBadaboom
Contributor

Is it maybe time to assign this to the too-hard basket, and go with @r12a's option 1 and 2? That is, deprecate the multi-value attributes, and instead recommend that authors use <tspan> for the situations when the code point algorithm fails?

However that leaves a problem regarding the rotate attribute. I know the text layout algorithm is already quite complicated, but one possible solution would be to allow transform on <tspan> elements.

https://codepen.io/PaulLeBeau/pen/qQQPag

@fantasai

I don't think this is too hard. You pair off coordinates and characters using codepoints, since those are stable for counting, and when there are multiple codepoints that belong to a single typographic character unit, you render them together like Edge does (ideally), handling ignored coordinates the same way ligatures are handled per @AmeliaBR’s comment above.

@fantasai

See Tav's proposal in #260 (comment) ... this is what we should do (leaving codepoint vs UTF-16 byte pair to be sorted out by the SVGWG in consideration of compat).

@litherum
Contributor

litherum commented Jan 2, 2019

The general problem of assigning each character a position in handwritten scripts is kind of meaningless. E.g. in Arabic, should we use Kashida to make the text flow from one character to the next? I wish I was around when this feature was being introduced, so I could have argued that authors should do their own counting by putting each character inside its own <text>. I wish we could deprecate this feature.

@svgeesus
Contributor

svgeesus commented Jan 7, 2019

@BigBadaboom said:

Is it maybe time to assign this to the too-hard basket, and go with @r12a's option 1 and 2? That is, deprecate the multi-value attributes, and instead recommend that authors use <tspan> for the situations when the code point algorithm fails?

@litherum said:

I wish I was around when this feature was being introduced, so I could have argued that authors should do their own counting by putting each character inside its own <text>. I wish we could deprecate this feature.

Apart from tspan vs. text (and tspan is preferable so you can select across the element boundary) these seem similar to me. Deprecate the under-specified lists of dx and dy, and define them to use UTF-16 byte pairs for counting. Tell authors to use explicit markup when they want something different.

Leave some future spec to add any additional counting methods and text segmentation strategies and so on (tests from @r12a show there is little interop here, for the non-trivial cases). I do want to see this resolved, but I don't see a resolution (plus associated browser updates) inside a couple of months.

@AmeliaBR
Contributor

AmeliaBR commented Jan 7, 2019

There is interop for single-byte characters. Which is an awful lot of web content, and this feature is probably even more commonly used in non-web content. Deprecating something that already works doesn't seem helpful.

@css-meeting-bot
Member

The SVG Working Group just discussed Character counting, and agreed to the following:

  • RESOLUTION: counting by use real unicode code points (not UTF16 blocks) Ignore/combine attribute values that are assigned to code points that are clustered with a previous character
The full IRC log of that discussion <krit> topic: Character counting
<krit> github https://github.com//issues/537
<krit> github: https://github.com//issues/537
<myles> hi
<krit> myles: I think the issue has 3 proposals
<krit> myles: either use code units, grapheme clusters or a complicated thing that takes properties into account
<krit> AmeliaBR: 1st option is UTF16 blocks or code points
<krit> AmeliaBR: no one seems to implement blocks
<krit> AmeliaBR: emojis would count as 2 for instance
<krit> myles: emojis are most compelling for grapheme clusters
<krit> AmeliaBR: none of the proposals were about breaking apart complex glyphs for layout
<krit> chris: blocks are not ideal but I thought browsers use them
<krit> AmeliaBR: some do
<krit> myles: char position is not block or grapheme cluster or anything. So not the best to describe the issue
<myles> dx="3 4 5 6"
<myles> "hi❤️k"
<krit> myles: if I got a string (typing above)
<krit> myles: heart is 2 code points
<krit> myles: 8 is going with 3 is what you saying_
<krit> chris: 5 is heart
<krit> chris: 7 would affect the k
<krit> AmeliaBR: if it is 2 code points it would accumulate into the next set of characters
<krit> myles: how do you know it is 2 items in the list? because it has 2 code points?
<krit> myles: how do you know what a unit is (like the heart)
<krit> chris: I know what a grapheme cluster is when I see it but technically it might mean different things.
<krit> chris: Tav for instance was scared that it might mean different things depending on the properties of the char or font
<krit> AmeliaBR: Especially ligatures that are font specific make it more difficult
<krit> myles: a ligature in the font would still be 2 grapheme clusters with multiple code points but that affects the rendering only
<krit> chris: If you have a ffi it would be one cluster
<krit> myles: it would force it to break the ligature
<krit> chris: I see
<krit> AmeliaBR: we have different rules in SVG for ligatures
<krit> myles: the most natural way: for things like Arabic there is not a straightforward way to do it. Best we can do is try to do what CSS for hyphenation does. You break the text at the best place but the breaking location is shaped as if it was not broken. Next to hyphenation you get the medial form. We should use the same mechanism
<krit> chris: Does WebKit have access to it from the CoreText engine?
<krit> myles: we have access to it and think we should do it
<krit> AmeliaBR: I'd like to see the proposal written but it is less important than counting
<krit> AmeliaBR: if breaking a ligature differs between implementations then just that glyph is off
<krit> AmeliaBR: but if counting differs the entire text might look different
<krit> myles: with the proposal to ignore / lump in items in the list when it conforms to the first code point in the grapheme cluster... In that proposal the exact boundaries of the cluster might not count much
<krit> myles: so next char might get into the same place
<krit> AmeliaBR: that is what unicode code points make different
<krit> myles: so they would be local to where the engine chops up the text
<krit> myles: so heart would be at 5 and k would be at the same place still
<krit> AmeliaBR: if a browser does not recognize a given emoji sequence as a char, it would still do the counting consistently after the char
<krit> proposed RESOLUTION: counting use real unicode code points (not UTF16 blocks)
<krit> proposed RESOLUTION: counting by use real unicode code points (not UTF16 blocks) Ignore code points that are not part of a cluster group
<myles> proposed RESOLUTION: counting by use real unicode code points (not UTF16 blocks). Ignore code points that are not the first item in a cluster group
<myles> proposed RESOLUTION: counting by use real unicode code points (not UTF16 blocks). Ignore attribute values that are assigned to code points that are clustered with a previous code point
<AmeliaBR> proposed RESOLUTION: counting by use real unicode code points (not UTF16 blocks) Ignore/combine attribute values that are assigned to code points that are clustered with a previous character
<krit> RESOLUTION: counting by use real unicode code points (not UTF16 blocks) Ignore/combine attribute values that are assigned to code points that are clustered with a previous character
<krit> AmeliaBR: we should use a seperate issue to talk about the rendering proposed by myles
<krit> myles: I ll start the new issue
<krit> chair: krit
<krit> trackbot, end telcon

@r12a

r12a commented Jan 10, 2019

clustered with a previous character

Sounds to me like that means "use grapheme clusters for segmentation and pairing of text units". Is that right?

@litherum
Contributor

That isn’t right!

Each browser can choose whatever clustering they want. If different browsers choose different clustering, that particular character will be drawn differently, but the remainder of the string will still be shown in the right place.

@r12a

r12a commented Jan 11, 2019

Not sure i understand how that works. If browsers cluster components of a text string in different ways, won't that mean that the number of items to be displayed could differ from browser to browser, thus affecting the positioning of the content because we're dealing with offset movements?

For example, in the examples at #537 (comment) the Firefox clustering yields 18 lines of text, whereas the Edge clustering yields only 14. What am i missing?

@AmeliaBR
Contributor

@r12a

What's missing is that Edge's behavior would be incorrect according to the proposed spec clarification. In that example, the Firefox behavior would be (mostly) correct: each code point gets assigned a new value from the attributes, but for actual layout grapheme clusters are laid out as one.

If an author wanted the Edge rendering, they (or their authoring tool) would need to add extra placeholder values (0 relative change for dx/dy, or a repeated absolute value for x/y) for the codepoints that don't represent independent layout characters.

In general, the author is expected to position characters as makes the most sense for the language and design. The goal of the spec is to ensure the minimum possible discrepancy in renderings. Codepoints may not be the most intuitive, but they are unambiguous and will never change. (Unlike for example, newly introduced emoji clusters, which older browsers might not recognize as a single unit.) If the text rendering engine doesn't have a combined glyph for a character, the layout of those particular characters may be poor, but if all values are assigned by codepoint, then the remaining characters won't get shifted relative to the author's intent.

@litherum
Contributor

Right. The number of items in the position array should be equal to the number of code points in the string; they are parallel. When the browser chunks up the text, it accumulates all positions that a grapheme cluster (or whatever method the browser uses for chunking) includes into the same chunk. That way, the next character is still shown at the same place as it would have been shown if the browser chose a different chunking method.

You can also think of it as the set of positions is turned into a set of absolute positions, so each code point corresponds to an absolute position. Then, all the positions that correspond with the non-first-code-point in the cluster are ignored, so after the end of the cluster, the next item is shown at appropriate absolute position.
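
A rough sketch of that model, purely for illustration (the function name and the clusterStarts input are made up; a real engine would derive cluster boundaries from its own segmentation):

```js
// Pair x values with code points one-to-one, then drop the values that were
// assigned to non-initial code points of a cluster. Whatever segmentation an
// engine uses, the characters after the cluster keep their intended values.
function resolvePositions(str, xValues, clusterStarts /* Set of code-point indices */) {
  const positions = [];
  [...str].forEach((cp, i) => {                          // iterate by code point
    if (!clusterStarts.has(i)) return;                   // this value is ignored
    const x = i < xValues.length ? xValues[i] : null;    // null: no explicit x
    positions.push({ cp, x });                           // cp = cluster's first code point
  });
  return positions;
}

// "é" written as e + combining acute, then "f": the value 30 assigned to the
// combining mark is dropped, so "f" still gets 50 regardless of segmentation.
console.log(resolvePositions("e\u0301f", [10, 30, 50], new Set([0, 2])));
// → [ { cp: 'e', x: 10 }, { cp: 'f', x: 50 } ]
```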

@css-meeting-bot
Member

The SVG Working Group just discussed Character counting in text attributes, and agreed to the following:

  • RESOLVED: Do not change the previous resolution.
The full IRC log of that discussion <mstange> Topic: Character counting in text attributes
<AmeliaBR> github: https://github.com//issues/537
<mstange> AmeliaBR: To recap: text attributes for positioning, as we've been discussing, can have multiple values that can be applied to multiple characters.
<mstange> AmeliaBR: The first step is that we look at the attribute and assign each value to a different DOM character. In some cases that's very simple.
<mstange> ... But if you have more complex multi-byte characters, things get more confusing.
<mstange> s/characters,/characters or clusters,/
<mstange> AmeliaBR: "What is a character" becomes a debate.
<mstange> ... There are other definitions that use utf-16 blocks, which is not very useful. But beyond that, do you use unicode codepoints or do you combine and cluster things so that you have a combining accent character, are those the same character or different characters?
<mstange> ... We have a resolution from January which resolved that values in the array should be assigned based on unicode codepoints.
<mstange> ... The argument for that is that unicode codepoints are stable and won't be affected by whether a new cluster gets introduced, or whether a particular font supports a particular combining unit.
<mstange> myles: Other part of the resolution: We count based on code points, but we don't segment based on code points.
<mstange> myles: Let's say we have the string of code units "A" "B" "heart emoji" "red combining character". Now we also have an array of positioning values with four elements.
<mstange> ... Now we need to come up with a mapping. There are two parts to this resolution: When you count, you count code point by code point. And the second part is: You're allowed to disregard any positions assigned to any combining characters, because the combining characters don't get rendered on their own.
<mstange> ... We didn't want to have a situation where regular characters following a combining character end up in the wrong position because they get assigned the wrong value from the position array.
<mstange> ... This ensures consistency between browsers.
<mstange> AmeliaBR: It's not intuitive for hand-authoring. But the upside is that, outside of browser differences of the graphing cluster, the rest of the layout stays consistent from browser to browser. Once you get past the cluster that has the discrepancy, everything else is the same.
<mstange> nmccully: Let's say there is a browser with a shaping engine and one that is not. The browser that has a shaping engine will <missed>. The browser that *doesn't* have a shaping engine will presumably manually get positioning information and the combining red would be passed by itself to a table that gives a space.
<mstange> AmeliaBR: There is a multi-stage lining up process. The way we're proposing is: positioning values to code points is a one-to-one matching. The next step (matching code points to your shaping) is where things can be discarded because of ligatures or combinations etc.
<mstange> r12a: There are two issues. One, e.g. the word réd can have two code point representations, and an author cannot immediately see which one is used.
<mstange> heycam: If the positional values are coming from a graphical editor, the graphical editor knows what code points are used and can generate the correct arrays.
<mstange> AmeliaBR: Yes, it's only the hand authoring case that's hard.
<mstange> myles: Alternatively, we could specify that <missed> goes through normalization.
<mstange> r12a: I do not like the idea of normalizing my content. Sometimes I want things to be composed and put the accents afterwards.
<mstange> myles: Pragmatic / performance, might not be worth it.
<mstange> r12a: The other issue is that, for example if you have some Persian, sometimes a zero-width joiner is used at the end of a word to produce the right shape. Then you need the two characters to stay together.
<mstange> ... and those are not a graphing cluster.
<mstange> heycam: You would ignore the positioning value for the zero-width joiner.
<mstange> AmeliaBR: If you're intentionally putting the ZWJ <missed>, you still want contextual glyph selection.
<AmeliaBR> s/<missed>/to change to a medial glyph/
<mstange> heycam: CSS properties can change the effect of ligatures and other combinations, and you wouldn't want to have to adjust your positioning value array based on that.
<AmeliaBR> s/<missed> goes/the text content/
<mstange> nmccully: If you need backward compatibility with engines that don't understand the red combining thing, <missed>
<mstange> myles: A lot of people said "there is incompatibility and we have to deal with it."
<AmeliaBR> s/engine will <missed>/engine will get fewer glyph clusters from the engine than there are characters/
<mstange> nmccully: Are you protecting from a malformed SVG from a bad player?
<mstange> nmccully: Different browsers might get different results for cluster segmentation. Users want to get consistent positions everywhere. So we can't count positioning values based on the results from cluster segmentation. So we have to do the matching based on something in the source.
<heycam> mstange: seems like whenever you have parallel arrays, you should just have a single array of pairs
<heycam> ... why is this API necessary?
<mstange> AmeliaBR / myles: There is existing content that uses it.
<mstange> r12a: In Indic scripts you have conjuncts, you split a syllable at a time.
<mstange> ... (shows an example that has two graphing clusters that combine into one visible unit)
<mstange> myles: This is why we didn't want to specify what segmentation to use.
<AmeliaBR> s/graphing clusters/grapheme clusters/g
<mstange> AmeliaBR: r12a, are you ok with keeping the existing resolution?
<mstange> r12a: It's not pretty, but I understand why it's there, and I can't figure out anything better.
<mstange> r12a: (shows a testcase that renders differently in Firefox and Chrome, where diagonal Arabic text is rendered top-right to bottom-left in Firefox and top-left to bottom-right in Chrome)
<mstange> RESOLVED: Do not change the previous resolution.
<AmeliaBR> RRSAgent, make minutes
<RRSAgent> I have made the request to generate https://www.w3.org/2019/09/18-svg-minutes.html AmeliaBR
<prushforth> CG meeting is on channel #svgcg

@heycam
Contributor

heycam commented May 15, 2020

Someone brought up a test case that runs into this difference between Firefox (which still addresses individual UTF-16 code units, per the current spec text) and other browsers (which use whole Unicode characters). I just want to clarify that the previous resolutions are in line with Tav's original request to change Firefox to match Chrome and Safari by indexing on whole characters. I'm fine with that, and the patch to fix that is pretty easy. If that's right, I can file a PR to fix the "addressable character" definition.

@AmeliaBR could you confirm?

@aphillips added the i18n-needs-resolution label and removed the i18n-tracker label Sep 5, 2023
@aphillips

Several I18N needs-resolution issues were closed in favor of this issue. Adjusting our labels to track resolution accurately.
