Add section on Uniqueness and Equality #869
Closed
I oppose adding this, not because we shouldn't be explicit about how comparisons are done, but because this is not the right place to do it. We should be explicit about identifier equality and about string equality where these are actually used in the spec.
I also want to avoid requiring NFC at this level and in this way, because messages might not be in a Unicode encoding and because some implementers might object to being required to perform NFC inside of comparisons (vs. pre-normalizing values). There may also be functions that support non-normalized literals as operands or produce deliberately non-normalized output (for example, pseudo-translators that use combining accents to decorate ASCII might produce NFD output).
I would be more amenable to language such as:
I could see us adding:
Note well: I have a long (twenty-five plus) year history of wrestling with this issue and you can read the results of that in String Matching. I have no problem--and would encourage--requiring NFC (without case folding) in our namespace. But the cases need to be clear for implementers and we should not require implementers to be normalizing messages and strings on-the-fly.
As far as I can tell, the following are the "parts of the specification" that do string matching:
When putting this PR together, I did consider going into each of those and adding the NFC normalization there, but that's tricky in particular for the variable resolution and function lookup, as we've somewhat explicitly left their details out of the spec, so that an implementation can e.g. resolve `$foo.bar` by looking up the `bar` property of a `foo` object, or by deciding for itself how to look up the function for `:html:img`.
Would more explicitly enumerating here the list of spec parts where normalization happens be sufficient?
Hence this catch-all type of approach, which is intended to be sufficient to ensure that normalization is applied to comparisons, but it's not required to be externally visible, along with an explicit permission to solve the problem by normalizing everything.
I fear that the text you suggest would be misleading, as "code point-by-code point comparison" does not allow for normalization unless the strings in question have been previously normalized.
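To make the distinction concrete, here is a minimal Python sketch (not spec text; the helper names `codepoint_equal` and `nfc_equal` are made up for illustration). It shows that a plain code point-by-code point comparison treats the composed and decomposed spellings of the same text as different, and only agrees once both sides have been normalized:

```python
import unicodedata

# Two spellings of the same visible text.
composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def codepoint_equal(a: str, b: str) -> bool:
    """Plain comparison: Python compares str values code point by code point."""
    return a == b


def nfc_equal(a: str, b: str) -> bool:
    """Comparison that normalizes both operands to NFC first (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(codepoint_equal(composed, decomposed))  # the raw strings differ
print(nfc_equal(composed, decomposed))        # they agree after NFC
```

If the values were already normalized before they reached the comparison, the plain comparison would be sufficient.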
Some history is probably warranted here. W3C I18N for many years championed a concept called "Early Uniform Normalization" (EUN), which, in a nutshell, said "normalize all data values to NFC near the point of input so that comparisons can be fast and efficient".
It turns out that there are practical problems ensuring this. There are lots of places where denormalized data can creep in, such that people end up having to at least check data values before comparison.
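As a rough illustration of "checking data values before comparison", the following Python sketch uses the standard library's `unicodedata.is_normalized` (Python 3.8+) to avoid re-normalizing input that is already NFC. The function name `ensure_nfc` is an assumption for this example, not something any implementation is required to provide:

```python
import unicodedata


def ensure_nfc(value: str) -> str:
    """Return value unchanged if it is already NFC, otherwise its NFC form."""
    if unicodedata.is_normalized("NFC", value):
        return value  # the check alone is enough for most data
    return unicodedata.normalize("NFC", value)


already_nfc = "caf\u00e9"
needs_work = "cafe\u0301"

print(ensure_nfc(already_nfc) == already_nfc)
print(ensure_nfc(needs_work) == already_nfc)
```

The check still means the implementation has to carry normalization data around, which is the cost being weighed in this thread.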
Normalization of an MF2 message should not just be on the whole of the message as a string. The grammar of MF2 contains literal text that needs to allow non-normalized code point sequences.
I misspoke in saying that each comparison point would be where to specify normalization. What I should have said (and meant to say) is that the grammar of MF2 is sufficiently tight that there is a single "choke point" where we need to talk about normalization of values and it is the production `name`.
Notice that `name` is used to create variable names, option names, function names, unquoted literals, namespace names (and thus identifiers), and everything that isn't either an ASCII word (`.input`, etc.) or some punctuation (`{{`, `}}`, `{`, etc.) or whitespace.
We can simply say that `name` MUST be NFC or, when converted from another character encoding, must be normalized to NFC. This ensures that matching never needs to worry about normalization.
What about literals? Unquoted literals use `name`, so they'll be NFC. But what about quoted literals? These can and should allow non-NFC sequences. We do not want to normalize these, so that non-normalized sequences or values, which are occasionally useful, remain possible. Note well that the quotes (`|` or `{{`/`}}`) around literal sequences are not part of the literal. Thus `|\u0300|` does not treat the combining mark U+0300 as an extension of the `|` grapheme. (This is why you cannot normalize an MF2 message as a whole.)
Non-normalized literals, when used in an MF2 message as a value of a key, option, etc. behave as non-normalized values. They may be visually indistinguishable from normalized values and not match, a fact that is also true of lots of strings in Unicode that are normalized. This is rarely a problem (self-spoofing is a Bad Idea). Processing of MF2 messages needs to understand the boundary conditions when parsing.
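A small Python sketch of why whole-message normalization is a problem for quoted literals. The message string and the `literal_body` helper are hypothetical and deliberately simplistic (this is not an MF2 parser); the point is only that NFC applied to the entire source silently changes the contents of a quoted literal that was written in decomposed form on purpose:

```python
import unicodedata

# Hypothetical message source: a quoted literal holding "e" + U+0301 (decomposed).
message = "{|e\u0301| :string}"

# Normalizing the whole message as one string...
normalized_message = unicodedata.normalize("NFC", message)


def literal_body(source: str) -> str:
    """Return the text between the first pair of '|' quotes (toy helper, no escapes)."""
    start = source.index("|") + 1
    end = source.index("|", start)
    return source[start:end]


# ...rewrites the literal's value: two code points become one.
print(len(literal_body(message)))
print(len(literal_body(normalized_message)))
```

Normalizing only the `name` tokens would leave the quoted literal untouched.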
This is not misleading at all: it explicitly does not allow normalization of the strings in question at the point of comparison. If we have enforced EUN (as described above), we've made it irrelevant. However, EUN of `name` imposes the cost of carrying around a normalizer and doing normalization checking on implementations.
The alternative to EUN of `name` is to do what we've currently done: ignore the problem. This turned out to be what the Web (and internet at large) did, which is why charmod-norm is the way that it is. If we adopt that approach (or rather, keep that approach), then it is the responsibility of the user to ensure that their names are normalized (or not) and match each other (or not), because our grammar is normalization sensitive (just as it is case sensitive). There is no cost or burden on implementations in such a case, except as a source of frustration for end-users when values that are visually and semantically indistinguishable don't match.
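The two options can be contrasted with a Python sketch. Everything here is an assumption for illustration: the registries are plain dicts, `normalize_name` is a made-up helper, and the name used is arbitrary. With a normalization-sensitive lookup, a decomposed spelling of a registered name simply misses; with names normalized to NFC at the point of input, the same lookup succeeds using an ordinary code point comparison:

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize a name token to NFC at the point of input (illustrative helper)."""
    return unicodedata.normalize("NFC", name)


# Hypothetical function registry keyed by a name written in composed form.
registry = {"r\u00e9sum\u00e9": "handler"}

# The same name as a user might type it in decomposed form.
typed_name = "re\u0301sume\u0301"

# "Ignore the problem": lookup is normalization sensitive, so this misses.
found_raw = typed_name in registry

# EUN of name: normalize once when the name is read, then compare plainly.
found_normalized = normalize_name(typed_name) in registry

print(found_raw, found_normalized)
```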
While the Unicadett in me thinks NFC is the answer (and I fought for 20 years to make it the answer!), in practice I lost that battle and stuff mostly seems to work. If we "accept defeat" here too, we should insert text into our spec about here that says basically "avoid non-normalized name values: they work bad magick"