Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uncommon_codepoints: lint against 00B7 MIDDLE DOT in final position #120695

Conversation

Jules-Bertholet
Copy link
Contributor

@Jules-Bertholet Jules-Bertholet commented Feb 6, 2024

00B7 MIDDLE DOT '·' is considered by UTS 39 to be "Allowed - Inclusion" due to its presence in UTR 31 Table 3a - Optional Characters for Medial; as such, it is unsuitable for appearing in the final position of identifiers.

@rustbot label T-compiler A-diagnostics A-unicode

00B7 MIDDLE DOT '·' is considered by
[UTS 39](https://unicode.org/reports/tr39/#Identifier_Status_and_Type)
to be "Allowed - Inclusion" due to its presence in [UTR 31 Table 3a - Optional Characters for Medial](https://www.unicode.org/reports/tr31/#Table_Optional_Medial);
as such, it is unsuitable for appearing in the final position of identifiers.
@rustbot
Copy link
Collaborator

rustbot commented Feb 6, 2024

r? @pnkfelix

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-Unicode Area: Unicode A-diagnostics Area: Messages for errors, warnings, and lints labels Feb 6, 2024
@Noratrieb
Copy link
Member

Do we want to add special cases here? I would be surprised if middle dot was the only special case

@Jules-Bertholet
Copy link
Contributor Author

Jules-Bertholet commented Feb 6, 2024

MIDDLE DOT is the only Unicode character with this particular set of special circumstances:

  • is XID_Continue (specifically it is Other_ID_Continue, meaning that it normally would not qualify for use in identifiers based on its other Unicode properties, and is accepted only to preserve backward compatibility),
  • Is Allowed - Inclusion by UTS 39 due to listing in UTR 31 Table 3a as an optional medial character.

@Jules-Bertholet
Copy link
Contributor Author

Another special case that could be added to the lint is proper handling of ZWJ/ZWNJ, as recommended by UTS 55.

@pnkfelix
Copy link
Member

pnkfelix commented Feb 7, 2024

What are we trying to accomplish by allowing an interpunct (·, aka middle dot) with no linting whatsoever, in the middle of identifiers but not at the end?

I understand that the grammar of UAX31 indicates that medial characters are not meant to ever occur at the end of an identifier.

But we also explicitly have adopted the profile from UAX31 that leaves <Medial> empty. (And, I think, have implicitly put 00B7 MIDDLE DOT '·' into XID_Continue, right?)

  • Update 2024-02-08: I just confirmed that the lexer does indeed reject symbols with 00B7 as the initial character

In any case, I am not yet sure why we wouldn't just lint against this character (as an "uncommon unicode character") regardless of whether it occurs at the beginning, middle, or end of a symbol.

From my point of view, either its presence is confusing (and will be confusing regardless of context), or its presence is not confusing (regardless of context).

  • (The only case where I might see an exception to that point of view, where context would affect where it is confusing, is for native Catalan speakers?)

@pnkfelix
Copy link
Member

pnkfelix commented Feb 7, 2024

Another special case that could be added to the lint is proper handling of ZWJ/ZWNJ, as recommended by UTS 55.

Maybe I misunderstand what you are proposing; I thought Rust already rejects ZWJ/ZWNJ in identifiers (reference). Are you talking about other contexts where we need to lint that?

@Jules-Bertholet
Copy link
Contributor Author

Jules-Bertholet commented Feb 7, 2024

Rust already rejects ZWJ/ZWNJ in identifiers

Yes, and this is more strict than what the standard recommends. The recommendation is to allow it in a few special cases, and lint against it everywhere else.

@estebank
Copy link
Contributor

estebank commented Feb 7, 2024

ZWJ is indeed already not accepted as part of identifiers, it is considered whitespace by the lexer.

Could you expand on why 00B7 needs to be considered an uncommon codepoint only if it appears at the end of an identifier?

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a0d3e01111fffda4d0ff6f04923fb774

@Jules-Bertholet
Copy link
Contributor Author

It is considered whitespace by the lexer.

No, it is not valid whitespace either (does not have the Pattern_White_Space Unicode property).

Could you expand on why 00B7 needs to be considered an uncommon codepoint only if it appears at the end of an identifier?

It doesn't matter to me either way, I guess it depends on how strongly we value supporting Catalan. I can change the PR to lint against it in all positions if there's consensus in that direction.

@Manishearth
Copy link
Member

I don't think this is correct:

MIDDLE DOT is the only Unicode character with this particular set of special circumstances:

  • is XID_Continue (specifically it is Other_XID_Continue, meaning that it normally would not qualify for use in identifiers based on its other Unicode properties, and is accepted only to preserve backward compatibility),

I'm not sure what you mean by Other_XID_Continue. Do you mean that it is Other_Punctuation? There's a second character with this behavior: U+30FB KATAKANA MIDDLE DOT. See this set.

  • Is Allowed - Inclusion by UTS 39 due to listing in UTR 31 Table 3a as an optional medial character.

I don't think that's due to that table at all. I think it's direclty because it's used in Catalan.


Yes, and this is more strict than what the standard recommends. The recommendation is to allow it in a few special cases, and lint against it everywhere else.

Yeah, this was a deliberate choice: I didn't want to bog down the original RFC with the careful handling of ZW(N)J, so I left it an open question for the future. The lack of ZW(N)J should not be a signal about whether other thing should or shouldn't be allowed, since we could totally allow them in the future.


I definitely think we should take a more principled approach here. There's #120228 and it already proposes having a separate sub-lint that deals with confusables with syntax. While that issue doesn't itself ask for splitting lints, this would be a good reason to split the lint.

I think middle dot counting as "uncommon codepoints" goes against the spirit of the lint, which is about non-linguistic content. I would rather not single out Catalan on this until we have a principled approach. If people feel strongly about linting about this the lint should then catch the middle dot in all positions not surrounded by ls, since in Catalan the only legal place for the interpunct is in <l·l>. But my preference is to first fix #120228 (without changing what is linted about) and then we can get better at expanding sub-lints.

@Jules-Bertholet
Copy link
Contributor Author

Jules-Bertholet commented Feb 7, 2024

I'm not sure what you mean by Other_XID_Continue.

Other_ID_Continue, sorry

  • Is Allowed - Inclusion by UTS 39 due to listing in UTR 31 Table 3a as an optional medial character.

I don't think that's due to that table at all. I think it's direclty because it's used in Catalan.

Were that the case, it would be Allowed - Recommended, not Allowed - Inclusion (AIUI).

If people feel strongly about linting about this the lint should then catch the middle dot in all positions not surrounded by ls, since in Catalan the only legal place for the interpunct is in <l·l>.

How do you even define what is an "l"? What about case, diacritics, mathematical style variants, fullwidth etc…

@Manishearth
Copy link
Member

How do you even define what is an "l"? What about case, diacritics, mathematical style variants, fullwidth etc…

l or L, Catalan doesn't have diacritics on its ls, and the rest are not linguistic content.

@Manishearth
Copy link
Member

Other_ID_Continue, sorry

Ah, forgot about that.

Were that the case, it would be Allowed - Recommended, not Allowed - Inclusion (AIUI).

Ah, fair.

The background for this is actually probably that a lot of this standard is driven by the needs of domain name stuff, where the interpunct in particular is a bit of a problem. I think it's less of a big deal for us.

I wouldn't take Inclusion as too much of a signal here. That ontology needs a bit more work on defining what people should use these sub-properties for, it's a bit of a mess right now. We've got plans to improve that but that'll take time.

@Jules-Bertholet
Copy link
Contributor Author

Jules-Bertholet commented Feb 7, 2024

There's a second character with this behavior: U+30FB KATAKANA MIDDLE DOT. See this set.

I think KATAKANA MIDDLE DOT is Allowed - Inclusion due to its listing in IDNA2008/RFC5892. (It is also Other_ID_Continue.)

@pnkfelix
Copy link
Member

pnkfelix commented Feb 8, 2024

By the way, @Jules-Bertholet , I do want to thank you for bringing this to our attention.

I don't know where this conversation will end up. I'm planning to tag this as nominated for T-lang design since that team has the final call when it comes to potentially contentious lint decisions. But no matter where it ends up, it is super useful to know about the issue.

@pnkfelix
Copy link
Member

pnkfelix commented Feb 8, 2024

Actually, the more I reflect on this, I think @Manishearth has the best guidance here: We shouldn't do anything about 00B7 until we have the more principled approach in place, i.e. we should resolve #120228 first, and then land any changes here on top of the framework introduced there.

In that spirit, I am going to close this PR, but I want to thank @Jules-Bertholet again for bringing this to our attention. And I will also open up an issue, so that we can track the follow-up work item once #120228 is resolved.

@pnkfelix
Copy link
Member

pnkfelix commented Feb 8, 2024

As noted in previous comment: I'm closing this PR (because I think the work, if we do it at all, would be better suited to be done atop PR #120228), and I have filed #120797 as a follow-up to ensure we don't forget about considering the question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-Unicode Area: Unicode S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants