Allow non-ASCII identifiers #2457

Merged
merged 26 commits into from
Oct 29, 2018
ec728b3
Initial draft of unicode-idents RFC
pyfisch Jun 3, 2018
4c1bda9
Include expected Usage Notes and minor changes
pyfisch Jun 4, 2018
619f5b4
Improve descriptions and fix typos
pyfisch Jun 4, 2018
142d0bc
ACII -> ASCII
pyfisch Jun 4, 2018
6b2a94a
Typos, renames and a minor reference change
pyfisch Jun 7, 2018
3e19d26
Update Reference-level explanation
pyfisch Jun 8, 2018
a4830a1
Consider identifiers for confusable detection
pyfisch Jun 9, 2018
12d0623
Note difference between Python and Rust
pyfisch Jun 10, 2018
79bbc8e
Remove mention of scope from guide explanation
pyfisch Jun 10, 2018
41f0723
Rename confusable_non_ascii_idents to confusable_idents
pyfisch Jun 10, 2018
3c96d81
Conformance statement
pyfisch Jun 12, 2018
940dab5
Remove stray "is"
pyfisch Jun 12, 2018
da43d09
Add that non-ASCII idents observe UAX31-R3
pyfisch Jun 15, 2018
0e0ca66
Add details for fs, extern, lints
pyfisch Jun 16, 2018
935c917
Add two questions about debuggers and name mangling
pyfisch Jul 10, 2018
8d548d4
Add exotic codepoint detection and mixed script lints
pyfisch Aug 15, 2018
9356fc1
+ Reusability
Manishearth Oct 15, 2018
40d53f5
Global mixed script confusables lint
Manishearth Oct 15, 2018
7732810
notable code points for less_used_codepoints
Manishearth Oct 15, 2018
e3f3692
Mention user-supplied strings
Manishearth Oct 16, 2018
d389a9c
Add unresolved Q regarding const pat confusion (rust-lang/rust#7526).
pnkfelix Oct 19, 2018
70297a9
Remove old mixed scripts lints
Manishearth Oct 19, 2018
9bf90df
Add new mixed_script_confusables lint
Manishearth Oct 19, 2018
a6da03a
Add unresolved questions for RTL and terminal width
Manishearth Oct 20, 2018
c4dff64
Allow bare underscore identifiers
Manishearth Oct 20, 2018
0c78631
RFC 2457
Centril Oct 29, 2018
Typos, renames and a minor reference change
unicode_idents -> non_ascii_idents
Remove mention of exact spec revision
Describe more how to implement confusable detection and remove mention of scope
fix typo
pyfisch authored Jun 7, 2018
commit 6b2a94a58ef93bb80b7f9b859a2de6a39ec86431
26 changes: 15 additions & 11 deletions text/0000-unicode-idents.md → text/0000-non-ascii-idents.md
@@ -1,4 +1,4 @@
- Feature Name: unicode_idents
- Feature Name: non_ascii_idents
- Start Date: 2018-06-03
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)
@@ -36,17 +36,17 @@ Examples of invalid identifiers are:

* Keywords: `impl`, `fn`, `_` (underscore), ...
* Identifiers starting with numbers or containing "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ...
* Emojis: 🙂, 🦀, 💩, ...
* Many Emojis: 🙂, 🦀, 💩, ...

Similar Unicode identifiers are normalized: `a1` and `a₁` (a<subscript 1>) refer to the same variable. This also applies to accented characters which can be represented in different ways.

To disallow non-ASCII identifiers in a project (for example to ease collaboration or for security reasons) and limit the accepted identifiers to ASCII, add this lint to the `lib.rs` or `main.rs` file of your project:

```rust
#![forbid(unicode_idents)]
#![forbid(non_ascii_idents)]
```

Some Unicode characters look confusingly similar to each other or even identical, like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy-to-confuse names in the same scope. If needed (but not recommended), this warning can be silenced with a `#[allow(confusable_unicode_idents)]` annotation on the enclosing function or module.
Some Unicode characters look confusingly similar to each other or even identical, like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy-to-confuse names in the same scope. If needed (but not recommended), this warning can be silenced with a `#[allow(confusable_non_ascii_idents)]` annotation on the enclosing function or module.

## Usage notes

@@ -59,7 +59,9 @@ Private projects can use any script and language the developer(s) desire. It is
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][TR31]. Rust compilers shall use at least Revision 27 of the standard.
Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][UAX31].

Note: The supported Unicode version should be stated in the documentation.

The lexer defines identifiers as:

@@ -75,19 +77,21 @@ The lexer defines identifiers as:

Two identifiers X, Y are considered to be equal if their [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).
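
The equality rule above can be illustrated with a minimal sketch. Full NFKC is not part of the Rust standard library, so `toy_nfkc` below is a hypothetical stand-in that folds only the subscript digits U+2080 to U+2089 into ASCII digits; a real implementation would rely on a complete Unicode normalization library. The function names are illustrative and do not come from the RFC.

```rust
/// Toy stand-in for NFKC (assumption, not real NFKC): folds only the
/// subscript digits U+2080..=U+2089 to their ASCII digit equivalents.
fn toy_nfkc(ident: &str) -> String {
    ident
        .chars()
        .map(|c| match c {
            // The offset from SUBSCRIPT ZERO gives the digit value 0..=9.
            '\u{2080}'..='\u{2089}' => char::from_digit(c as u32 - 0x2080, 10).unwrap(),
            other => other,
        })
        .collect()
}

/// Two identifiers are treated as equal when their normalized forms match.
fn idents_equal(x: &str, y: &str) -> bool {
    toy_nfkc(x) == toy_nfkc(y)
}
```

With this sketch, `idents_equal("a1", "a\u{2081}")` holds, mirroring the `a1` / `a₁` example from the guide-level text.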

A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context.
A `non_ascii_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context.
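
As a rough sketch of the check this lint describes (not the compiler's actual implementation), the test reduces to looking for any code point at or above 0x80. The helper names `has_non_ascii` and `flagged_idents` are hypothetical, and identifiers are modeled as plain string slices.

```rust
/// Returns true if `ident` contains any code point at or above 0x80,
/// i.e. outside the ASCII range.
fn has_non_ascii(ident: &str) -> bool {
    ident.chars().any(|c| c as u32 >= 0x80)
}

/// Collects the identifiers such a lint would flag (hypothetical helper;
/// the input stands for all identifiers visible in the current context).
fn flagged_idents<'a>(idents: &[&'a str]) -> Vec<&'a str> {
    idents.iter().copied().filter(|id| has_non_ascii(id)).collect()
}
```

For ASCII-only input the same result is available through `str::is_ascii`; the explicit comparison is kept here to match the wording of the RFC text.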
Contributor

@estebank Jun 12, 2018

Bringing my position inline so that anyone can reply to this directly.

I believe that by default we should:

  • Allow a usable subset of Unicode characters in identifiers by default (UAX #31 or a variation of it sounds reasonable).
  • Below there's already a proposal for a confusable idents lint; it should be warn by default.
  • Instead of having a binary ASCII/all flag, we should treat the full Unicode space as per-script flags, allowing projects to specify (perhaps even per scope) a subset of allowable scripts (latin, math, greek, hira/hiragana, ascii/reduced/limited/lets-party-like-its-1960, emoji, etc.). If a list of allowable scripts is defined in a crate, encountering chars outside of them is a hard error. When this error involves a confusable that is outside the currently allowed scripts but has a similar representation in an allowed script (𝜆 in math -> λ in greek), the compiler should point this out and suggest either using the allowed script's char or adding the new script to the allowed list.

Contributor Author

Can you please state the reasoning behind point three? If the supporting arguments were already discussed also include the opinions of other people on the matter.

Contributor

Being liberal on the default is a contentious point, with people holding multiple positions. I fall in the "let's be relatively liberal" camp, with maybe a concession to a warn-by-default lint that fires when falling outside of ASCII and suggests specifying allowed scripts. Doing it this way minimizes friction for users of non-ASCII scripts while not heavily burdening ASCII-only users (this is debated).

I believe that allowing people to specify allowed scripts is a good solution to the problem of cross-script confusables sneaking in, without forcing people to decide that it is not worth using math symbols for formulas because they are worried about Cyrillic confusables being sneaked into the codebase. This would also allow me to use emojis for identifiers if I so choose, as that script would be disabled by default, but I could add it to the allowed list.

Per-scope customization would be interesting because it gives people a lot of linting control when they want to allow a given script internally but not in the public API. I would not count function argument names as part of the public API for this consideration, as we don't need to write them down, just be able to read them.

Contributor Author

Would you say yourself that this is a fair and complete review of the previous arguments?

Contributor

I believe it to be so, yes. I haven't seen my last point around per scope settings raised before, though.

Contributor Author

@pyfisch Jun 12, 2018

> Cyrillic confusables being sneaked into the codebase.

This was discussed under the "malicious contributor" label and was deemed "not much trouble".
How does allowing more code points than the UAX #31 Default Identifiers affect the evolution and readability of Rust? (See custom operators and smart quotes) For which people is it important to allow additional characters?
Why is the proposed confusable lint insufficient?
Why aren't ad hoc solutions like regular expressions sufficient?
Why are typos involving Unicode characters different from those only involving ASCII ones?

Contributor

> How does allowing more code points than the UAX #31 Default Identifiers affect the evolution and readability of Rust? (See custom operators and smart quotes)

I feel that giving partial support for Unicode idents would be a disservice if we don't allow a future-proof escape hatch.

> For which people is it important to allow additional characters?

Other than novelty uses (emoji), I don't know enough about other languages that might need it (my mother tongue is limited to latin).

> Why is the proposed confusable lint insufficient?

Proactive warning and fine-grained control for individual projects. A binary gate feels too blunt to me given the scope.

> Why aren't ad hoc solutions like regular expressions sufficient?

I believe that the compiler and an editor should be enough to develop in Rust, and the compiler should make an effort to help newcomers with non-obvious errors (and here I fall in the same camp of "avoid non-ASCII in general" if possible, I just don't want to mandate it).

> Why are typos involving Unicode characters different from those only involving ASCII ones?

Tooling support. We already make suggestions based on Levenshtein distance for ASCII typos; we should do the same for confusable chars.

Having said all of this, I'm enthusiastically in support of adding the proposed support and will do my best to make sure that the presented worries have a reasonable answer in the shape of compiler assisted support.

Contributor Author

> Having said all of this, I'm enthusiastically in support of adding the proposed support and will do my best to make sure that the presented worries have a reasonable answer in the shape of compiler assisted support.

Great 👍

> Other than novelty uses (emoji), I don't know enough about other languages that might need it (my mother tongue is limited to latin).

Maybe update your position above if it is only for novelty?

> Proactive warning and fine-grained control for individual projects. A binary gate feels too blunt to me given the scope.

Care to give an example from your personal experience where such a feature in a programming language would have made your work easier? (Keep in mind that many languages have had Unicode support for years now, among them Python, Java and C++.)

> Tooling support. We already make suggestions based on Levenshtein distance for ASCII typos; we should do the same for confusable chars.

We already do (tested on nightly):

error: cannot find macro `príntln!` in this scope
 --> src/main.rs:4:5
  |
4 |     príntln!("hello world")
  |     ^^^^^^^ help: you could try the macro: `println`

Contributor

> Maybe update your position above if it is only for novelty?

My position is born of my own ignorance; I defer to people with more experience with non-Western-European languages. For me (again, other than the novelty use of emoji and math symbols), my use case can be completely covered by latin (ñ, ü, é), and my counter case would be identifying confusables (which, as you point out, would be handled) and bad diagnostics when changed code (smart quotes) is pasted.

> Care to give an example from your personal experience where such a feature in a programming language would have made your work easier? (Keep in mind that many languages have had Unicode support for years now, among them Python, Java and C++.)

Entire generations of Spanish-speaking students have been "trained" to transliterate to ASCII, where año (year) gets changed to anio (meaningless) or ano (anus). I think that the focus for this document should be placed on the use of other scripts that limit developers more heavily than this.

As pointed out, having a single toggle would make it hard to only allow full unicode support at different levels, like only for literals, idents, in comments, or maybe on idents for internal interfaces, while disallowing them in other places (like public interfaces).


## Confusable detection
Contributor

Currently the parser will go down a bad route when encountering an identifier that isn't a keyword. We probably want to forbid unicode idents that could be confused with any of the rust keywords.

See fifth paragraph, last sentence:

> The compiler uses the same mechanism to check if an identifier is too similar to a keyword.


Rust compilers should detect confusingly similar Unicode identifiers and warn the user about them.

Note: This is *not* mandatory for all Rust compilers, as it requires considerable implementation effort and is not related to the core function of the compiler. Rather, it is a tool to detect accidental misspellings and intentional homograph attacks.

A new `confusable_unicode_idents` lint is added to the compiler. The default setting is `warn`.
A new `confusable_non_ascii_idents` lint is added to the compiler. The default setting is `warn`.

Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile.

The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X in the current scope execute the function `skeleton(X)`. If there exist two distinct identifiers X and Yin the same crate where `skeleton(X) = skeleton(Y)` report it.
The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X execute the function `skeleton(X)`. If there exist two distinct identifiers X and Y in the same crate where `skeleton(X) = skeleton(Y)` report it.

Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers.
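
The hashmap approach from the note above might look roughly like the following sketch. The real `skeleton` function is defined by UTS #39 and depends on the full confusables data; `toy_skeleton` here is a hypothetical stand-in that maps only four Cyrillic letters to their Latin look-alikes, and `find_confusables` is an illustrative name rather than anything from the RFC or the compiler.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Toy stand-in for the UTS #39 `skeleton` function (assumption): maps a few
/// Cyrillic letters to Latin look-alikes and leaves everything else as is.
fn toy_skeleton(ident: &str) -> String {
    ident
        .chars()
        .map(|c| match c {
            '\u{0410}' => 'A', // CYRILLIC CAPITAL LETTER A
            '\u{0430}' => 'a', // CYRILLIC SMALL LETTER A
            '\u{0435}' => 'e', // CYRILLIC SMALL LETTER IE
            '\u{043E}' => 'o', // CYRILLIC SMALL LETTER O
            other => other,
        })
        .collect()
}

/// Computes the skeleton of each identifier once and uses it as a hashmap key.
/// When a key is already present and the stored identifier differs from the
/// current one, the pair is reported as confusable.
fn find_confusables<'a>(idents: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut seen: HashMap<String, &'a str> = HashMap::new();
    let mut reports = Vec::new();
    for &ident in idents {
        match seen.entry(toy_skeleton(ident)) {
            Entry::Occupied(slot) => {
                let first = *slot.get();
                // The same identifier appearing twice is not a confusable pair.
                if first != ident {
                    reports.push((first, ident));
                }
            }
            Entry::Vacant(slot) => {
                slot.insert(ident);
            }
        }
    }
    reports
}
```

Each identifier is hashed once, so the pass is linear in the number of identifiers, at the cost of reporting every later identifier only against the first one seen with the same skeleton.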

# Drawbacks
[drawbacks]: #drawbacks
@@ -121,7 +125,7 @@ It has been suggested that Unicode identifiers should be opt-in instead of opt-o

The current design was chosen because the algorithm and list of similar characters are already provided by the Unicode Consortium. A different algorithm and list of characters could be created. I am not aware of any other programming language implementing confusable detection. The confusable detection was primarily included because homoglyph attacks are a huge concern for some members of the community.

Instead of offering confusable detection, the lint `forbid(unicode_idents)` is sufficient to protect projects written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos.
Instead of offering confusable detection, the lint `forbid(non_ascii_idents)` is sufficient to protect projects written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos.

# Prior art
[prior-art]: #prior-art
@@ -143,13 +147,13 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\
* Are Unicode characters allowed in `no_mangle` and `extern fn`s?
* How do Unicode names interact with the file system?
* Are crates with Unicode names allowed and can they be published to crates.io?
* Are `unicode_idents` and `confusable_unicode_idents` good names?
* Are `non_ascii_idents` and `confusable_non_ascii_idents` good names?
Member

As @eggrobin has pointed out, there are already pairs of ASCII characters that are considered confusable with each other; thus the `confusable_non_ascii_idents` lint should be renamed to `confusable_idents`.

Contributor Author

Thanks. Changed it.

* Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]?
* Should *rustc* accept files in a different encoding than *UTF-8*?

[PEP 3131]: https://www.python.org/dev/peps/pep-3131/
[UAX31]: http://www.unicode.org/reports/tr31/
[TR15]: https://www.unicode.org/reports/tr15/
[TR31]: http://www.unicode.org/reports/tr31/
[TR31Alternative]: http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax
[TR31Layout]: https://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters
[TR39Confusable]: https://www.unicode.org/reports/tr39/#Confusable_Detection