Allow non-ASCII identifiers #2457

Merged
merged 26 commits into from
Oct 29, 2018
ec728b3
Initial draft of unicode-idents RFC
pyfisch Jun 3, 2018
4c1bda9
Include expected Usage Notes and minor changes
pyfisch Jun 4, 2018
619f5b4
Improve descriptions and fix typos
pyfisch Jun 4, 2018
142d0bc
ACII -> ASCII
pyfisch Jun 4, 2018
6b2a94a
Typos, renames and a minor reference change
pyfisch Jun 7, 2018
3e19d26
Update Reference-level explanation
pyfisch Jun 8, 2018
a4830a1
Consider identifiers for confusable detection
pyfisch Jun 9, 2018
12d0623
Note difference between Python and Rust
pyfisch Jun 10, 2018
79bbc8e
Remove mention of scope from guide explanation
pyfisch Jun 10, 2018
41f0723
Rename confusable_non_ascii_idents to confusable_idents
pyfisch Jun 10, 2018
3c96d81
Conformance statement
pyfisch Jun 12, 2018
940dab5
Remove stray "is"
pyfisch Jun 12, 2018
da43d09
Add that non-ASCII idents observe UAX31-R3
pyfisch Jun 15, 2018
0e0ca66
Add details for fs, extern, lints
pyfisch Jun 16, 2018
935c917
Add two questions about debuggers and name mangling
pyfisch Jul 10, 2018
8d548d4
Add exotic codepoint detection and mixed script lints
pyfisch Aug 15, 2018
9356fc1
+ Reusability
Manishearth Oct 15, 2018
40d53f5
Global mixed script confusables lint
Manishearth Oct 15, 2018
7732810
notable code points for less_used_codepoints
Manishearth Oct 15, 2018
e3f3692
Mention user-supplied strings
Manishearth Oct 16, 2018
d389a9c
Add unresolved Q regarding const pat confusion (rust-lang/rust#7526).
pnkfelix Oct 19, 2018
70297a9
Remove old mixed scripts lints
Manishearth Oct 19, 2018
9bf90df
Add new mixed_script_confusables lint
Manishearth Oct 19, 2018
a6da03a
Add unresolved questions for RTL and terminal width
Manishearth Oct 20, 2018
c4dff64
Allow bare underscore identifiers
Manishearth Oct 20, 2018
0c78631
RFC 2457
Centril Oct 29, 2018
Improve descriptions and fix typos
Thanks to SimonSapin for the suggestions.
pyfisch authored Jun 4, 2018
commit 619f5b4ed000dffaa504aaf1502ac232d7bd2e52
11 changes: 5 additions & 6 deletions text/0000-unicode-idents.md
@@ -28,8 +28,7 @@ Identifiers include variable names, function and trait names and module names. T

Examples of valid identifiers are:

* English language words: `color`, `image_width`, `line2`, `Photo`, `_unused`, ...
* ASCII words in foreign languages: `die_eisenbahn`, `el_tren`, `artikel_1_grundgesetz`
* ACII letters and digits: `image_width`, `line2`, `Photo`, `el_tren`, `_unused`
* words containing accented characters: `garçon`, `hühnervögel`
* identifiers in other scripts: `Москва`, `東京`, ...

@@ -53,9 +52,9 @@ Some Unicode character look confusingly similar to each other or even identical

All code written in the Rust Language Organization (*rustc*, tools, std, common crates) will continue to only use ASCII identifiers and the English language.

For open source crates it is recommended to write them in English and use ASCII-only. An exception should be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should provide an ASCII-only API.
For open source crates it is suggested to write them in English and use ASCII-only. An exception can be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should consider to provide an ASCII-only API.

Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overuse it.
Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overdo it.

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation
@@ -74,7 +73,7 @@ The lexer defines identifiers as:

`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed.

Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).
Two identifiers X, Y are considered to be equal if their [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).

A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context.
Member

I don't think the lint should be `allow` by default. It should be at least `warn` by default. Starting to use non-ASCII idents should be a conscious choice, not an accidental one.

As for checking imported idents as well, I think this should be a separate lint. You might want to import something from a foreign-language crate without wanting foreign-language idents in your own code.

@shingtaklam1324 Jun 4, 2018

Reposting my comment from #2455 (comment): I think there should be different lints with different levels, as some languages should be `allow`/`warn` by default and others should be `deny` by default.
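
For reference, the check that the proposed `unicode_idents` lint performs (flag any identifier containing a code point equal to or greater than 0x80) can be sketched with the standard library alone. This is only an illustration of the rule as worded in the RFC text above, not the compiler's implementation: `first_non_ascii` and `flagged_idents` are hypothetical helper names, and the sketch ignores lint levels and the local-versus-imported distinction discussed in this thread.

```rust
/// Returns the first character of `ident` that lies outside the ASCII
/// range (code point >= 0x80), or `None` if the identifier is ASCII-only.
/// Hypothetical helper for illustration; not a rustc API.
fn first_non_ascii(ident: &str) -> Option<char> {
    ident.chars().find(|&c| c as u32 >= 0x80)
}

/// Returns the identifiers from `idents` that the sketched check would
/// flag, preserving their original order.
/// Hypothetical helper for illustration; not a rustc API.
fn flagged_idents<'a>(idents: &[&'a str]) -> Vec<&'a str> {
    idents
        .iter()
        .copied()
        .filter(|id| first_non_ascii(id).is_some())
        .collect()
}
```

Under this sketch, `image_width` and `_unused` pass, while `Москва` and `東京` are flagged. Whether such a finding is reported as `allow`, `warn`, or `deny` is exactly the policy question raised in the comments above.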


Expand Down Expand Up @@ -159,4 +158,4 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\
[Julia Unicode PR]: https://github.com/JuliaLang/julia/pull/19464
[Java]: https://docs.oracle.com/javase/specs/jls/se10/html/jls-3.html#jls-3.8
[JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords
[Go]: https://golang.org/ref/spec#Identifiers
[Go]: https://golang.org/ref/spec#Identifiers