Add document for Unicode casemapping #272

Closed
wants to merge 13 commits into from
181 changes: 181 additions & 0 deletions documentation/rfc8265.md
@@ -0,0 +1,181 @@
---
title: rfc8265 IRC Unicode-aware Casemapping
layout: spec
copyrights:
  -
    name: "Daniel Oaks"
    period: "2016-2017"
    email: "daniel@danieloaks.net"
---
This document describes a Unicode-aware casemapping for IRC, based on the recommendations in [RFC 8265](https://tools.ietf.org/html/rfc8265).

Client and channel names in languages other than English are a much-desired feature. This casemapping allows an extended set of characters while minimising the risks of allowing them and avoiding the technical limitations of prior solutions to this problem.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](http://tools.ietf.org/html/rfc2119).


## Casemapping

The server indicates that this Unicode-aware casemapping is in use by advertising the `"UTF8MAPPING"` `RPL_ISUPPORT` parameter with the value `"rfc8265"`. The `"CASEMAPPING"` parameter (with a value such as `"ascii"` or `"rfc1459"`) MUST also be sent, and indicates additional casemapping information. An example of this is shown below:

    :irc.example.com 005 dan CASEMAPPING=ascii UTF8MAPPING=rfc8265 :are supported by this server
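
A client might read these two tokens as in the minimal sketch below. The function names `parse_isupport` and `select_casemapping` are hypothetical, and the fallback to `rfc1459` when `CASEMAPPING` is absent is an assumption about legacy behaviour, not something this document specifies.

```python
def parse_isupport(line):
    """Collect KEY=VALUE tokens from one 005 (RPL_ISUPPORT) line."""
    # Drop the human-readable trailing parameter, which starts at " :".
    head = line.split(" :", 1)[0]
    parts = head.split(" ")
    # With a leading ":server" source the layout is: source, 005, nick, tokens.
    tokens = parts[3:] if parts[0].startswith(":") else parts[2:]
    result = {}
    for token in tokens:
        key, _, value = token.partition("=")
        result[key] = value
    return result


def select_casemapping(tokens):
    """Return (legacy casemapping, Unicode mapping or None)."""
    # Assumption: fall back to rfc1459 if the server omitted CASEMAPPING.
    legacy = tokens.get("CASEMAPPING", "rfc1459")
    # A client unaware of UTF8MAPPING simply never looks at this token.
    return legacy, tokens.get("UTF8MAPPING")
```

A client that does not recognise the `UTF8MAPPING` value can keep using the legacy value alone.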

When the `rfc8265` casemapping is in use, names (and other messages from the IRCd where possible) MUST be sent using the [UTF-8](https://tools.ietf.org/html/rfc3629) encoding.

The `rfc8265` casemapping uses the PRECIS [UsernameCaseMapped profile][precis] as defined in [Section 3.3 of RFC 8265][precis].

[precis]: https://tools.ietf.org/html/rfc8265#section-3.3


### CASEMAPPING vs UTF8MAPPING

The original `"CASEMAPPING"` `RPL_ISUPPORT` token is preserved, and a new token called `"UTF8MAPPING"` is added to denote this Unicode-compatible casemapping. This is done for two reasons: to allow servers to continue using `rfc1459`-specific casemapping rules if they need to, and to ensure that clients which aren't aware of this Unicode casemapping can fall back to a default casemapping.

We recommend using these values:

- `CASEMAPPING`: `ascii`
- `UTF8MAPPING`: `rfc8265`

If the above values are in use, software can skip the ASCII-specific casemapping rules and continue with the [UsernameCaseMapped profile][precis] preparation rules.

If `CASEMAPPING` is `"rfc1459"`, software MUST apply these special casemapping rules before continuing with the [UsernameCaseMapped profile][precis] preparation rules:

- `('{', 0x7B)` is the lower-case equivalent of the character `('[', 0x5B)`
- `('|', 0x7C)` is the lower-case equivalent of the character `('\', 0x5C)`
- `('}', 0x7D)` is the lower-case equivalent of the character `(']', 0x5D)`
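
The three rules above amount to a small translation table. The sketch below is one way to express them; `apply_legacy_casemapping` is a hypothetical helper name, and it covers only the special characters listed here, leaving letter case to the later PRECIS step.

```python
# Upper-case to lower-case pairs that rfc1459 adds: [ -> {, \ -> |, ] -> }
RFC1459_EXTRA = str.maketrans("[\\]", "{|}")


def apply_legacy_casemapping(name, casemapping):
    """Apply the CASEMAPPING-specific rules before PRECIS preparation."""
    if casemapping == "rfc1459":
        return name.translate(RFC1459_EXTRA)
    # With "ascii" there are no special rules to apply at this stage.
    return name
```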


### Order of Operations

Names being prepared MUST apply the following rules in the order shown:

1. If casefolding a channel name, strip all channel prefixes (in the ISUPPORT token `CHANTYPES`) right at the start of the name.
2. Preparation using additional casemapping rules specified by `CASEMAPPING` (such as `rfc1459`).
3. Preparation using the PRECIS [UsernameCaseMapped profile][precis].
4. Check for restricted characters.
5. If casefolding a channel name, re-apply the original stripped channel prefixes.

These steps MUST happen in the order shown, or else the restricted characters check may miss characters that should be legitimately restricted.

If a name contains a restricted character (found in either step 3 or 4), that name MUST be rejected by the server and MUST NOT be propagated to other clients. The server signals the rejection with the numeric most appropriate for the command that tried to set or use the invalid name, such as `ERR_ERRONEUSNICKNAME` or `ERR_NOSUCHCHANNEL`.

The first and last steps are specific to casefolding channel names. They exist because the Bidi (bi-directional) rule applied by the profile would otherwise reject channel names written in a right-to-left script: a name such as `#שלום` starts with `#`, which is not a character the rule allows at the start of a string containing right-to-left characters. The stripping applies to every channel name regardless of script. A channel name like `#愛でる` should have the `#` character stripped, the rest of the string prepared, and then the prefix `#` re-applied. If there is more than one channel prefix character at the start, then all should be stripped (e.g. `##愛でる` would have the `##` stripped and then re-applied).
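
The five steps can be sketched as a single function. This is an illustration under stated assumptions, not a reference implementation: `casefold_name`, `NameRejected` and `stand_in_precis` are hypothetical names, and `stand_in_precis` only lower-cases and NFC-normalises, so it is not an implementation of the UsernameCaseMapped profile (it performs no width mapping, no Bidi check and no disallowed-code-point check).

```python
import unicodedata


class NameRejected(ValueError):
    """Raised when a name contains a restricted character."""


def stand_in_precis(name):
    # NOT a real PRECIS implementation: only lower-casing and NFC.
    # A real server would call a UsernameCaseMapped implementation here.
    return unicodedata.normalize("NFC", name.lower())


def casefold_name(name, *, is_channel=False, chantypes="#",
                  casemapping="ascii", restricted=" ,",
                  precis=stand_in_precis):
    prefix, body = "", name
    if is_channel:
        # Step 1: strip every channel prefix character at the start.
        body = name.lstrip(chantypes)
        prefix = name[:len(name) - len(body)]
    # Step 2: extra rules selected by CASEMAPPING.
    if casemapping == "rfc1459":
        body = body.translate(str.maketrans("[\\]", "{|}"))
    # Step 3: PRECIS preparation (a stand-in in this sketch).
    body = precis(body)
    # Step 4: restricted character check on the prepared string.
    for ch in body:
        if ch in restricted:
            raise NameRejected(f"restricted character {ch!r} in name")
    # Step 5: re-apply the stripped channel prefixes.
    return prefix + body
```

Passing the PRECIS step in as a parameter keeps the ordering visible while leaving the choice of PRECIS library open.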


### Comparisons

When comparing names for equivalency, both strings are casemapped with the above process. After this, they are compared with a standard bit-for-bit comparison.
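
As a minimal sketch, the comparison reduces to casemapping both inputs and comparing their UTF-8 bytes. `same_name` is a hypothetical helper, and `stand_in_casefold` is a placeholder for the full process described above rather than a PRECIS implementation.

```python
import unicodedata


def stand_in_casefold(name):
    # Placeholder for the full casemapping process; NOT PRECIS.
    return unicodedata.normalize("NFC", name.lower())


def same_name(a, b, casefold=stand_in_casefold):
    """Casemap both names, then compare the encoded bytes exactly."""
    return casefold(a).encode("utf-8") == casefold(b).encode("utf-8")
```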


## Visually Similar Characters

As noted in the [Visually Similar Characters section](https://tools.ietf.org/html/rfc8264#section-12.5) of the PRECIS framework specification, the PRECIS framework itself does not address the issue of eliminating all possible visually similar characters.

With the new allowed Unicode characters comes the ability to use characters that look the same. For example, `E (0x45)`, `Ε (U+0395)` (Greek Capital Letter Epsilon), and `Е (U+0415)` (Cyrillic Capital Letter IE) look the same in most fonts, but are treated as separate characters by this casemapping. More examples of these can be found in Unicode's [Confusables document](https://www.unicode.org/Public/security/latest/confusables.txt).

Unicode skeletonisation is the method we recommend to combat this. For each identifier (nick/channel name) on the server, a 'skeleton' is generated by taking the **casefolded** name, and then applying to it the transformations described in the [Unicode Security Mechanisms document](http://unicode.org/reports/tr39/#Confusable_Detection). These skeletons, if used, MUST ONLY be used for comparison, and not as any user-visible identifier (as they intentionally contain complicated mixes of scripts and characters). When users change nicknames or create new channels, the casefolded names should be compared and the skeletons should also be compared to ensure that both are globally-unique (with any non-unique names rejected outright). This seems to be the most reliable method as of right now, but does require storing the skeletons of all in-use names for comparison purposes.

Review comment (Contributor):

This isn't what we implemented in oragono --- we skeletonize the unfolded, original identifier (the one that is displayed to the users), and only then apply a round of width and case normalization. The rationale is that an initial round of casefolding may lose information about visual confusability. Hypothetically, you could have a non-Latin character with both uppercase and lowercase forms, such that its uppercase form is visually confusable with a Latin character but its lowercase form is visually distinct. Casefolding first would allow an impersonation attack using the uppercase character.

To be honest, I think it would help to see how this plays out in the wild --- try to get real-world user stories from people using non-Latin scripts, also see if we can get some Unicode experts to play with the implementation and try to break it.


Alternatively, server authors may wish to allow only characters from specific character sets or locales, or only characters from a list of known, non-confusable characters. Other recommendations are available in the [Visually Similar Characters section](https://tools.ietf.org/html/rfc8264#section-12.5) of the PRECIS framework specification. Names that have the opportunity to be confusing SHOULD be disallowed by servers.
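
The skeleton comparison can be sketched as follows. Several things here are assumptions for illustration: `CONFUSABLE_PROTOTYPES` is a three-entry hand-picked excerpt (a real server would load the full Unicode confusables data), `NameRegistry` and `skeleton` are hypothetical names, and the default casefold is a stand-in rather than PRECIS. The sketch skeletonises the casefolded name, as the specification text describes; the review comment above describes the opposite order (skeletonise the original name first), which would change only the input passed to `skeleton`.

```python
import unicodedata

# Tiny illustrative excerpt; NOT the full confusables table.
CONFUSABLE_PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def skeleton(name):
    """NFD, map each character to its prototype, then NFD again."""
    decomposed = unicodedata.normalize("NFD", name)
    mapped = "".join(CONFUSABLE_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def stand_in_casefold(name):
    # Placeholder for the casemapping process; NOT PRECIS.
    return unicodedata.normalize("NFC", name.lower())


class NameRegistry:
    """Tracks casefolded names and skeletons of all names in use."""

    def __init__(self, casefold=stand_in_casefold):
        self._casefold = casefold
        self._folded = set()
        self._skeletons = set()

    def claim(self, name):
        """Register the name; return False if it collides with one in use."""
        folded = self._casefold(name)
        skel = skeleton(folded)
        if folded in self._folded or skel in self._skeletons:
            return False
        self._folded.add(folded)
        self._skeletons.add(skel)
        return True
```

The skeletons are stored only for comparison and are never shown to users.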


## Restricted Characters

This section contains a list of characters which (currently) cause protocol issues or are otherwise undesirable, and are recommended to be disallowed from use. The justification for disallowing each character is listed beside it.

Servers and bouncers may disallow any characters they deem necessary.


### All Names

Nicknames, usernames, hostnames and channel names should not contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.

In addition, the first character of nicknames, usernames, hostnames and channel names should not be:

* `(':', 0x3A)` - Separates trailing parameter.

For clarity, these characters are repeated in the per-type lists below.

#### Protocol Framing

These characters are disallowed, either by the protocol itself or because of how they are used in message framing:

* `('\0', 0x00)` - Disallowed by the protocol.
* `('\n', 0x0A)` - Used in message framing.
* `('\r', 0x0D)` - Used in message framing.


### Nicknames

Nicknames should not contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('.', 0x2E)` - Denotes a server name (some clients recognise a client vs server by whether the name contains a period).
* `('!', 0x21)` - Separates nickname from username.
* `('@', 0x40)` - Separates username from hostname.

In addition to the above restrictions, the first character of a nickname should not be:

* `(':', 0x3A)` - Separates trailing parameter.
* Channel membership prefixes as defined in the `PREFIX` parameter of `RPL_ISUPPORT` - Protocol conflicts.
* Channel prefixes as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` - Protocol conflicts.
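
The nickname restrictions above could be checked along these lines. `nickname_problem` and `membership_prefixes` are hypothetical helpers; the default `PREFIX` and `CHANTYPES` values are example values only, and rejecting an empty nickname is an added assumption that the lists above do not state.

```python
FORBIDDEN_IN_NICK = set(" ,*?.!@")
FRAMING = set("\x00\n\r")


def membership_prefixes(prefix_token):
    # PREFIX looks like "(ov)@+"; the symbols follow the closing parenthesis.
    return prefix_token.partition(")")[2]


def nickname_problem(nick, prefix_token="(ov)@+", chantypes="#"):
    """Return a short reason if the nickname is not acceptable, else None."""
    if not nick:
        return "empty nickname"  # assumption: not stated in the lists above
    for ch in nick:
        if ch in FORBIDDEN_IN_NICK or ch in FRAMING:
            return f"contains restricted character {ch!r}"
    first = nick[0]
    if first == ":":
        return "starts with ':'"
    if first in membership_prefixes(prefix_token):
        return "starts with a channel membership prefix"
    if first in chantypes:
        return "starts with a channel prefix"
    return None
```

Per the Order of Operations section, a check like this runs on the prepared name, after the PRECIS step.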


### Usernames

Usernames should not contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('!', 0x21)` - Separates nickname from username.
* `('@', 0x40)` - Separates username from hostname.

In addition, the first character of a username should not be:

* `(':', 0x3A)` - Separates trailing parameter.


### Hostnames

Hostnames should not contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('!', 0x21)` - Separates nickname from username.
* `('@', 0x40)` - Separates username from hostname.

In addition, the first character of a hostname should not be:

* `(':', 0x3A)` - Separates trailing parameter.

Hostnames that are looked up on client connection should also be checked against the traditional hostname rules to ensure they are valid.


### Channel Names

Channel names must start with a channel prefix as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` (005).

Channel names should not contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.


## Security Considerations

New Unicode-capable casemappings have a considerable security impact.

Introducing so many new characters in names brings the possibility for those new characters to break the protocol. We aim to prevent this from being an issue by outlining these problematic characters in the Restricted Characters section.

With Unicode characters comes the possibility of having visually similar characters (sometimes called 'confusables'). We acknowledge this and outline several ways to combat this possibility in the Visually Similar Characters section.

Legacy clients may have issues working correctly with Unicode names. Given the very large base of clients in use today, it is difficult to predict how they will react to such names. We do not attempt to address this concern within this document.