Add document for Unicode casemapping #272

Closed
wants to merge 13 commits into from
151 changes: 151 additions & 0 deletions documentation/rfc7700.md
@@ -0,0 +1,151 @@
---
title: rfc7700 IRC Casemapping
layout: spec
copyrights:
-
name: "Daniel Oaks"
period: "2016"
email: "daniel@danieloaks.net"
---
This document describes a Unicode-aware casemapping for IRC, based on the recommendations in [RFC 7700](https://tools.ietf.org/html/rfc7700).
Contributor

Unicode is a proper noun so it should be capitalised.


Support for client and channel names in languages other than English is a much-desired feature. This casemapping allows for an extended set of characters while minimising the risks around allowing these characters, and avoiding the technical limitations of prior solutions to this problem.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](http://tools.ietf.org/html/rfc2119).


## Casemapping

The server indicates that this casemapping is in use by advertising the `"CASEMAPPING"` `RPL_ISUPPORT` parameter with the value `"rfc7700"`. An example of this is shown below:

    :irc.example.com 005 dan CASEMAPPING=rfc7700 :are supported by this server
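
As a rough illustration of how a client might detect this casemapping, the sketch below splits a `005` line into its tokens and checks the `CASEMAPPING` value. The function names `parse_isupport` and `uses_rfc7700` are hypothetical and not part of any existing library, and the parsing is deliberately minimal (no message tags, no value unescaping).

```python
def parse_isupport(line: str) -> dict:
    """Return the tokens of a 005 (RPL_ISUPPORT) line as a dict (hypothetical helper)."""
    # Drop the optional ":server.name " source at the start of the line.
    if line.startswith(":"):
        _, _, line = line.partition(" ")
    # Everything after " :" is the human-readable trailing parameter.
    middle, _, _trailing = line.partition(" :")
    params = middle.split()
    if len(params) < 2 or params[0] != "005":
        return {}
    tokens = {}
    # params[1] is the client's nickname; the ISUPPORT tokens follow it.
    for token in params[2:]:
        key, _, value = token.partition("=")
        tokens[key] = value
    return tokens


def uses_rfc7700(line: str) -> bool:
    """True when the line advertises CASEMAPPING=rfc7700."""
    return parse_isupport(line).get("CASEMAPPING") == "rfc7700"
```

Called on the example line above, `uses_rfc7700` returns `True`. A real client would merge tokens across several `005` lines before deciding.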

When the `rfc7700` casemapping is in use, names (and other messages from the IRCd where possible) MUST be sent using the [UTF-8](https://tools.ietf.org/html/rfc3629) encoding.

The `rfc7700` casemapping uses the PRECIS [Nickname profile](https://tools.ietf.org/html/rfc7700#section-2) as defined in [Section 2 of RFC 7700](https://tools.ietf.org/html/rfc7700#section-2).


### Order of Operations

Names being prepared MUST apply the following rules in the order shown:

1. Preparation using the PRECIS [Nickname profile](https://tools.ietf.org/html/rfc7700#section-2).
2. Check for restricted characters.

These steps MUST happen in the order shown; otherwise the restricted characters check may miss characters that only become restricted after preparation (for example, through normalization).

If a name does contain a restricted character (whether disallowed by the [Nickname profile](https://tools.ietf.org/html/rfc7700#section-2.2) or this document), it MUST be rejected by the server and MUST NOT be propagated to other clients. The server signals the rejection with the numeric appropriate to the command that tried to set or use the invalid name, such as `ERR_ERRONEUSNICKNAME`, `ERR_NOSUCHCHANNEL`, or whichever numeric is most appropriate.
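
The two steps can be sketched as follows. This is only an approximation under stated assumptions: `approx_nickname_profile` is a hypothetical stand-in that loosely follows the space mapping, case folding and NFKC steps of the Nickname profile using Python's `unicodedata` module. It does not implement the PRECIS FreeformClass checks, so a real server would substitute a complete PRECIS implementation for that function.

```python
import unicodedata

# Characters this document restricts in every name type ("All Names").
ALL_NAMES_RESTRICTED = frozenset(" :,*?")


def approx_nickname_profile(name: str) -> str:
    """Rough stand-in for the PRECIS Nickname profile (NOT a full implementation)."""
    # Map every Unicode space separator (category Zs) to an ASCII space.
    spaced = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in name)
    # Drop leading/trailing spaces and collapse runs of spaces into one.
    collapsed = " ".join(part for part in spaced.split(" ") if part)
    # Case-fold, then apply NFKC normalization.
    return unicodedata.normalize("NFKC", collapsed.casefold())


def prepare_name(name: str, profile=approx_nickname_profile) -> str:
    """Step 1: profile preparation. Step 2: restricted-character check."""
    prepared = profile(name)
    bad = sorted(set(prepared) & ALL_NAMES_RESTRICTED)
    if bad:
        raise ValueError(f"restricted characters after preparation: {bad!r}")
    return prepared
```

The ordering matters because normalization can produce restricted characters: U+FF1A (FULLWIDTH COLON) is not in the restricted list itself, but NFKC turns it into `:`, so `prepare_name("dan\uff1a")` raises `ValueError` only because the check runs after preparation.
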
Contributor

@grawity grawity Sep 15, 2016

At this point better use the named link syntax:

The `rfc7700` casemapping uses the PRECIS [Nickname profile][precis] as defined in [Section 2 of RFC 7700][precis].

[precis]: https://tools.ietf.org/html/rfc7700#section-2

Member Author

Cool, I changed the other links above to use the named link syntax since they were all referring to the same URL. The link here and the link to rfc7700 up the top differ from the others since they're linking to different sections (and the url's only replicated once throughout the doc).



### Comparisons

When comparing user names or channel names to check for equivalency, both strings are casemapped with the above process. After this, they are compared with a standard bit-for-bit comparison.
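
A minimal sketch of such a comparison, assuming a simplified casemapping: the `casemap` function here is hypothetical and uses only case folding plus NFKC from Python's `unicodedata` module, where a real implementation would apply the full preparation process described above.

```python
import unicodedata


def casemap(name: str) -> str:
    """Simplified casemapping: case folding then NFKC (an assumption, not full PRECIS)."""
    return unicodedata.normalize("NFKC", name.casefold())


def names_equal(a: str, b: str) -> bool:
    """Casemap both names, then compare their UTF-8 encodings byte for byte."""
    return casemap(a).encode("utf-8") == casemap(b).encode("utf-8")
```

Under this sketch, `names_equal("Dan", "dan")` is true, and a precomposed `é` compares equal to `e` followed by a combining acute accent, because both normalize to the same code point sequence.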


## Restricted Characters

Certain characters mean special things to the protocol. This section details characters that are not allowed in specific name types.


### All Names

Nicknames, usernames, hostnames and channel names cannot contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.

For clarity, these characters are repeated in each of the lists below.
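
One way to express the list above in code is a lookup table from character to reason, which also lets a server report why a name was rejected. The names `ALL_NAMES_RESTRICTED` and `find_restricted` are illustrative only.

```python
# Characters restricted in all name types, with the reason given in this document.
ALL_NAMES_RESTRICTED = {
    " ": "separates parameters",
    ":": "separates trailing parameter",
    ",": "used as a separator",
    "*": "used in mask matching",
    "?": "used in mask matching",
}


def find_restricted(name: str, table=ALL_NAMES_RESTRICTED) -> list:
    """Return (index, character, reason) for every restricted character in name."""
    return [(i, ch, table[ch]) for i, ch in enumerate(name) if ch in table]
```

An empty result means the name passed this particular check; the per-type lists in the following sections could be handled by passing a different table.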

#### Protocol Framing

These characters are disallowed either by the protocol itself or because of how they are used in protocol framing. These won't be repeated below, as sane implementations either break the message on or disallow the character in streams:

* `('\0', 0x00)` - Disallowed by the protocol.
* `('\n', 0x0A)` - Used in message framing.
* `('\r', 0x0D)` - Used in message framing.


### Nicknames

Nicknames cannot contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.
Contributor

@grawity grawity Sep 15, 2016

This & identical entries below seem unnecessary; : only has special meaning as the first character of a parameter.

Member Author

It does, but since it's already disallowed and I could see it causing possible confusion with libraries that split parameters strangely, figured it was better to disallow it. If we figure it's not required I can definitely remove it though.

Contributor

Might as well forbid : in privmsgs, topics, etc. A library blindly splitting on : is not "strange", it's outright buggy.

Member Author

Hmm, that's fair. In that case, I can just note that the first letter of one can't be :? (since if i.e. a nickname started with : then you wouldn't be able to use it in normal messages)

Contributor

Yeah, it should be fine under the "first character" list. (For channels it's already implied by CHANTYPE.)

* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('.', 0x2E)` - Denotes a server name.
* `('!', 0x21)` - Separates nickname from username.
Contributor

Same as previously

* `('@', 0x40)` - Separates username from hostname.

In addition to the above restrictions, the first character of a nickname cannot be:

* `('-', 0x2D)` - Disallowed.
* `('0', 0x30)` - Disallowed.
* `('1', 0x31)` - Disallowed.
* `('2', 0x32)` - Disallowed.
* `('3', 0x33)` - Disallowed.
* `('4', 0x34)` - Disallowed.
* `('5', 0x35)` - Disallowed.
* `('6', 0x36)` - Disallowed.
* `('7', 0x37)` - Disallowed.
* `('8', 0x38)` - Disallowed.
* `('9', 0x39)` - Disallowed.
Contributor

Not allowing numbers as the first char of a nick shouldn't be in the spec for these reasons:

  • Servers already change the nick of clients to nicks starting with a number e.g. in case of collision and with this restriction that is a violation of the spec.
  • Presently most (or all) servers don't allow nicks starting with numbers but in the future servers should be able to relax this restriction without updating the casemapping.

Contributor

There's nothing stopping servers from accepting a subset of nicks allowed by this spec (they can send an invalid nick numeric for any nick they don't like) so servers can still disallow digits if they want but they cannot allow more nicks than what this spec allows. Also clients must be prepared to see nicks starting with digits.

* Channel membership prefixes as defined in the `PREFIX` parameter of `RPL_ISUPPORT` - Protocol conflicts.
* Channel prefixes as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` - Protocol conflicts.
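
The nickname rules in this section could be checked roughly as shown below. The sketch follows the draft text as written, including the digit and `:` restrictions that the review comments question. The default `PREFIX` and `CHANTYPES` values are assumptions chosen for illustration; a real implementation would take them from the server's own `RPL_ISUPPORT` tokens. All function names are hypothetical.

```python
# Characters disallowed anywhere in a nickname, per the list in this section.
NICK_RESTRICTED = frozenset(" :,*?.!@")
# Characters additionally disallowed as the first character of a nickname.
NICK_FIRST_RESTRICTED = frozenset("-0123456789")


def prefix_symbols(prefix_value: str) -> str:
    """Extract the membership symbols from a PREFIX value, e.g. "(ov)@+" gives "@+"."""
    _, _, symbols = prefix_value.partition(")")
    return symbols


def is_valid_nickname(nick: str, prefix_value: str = "(ov)@+", chantypes: str = "#&") -> bool:
    """Check a prepared nickname against the restricted and first-character lists."""
    if not nick:
        return False
    if any(ch in NICK_RESTRICTED for ch in nick):
        return False
    first_forbidden = (
        NICK_FIRST_RESTRICTED | set(prefix_symbols(prefix_value)) | set(chantypes)
    )
    return nick[0] not in first_forbidden
```

With the assumed defaults, `dan` and `dan9` are accepted while `9dan`, `#dan` and `+dan` are rejected. This check is meant to run on the name after PRECIS preparation, not on the raw input.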


### Usernames

Usernames cannot contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('!', 0x21)` - Separates nickname from username.
Contributor

Fix: separates nickname from username

* `('@', 0x40)` - Separates username from hostname.
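
To show why `!` and `@` are restricted here, the sketch below splits a `nick!user@host` mask at the first `!` and the first `@` after it. The helper name `split_mask` is hypothetical, and real parsers may differ in how they handle malformed input.

```python
def split_mask(mask: str) -> tuple:
    """Split a "nick!user@host" mask into (nick, user, host)."""
    nick, bang, rest = mask.partition("!")
    user, at, host = rest.partition("@")
    if not (bang and at):
        raise ValueError(f"not a nick!user@host mask: {mask!r}")
    return nick, user, host
```

`split_mask("dan!d@localhost")` gives `("dan", "d", "localhost")`. If a username were allowed to contain `@`, a mask such as `dan!d@x@localhost` would be split into the user `d` and the host `x@localhost`, which is not what the sender intended.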


### Hostnames

Hostnames cannot contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.
Contributor

@attilamolnar attilamolnar Sep 15, 2016

IPv6 IPs need : in the hostname (spotted by @jobe1986)

Member

@jwheare jwheare Sep 15, 2016

They even technically can have it as the first character. Which is a bit problematic for 352 RPL_WHOREPLY and 311 RPL_WHOISUSER.

Contributor

Servers add a 0 prefix to IPv6 IPs beginning with : so that's not a problem.

Member Author

My mistake, meant to remove those with another change. This has been removed.

* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('!', 0x21)` - Separates nickname from username.
* `('@', 0x40)` - Separates username from hostname.

Hostnames that are looked up on client connection should also be checked against the traditional hostname rules to ensure they are valid.


### Channel Names

Channel names must start with a channel prefix as defined in the `CHANTYPES` parameter of `RPL_ISUPPORT` (005).

Channel names cannot contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
Member

@jwheare jwheare Sep 15, 2016

Is mask matching in channel names a thing? * and ? are valid channel characters at the moment, this seems overly restrictive. (spotted by @jobe1986)

Member Author

Cool, removed those

Contributor

We (InspIRCd) use glob matching on channel names in various places.

Member

@SaberUK Example? Do you also forbid those characters in channel names or is there just no way to specify them without accidentally over-globbing?

Actually, I just tested and was able to create a channel on Insp with both * and ?. I think the recommendation should probably not go against existing valid characters.

Contributor

@jwheare We don't presently forbid them although they are used in various places like e.g.

https://github.com/inspircd/inspircd/blob/master/docs/conf/modules.conf.example#L714

This does unfortunately result in some problems like what you mentioned though.


## Security Considerations

New Unicode-capable casemappings have a considerable security impact.

With the large numbers of new characters allowed comes the risk of introducing confusion for users. The PRECIS framework (much like the earlier framework [stringprep](https://tools.ietf.org/html/rfc3454)) aims to avoid this through mapping confusable characters to a single base character, and by allowing specific known-good characters.

The PRECIS framework represents the most modern standardized solution today for doing this sort of mapping and handling of internationalized names, and should mitigate most of the issues around this.
Contributor

I think this is a highly misleading statement.

Reading Section 12.5 (Security Considerations - Visually Similar Characters) of RFC 7564 it says:

Because PRECIS-compliant strings can contain almost any properly encoded Unicode code point, it can be relatively easy to fake or mimic some strings in systems that use the PRECIS framework. The fact that some strings are easily confused introduces security vulnerabilities of the kind that have also plagued the World Wide Web, specifically the phenomenon known as phishing.

[...]

Because it is impossible to map visually similar characters without a great deal of context (such as knowing the font families used), the PRECIS framework does nothing to map similar-looking characters together, nor does it prohibit some characters because they look like others.

[...]

The challenges inherent in supporting the full range of Unicode code points have in the past led some to hope for a way to programmatically negotiate more restrictive ranges based on locale, script, or other relevant factors; to tag the locale associated with a particular string; etc. As a general-purpose internationalization technology, the PRECIS framework does not include such mechanisms.

Member Author

Unless I'm mistaken, I believe this would be covered by the rules of the Nickname profile itself here (specifically, 3+4+5). Regardless, I'll have another read over both those documents and probably adjust the text here to make it more clear exactly what I'm referring to, thanks for pointing this out.


Another issue with allowing such a wide range of characters is the ability for those new characters to break the protocol. The section in this document regarding Restricted Characters aims to prevent this from being an issue.

Legacy clients may not work correctly with Unicode names. It is difficult to predict how clients will react to Unicode names, given the wide range of clients in use today. This document does not attempt to address this concern.