[Feature Request] Rigorously specify e-mail address validation #236

jchadwick-buf · 2024-08-05T23:40:18Z

Feature description:
Email address validation is underspecified and underdocumented, and protovalidate implementations in different languages use very different e-mail parsing code paths, leading to different validation results in edge cases. E-mail validation should be rigorously specified and implemented so that its results are consistent across programming languages.

Furthermore, the e-mail validation should be as minimally surprising as possible, so we should leverage existing industry standards as much as possible, particularly ones that reflect the real world and don't hinder e.g. internationalization.

Also, the conformance test suite should be expanded to ensure that the edge cases are consistent across implementations.

Proposed implementation or solution:
I suggest we use the e-mail validation specified in the WHATWG HTML standard, for the following reasons:

- It is the validation format adopted by web browsers for <input type="email">.
- RFC 5322, the standard that authoritatively defines e-mail address formatting, is woefully out of touch with real-world implementations.
- Standards that build on RFC 5322, like RFC 6531 which adds support for internationalized e-mail addresses, are often incomplete and ambiguous, and often themselves not standardized.
- We can lean on regex engines to implement it if we want. Chrome uses it this way, and it is a simple enough regex that it should work fine in more restrictive engines like re2. Since the grammar is very simple and has few productions, hand-written parsers should also be very easy to implement.
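
For reference, here is a minimal sketch of what the regex route might look like in Python. The pattern is the one the WHATWG HTML standard gives for a valid e-mail address; the helper name `is_valid_email_whatwg` is hypothetical and not part of any protovalidate API. The pattern uses only character classes, non-capturing groups, and bounded repetition.

```python
import re

# Pattern from the WHATWG HTML standard's "valid e-mail address" definition.
# The hyphen sits last in the first character class so it is read literally.
_WHATWG_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def is_valid_email_whatwg(address: str) -> bool:
    """Hypothetical helper: True if address matches the WHATWG pattern."""
    # fullmatch anchors both ends, so a trailing newline is rejected too.
    return _WHATWG_EMAIL.fullmatch(address) is not None
```

Using `fullmatch` rather than `^...$` avoids Python's `$` matching just before a trailing newline, which is one of the small per-language differences a shared specification would need to pin down.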

I did some exploration into what it would look like to implement RFC 5322-based e-mail address validation, which I will provide here:

Exploring RFC 5322 for e-mail address validation

RFC 5322 rules

Here is a summary of the grammar productions relevant to the local-part of an e-mail address, according to RFC 5322. Per our current validation, productions beginning with 'obs-' should probably be disallowed, as well as productions allowing folding whitespace within e-mail addresses.

We'll ignore the domain part, since protovalidate already has an approach to validating hostnames anyway.

; rfc5234 rules
ALPHA           =   %x41-5A / %x61-7A  ; A-Z / a-z
CR              =   %x0D               ; carriage return
LF              =   %x0A               ; linefeed
CRLF            =   CR LF              ; Internet standard newline
DIGIT           =   %x30-39            ; 0-9
DQUOTE          =   %x22               ; " (Double Quote)
HTAB            =   %x09               ; horizontal tab
SP              =   %x20
VCHAR           =   %x21-7E            ; visible (printing) characters
WSP             =   SP / HTAB          ; white space
; folding whitespace
obs-FWS         =   1*WSP *(CRLF 1*WSP)
FWS             =   ([*WSP CRLF] 1*WSP) /  obs-FWS
ctext           =   %d33-39 /          ; Printable US-ASCII
                    %d42-91 /          ;  characters not including
                    %d93-126 /         ;  "(", ")", or "\"
                    obs-ctext
ccontent        =   ctext / quoted-pair / comment
comment         =   "(" *([FWS] ccontent) [FWS] ")"
CFWS            =   (1*([FWS] comment) [FWS]) / FWS
; atom
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"
atom            =   [CFWS] 1*atext [CFWS]
; quoted string
qtext           =   %d33 /             ; Printable US-ASCII
                    %d35-91 /          ;  characters not including
                    %d93-126 /         ;  "\" or the quote character
                    obs-qtext
quoted-pair     =   ("\" (VCHAR / WSP)) / obs-qp
qcontent        =   qtext / quoted-pair
quoted-string   =   [CFWS]
                    DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                    [CFWS]
word            =   atom / quoted-string
; obsolete productions
obs-NO-WS-CTL   =   %d1-8 /            ; US-ASCII control
                    %d11 /             ;  characters that do not
                    %d12 /             ;  include the carriage
                    %d14-31 /          ;  return, line feed, and
                    %d127              ;  white space characters
obs-ctext       =   obs-NO-WS-CTL
obs-qtext       =   obs-NO-WS-CTL
obs-qp          =   "\" (%d0 / obs-NO-WS-CTL / LF / CR)
obs-local-part  =   word *("." word)
; dot-atom
dot-atom-text   =   1*atext *("." 1*atext)
dot-atom        =   [CFWS] dot-atom-text [CFWS]
; local part
local-part      =   dot-atom / quoted-string / obs-local-part

Simplified RFC 5322 Rules

Here's a version of the above rules with whitespace disallowed outside of quotes and escapes and with obsolete productions removed.

; rfc5234 rules
ALPHA           =   %x41-5A / %x61-7A  ; A-Z / a-z
CR              =   %x0D               ; carriage return
LF              =   %x0A               ; linefeed
CRLF            =   CR LF              ; Internet standard newline
DIGIT           =   %x30-39            ; 0-9
DQUOTE          =   %x22               ; " (Double Quote)
HTAB            =   %x09               ; horizontal tab
SP              =   %x20
VCHAR           =   %x21-7E            ; visible (printing) characters
WSP             =   SP / HTAB          ; white space
; folding whitespace
FWS             =   ([*WSP CRLF] 1*WSP)
ctext           =   %d33-39 /          ; Printable US-ASCII
                    %d42-91 /          ;  characters not including
                    %d93-126           ;  "(", ")", or "\"
ccontent        =   ctext / quoted-pair / comment
comment         =   "(" *([FWS] ccontent) [FWS] ")"
CFWS            =   (1*([FWS] comment) [FWS]) / FWS
; atom
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"
; quoted string
qtext           =   %d33 /             ; Printable US-ASCII
                    %d35-91 /          ;  characters not including
                    %d93-126           ;  "\" or the quote character
quoted-pair     =   ("\" (VCHAR / WSP))
qcontent        =   qtext / quoted-pair
quoted-string   =   DQUOTE *([FWS] qcontent) [FWS] DQUOTE
; dot-atom
dot-atom        =   1*atext *("." 1*atext)
; local part
local-part      =   dot-atom / quoted-string

Regular expression translation

It is possible to express this entire grammar using regular expressions, since the productions reachable from local-part no longer involve recursion.

; quoted string
qtext           =   /[\x21\x23-\x5b\x5d-\x7e]/
quoted-pair     =   /\\[ \t\x21-\x7E]/
qcontent        =   /[\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7E]/
quoted-string   =   /"((([ \t]*\r\n)?[ \t]+)?([\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7E]))*(([ \t]*\r\n)?[ \t]+)?"/
; dot-atom
atext           =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]/
dot-atom        =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*/
; local part
local-part      =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"((([ \t]*\r\n)?[ \t]+)?([\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7E]))*(([ \t]*\r\n)?[ \t]+)?"/
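
As a rough sketch of how these pieces might be composed and exercised, here is one way to build the local-part expression in Python from named fragments. The constant names are hypothetical. Three details are deliberate choices in this sketch: the hyphen is placed last in the atext class so it is not read as a range, CRLF is matched as the two-byte sequence `\r\n`, and the optional folding whitespace applies to both alternatives of qcontent.

```python
import re

# Building blocks mirroring the simplified grammar (names are illustrative).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"   # '-' last: a literal hyphen
QTEXT = r"[\x21\x23-\x5b\x5d-\x7e]"
QUOTED_PAIR = r"\\[ \t\x21-\x7e]"
FWS = r"(?:(?:[ \t]*\r\n)?[ \t]+)"           # CRLF as a two-byte sequence

DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
QUOTED_STRING = rf'"(?:{FWS}?(?:{QTEXT}|{QUOTED_PAIR}))*{FWS}?"'
LOCAL_PART = re.compile(rf"(?:{DOT_ATOM}|{QUOTED_STRING})")


def is_local_part(text: str) -> bool:
    """True if text, taken as a whole, is a local-part under these rules."""
    return LOCAL_PART.fullmatch(text) is not None
```

Under this sketch `john.doe` and `"john doe"` are accepted, while `john..doe`, `.john`, and `john,doe` are rejected.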

Pseudo-code form

The above regular expression is unreadable and probably pretty slow. Here is the same grammar parsed with Go-like pseudo-code.

matchLocalPart returns the remainder of the e-mail address after the '@' if the local-part is valid, or an empty string if it is not.

Note that RFC 5322 does not allow the local-part to contain non-ASCII characters. RFC 6531 extends SMTP to allow them, but it is only a Proposed Standard. Either way, we can work on the byte level since we do not care about codepoints above 0x7F. (If we want to adopt the RFC 6531 behavior at any point, I believe we just want to allow bytes >= 0x80 in qtext and atext.)

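// matchLocalPart consumes a local-part followed by '@' and returns the
// remainder of the address, or "" if the local-part is invalid.
func matchLocalPart(email string) string {
	if len(email) == 0 {
		return ""
	}
	switch {
	case email[0] == '"':
		email = matchQuotedString(email)
	case isAText(email[0]):
		email = matchDotAtom(email)
	default:
		// Neither a quoted-string nor a dot-atom. This also rejects an
		// empty local-part such as "@example.com".
		return ""
	}
	if len(email) == 0 || email[0] != '@' {
		return ""
	}
	return email[1:]
}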

func matchQuotedString(email string) string {
	email = email[1:]
	for {
		if len(email) == 0 {
			return ""
		}
		switch email[0] {
		case '"':
			return email[1:]
		case '\\':
			if email = email[1:]; len(email) == 0 {
				return ""
			}
			if !isQuotedPair(email[0]) {
				return ""
			}
			email = email[1:]
		default:
			if !isQText(email[0]) && !isWSP(email[0]) {
				return ""
			}
			email = email[1:]
		}
	}
}

func matchDotAtom(email string) string {
	for {
		if len(email) == 0 {
			return ""
		}
		switch email[0] {
		case '@':
			return email
		case '.':
			if email = email[1:]; len(email) == 0 {
				return ""
			}
			fallthrough
		default:
			if !isAText(email[0]) {
				return ""
			}
			email = email[1:]
		}
	}
}

func isAText(b byte) bool {
	return (b >= 'a' && b <= 'z') ||
		(b >= 'A' && b <= 'Z') ||
		(b >= '0' && b <= '9') ||
		b == '!' || b == '#' || b == '$' || b == '%' ||
		b == '&' || b == '*' || b == '+' || b == '-' ||
		b == '/' || b == '=' || b == '?' || b == '^' ||
		b == '_' || b == '`' || b == '{' || b == '|' ||
		b == '}' || b == '~' || b == '\''
}

func isQText(b byte) bool {
	return b == '!' || (b >= '#' && b <= '[') || (b >= ']' && b <= '~')
}

func isQuotedPair(b byte) bool {
	return b == ' ' || b == '\t' || (b >= 0x21 && b <= 0x7e)
}

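// isWSP reports whether b may appear as whitespace inside a quoted-string.
// Accepting bare CR and LF here is a simplification: the grammar's FWS only
// allows them as a CRLF pair followed by at least one space or tab.
func isWSP(b byte) bool {
	return b == ' ' || b == '\t' || b == '\r' || b == '\n'
}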

Here is a similar implementation in Python. This is written to work on a memoryview, since slicing a memoryview does not copy the underlying bytes, whereas slicing a str does. Unlike the Go version, this version uses exception handling for errors.

from typing import Sequence

_AT = ord('@')
_DQUOTE = ord('"')
_BACKSLASH = ord('\\')
_PERIOD = ord('.')

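def _match_local_part(email: Sequence[int]) -> Sequence[int]:
    # Returns the remainder of the address after the '@'; raises if the
    # local-part is invalid.
    if len(email) == 0:
        raise Exception('Empty address')
    if email[0] == _DQUOTE:
        email = _match_quoted_string(email)
    elif _is_atext(email[0]):
        email = _match_dot_atom(email)
    else:
        # Neither a quoted-string nor a dot-atom. This also rejects an
        # empty local-part such as "@example.com".
        raise Exception('Invalid local part')
    if len(email) == 0 or email[0] != _AT:
        raise Exception('Invalid address')
    return email[1:]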

def _match_quoted_string(email: Sequence[int]) -> Sequence[int]:
    email = email[1:]
    while True:
        if len(email) == 0:
            raise Exception('Unexpected end of address')
        elif email[0] == _DQUOTE:
            return email[1:]
        elif email[0] == _BACKSLASH:
            email = email[1:]
            if len(email) == 0:
                raise Exception('Unexpected end of address')
            if not _is_quoted_pair(email[0]):
                raise Exception('Invalid quoted pair')
            email = email[1:]
        else:
            if not _is_qtext(email[0]) and not _is_wsp(email[0]):
                raise Exception('Invalid local part')
            email = email[1:]

def _match_dot_atom(email: Sequence[int]) -> Sequence[int]:
    while True:
        if len(email) == 0:
            raise Exception('Unexpected end of address')
        if email[0] == _AT:
            return email
        elif email[0] == _PERIOD:
            email = email[1:]
            if len(email) == 0:
                raise Exception('Unexpected end of address')
        if not _is_atext(email[0]):
            raise Exception('Invalid character')
        email = email[1:]

def _is_atext(b: int) -> bool:
    return (
        (b >= 0x61 and b <= 0x7a) or
        (b >= 0x41 and b <= 0x5a) or
        (b >= 0x30 and b <= 0x39) or
        b == 0x21 or b == 0x23 or b == 0x24 or b == 0x25 or
        b == 0x26 or b == 0x27 or b == 0x2a or b == 0x2b or
        b == 0x2d or b == 0x2f or b == 0x3d or b == 0x3f or
        b == 0x5e or b == 0x5f or b == 0x60 or b == 0x7b or
        b == 0x7c or b == 0x7d or b == 0x7e
    )

def _is_qtext(b: int) -> bool:
    return b == 0x21 or (b >= 0x23 and b <= 0x5b) or (b >= 0x5d and b <= 0x7e)

def _is_quoted_pair(b: int) -> bool:
    return b == 0x20 or b == 0x09 or (b >= 0x21 and b <= 0x7e)

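def _is_wsp(b: int) -> bool:
    # Accepting bare CR and LF is a simplification: the grammar's FWS only
    # allows them as a CRLF pair followed by at least one space or tab.
    return b == 0x20 or b == 0x09 or b == 0x0d or b == 0x0a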

Summary

Implementing RFC 5322 rules in a readable fashion is doable in most target languages using a hand-written parser. It can be done in under 100 lines.

However, while this parser is strict enough to adhere to RFC 5322, it has the caveat that it may be both more strict and more lenient than some real world mail servers in some situations, so it is far from ideal.

An implementation of the WHATWG HTML rules would be trivial. The local-part of the HTML version is close to a subset of the RFC 5322 version: it has no quoted-string form and is almost identical to the dot-atom-text production, except that it does not restrict where dots may appear. The matchDotAtom/_match_dot_atom pseudo-code examples should be a near match (after allowing codepoints above 0x7f in atext). Meanwhile, the hostname portion of the e-mail address in the WHATWG HTML standard seems to also be a near-exact match for the hostname validation that we already use for e-mail.
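
To make the relationship between the two local-part rules concrete, here is a small sketch that checks a candidate against each. The function names are hypothetical. As I read the WHATWG grammar, its local part is `1*( atext / "." )`, so it accepts any non-empty run of atext characters and dots, whereas dot-atom-text requires a non-empty atom on both sides of every dot.

```python
_ATEXT = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789!#$%&'*+-/=?^_`{|}~"
)


def is_dot_atom_text(local: bytes) -> bool:
    """RFC 5322 dot-atom-text: 1*atext *("." 1*atext)."""
    # Splitting on '.' yields an empty piece for a leading, trailing, or
    # doubled dot, and for empty input, all of which must be rejected.
    return all(
        len(atom) > 0 and all(b in _ATEXT for b in atom)
        for atom in local.split(b".")
    )


def is_whatwg_local_part(local: bytes) -> bool:
    """WHATWG HTML local part: 1*( atext / "." )."""
    return len(local) > 0 and all(b in _ATEXT or b == 0x2E for b in local)
```

Both functions accept `b"john.doe"`, but only `is_whatwg_local_part` accepts `b"john..doe"`, which is the kind of edge case the conformance suite would need to cover.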

Updates Protobuf to v27 and protovalidate to v0.7.1, and fixes all of the resulting compilation and conformance failures.

As one would expect, there was a tremendous amount of troubleshooting involved in this thankfully-relatively-small PR. Here's my log of what happened. I'll try to be succinct, but I want to capture all of the details so my reasoning can be understood in the future.

- First, I tried to update protobuf. This led to pulling a newer version of absl. The version of cel-cpp we use did not compile with this version of absl.
- Next, I tried to update cel-cpp. However, the latest version of cel-cpp is broken on macOS for two separate reasons <sup>[1](google/cel-cpp#831), [2](https://github.com/google/cel-cpp/issues/832)</sup>.
- After taking a break to work on other protovalidate implementations I returned and tried another approach. This time, instead of updating cel-cpp, I just patched it to work with newer absl. Thankfully, this proved surprisingly viable. The `cel_cpp.patch` file now contains this fix too.
- Unfortunately, compilation was broken in CI on a non-sense compiler error:
  ```
  error: could not convert template argument 'ptr' from 'const google::protobuf::Struct& (* const)()' to 'const google::protobuf::Struct& (* const)()'
  ```
  It seemed likely to be a compiler issue, thus I was stalled again.
- For some reason it finally occurred to me that I probably should just simply update the compiler. In a stroke of accidental rubber-ducking luck, I noticed that GitHub's `ubuntu-latest` had yet to actually move to `ubuntu-24.04`, which has a vastly more up-to-date C++ toolchain than the older `ubuntu-22.04`. This immediately fixed the problem.
- E-mail validation is hard. In other languages we fall back on standard library functionality, but C++ puts us at a hard impasse; the C++ standard library hardly concerns itself with application-level functionality like SMTP standards. Anyway, I channeled my frustration at the lack of a consistent validation scheme for e-mail, which culminated into bufbuild/protovalidate#236. For the new failing test cases, we needed to improve the validation of localpart in C++. Lacking any specific reference point, I decided it would be acceptable if the C++ version started adopting ideas from WHATWG HTML email validation. It doesn't move the `localpart` validation to _entirely_ work like WHATWG HTML email validation, as our version still has our specific checks, but now we are a strict subset in protovalidate-cc, so we can remove our additional checks later if we can greenlight adopting the WHATWG HTML standard.
- The remaining test failures are all related to ignoring validation rules and presence. The following changes were made:
  - The algorithm for ignoring empty fields is improved to match the specified behavior closer.
  - The `ignore` option is now taken into account in addition to the legacy `skipped` and `ignore_empty` options.
  - Support is added for `IGNORE_IF_DEFAULT_VALUE`
  - An edge case is added to ignore field presence on synthetic `Map` types. I haven't traced down why, but `has_presence` seems to always be true for fields of synthetic `Map` types in the C++ implementation. (Except in proto3?)

And with that I think we will have working Editions support.

jchadwick-buf added the Feature New feature or request label Aug 5, 2024

jchadwick-buf mentioned this issue Aug 6, 2024

Editions fixes bufbuild/protovalidate-cc#58

Merged
