-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature Request] Rigerously specify e-mail address validation #236
Labels
Feature
New feature or request
Comments
rodaine
pushed a commit
to bufbuild/protovalidate-cc
that referenced
this issue
Aug 7, 2024
Updates Protobuf to v27 and protovalidate to v0.7.1, and fixes all of the resulting compilation and conformance failures. As one would expect, there was a tremendous amount of troubleshooting involved in this thankfully-relatively-small PR. Here's my log of what happened. I'll try to be succinct, but I want to capture all of the details so my reasoning can be understood in the future. - First, I tried to update protobuf. This led to pulling a newer version of absl. The version of cel-cpp we use did not compile with this version of absl. - Next, I tried to update cel-cpp. However, the latest version of cel-cpp is broken on macOS for two separate reasons <sup>[1](google/cel-cpp#831), [2](https://github.com/google/cel-cpp/issues/832)</sup>. - After taking a break to work on other protovalidate implementations I returned and tried another approach. This time, instead of updating cel-cpp, I just patched it to work with newer absl. Thankfully, this proved surprisingly viable. The `cel_cpp.patch` file now contains this fix too. - Unfortunately, compilation was broken in CI on a non-sense compiler error: ``` error: could not convert template argument 'ptr' from 'const google::protobuf::Struct& (* const)()' to 'const google::protobuf::Struct& (* const)()' ``` It seemed likely to be a compiler issue, thus I was stalled again. - For some reason it finally occurred to me that I probably should just simply update the compiler. In a stroke of accidental rubber-ducking luck, I noticed that GitHub's `ubuntu-latest` had yet to actually move to `ubuntu-24.04`, which has a vastly more up-to-date C++ toolchain than the older `ubuntu-22.04`. This immediately fixed the problem. - E-mail validation is hard. In other languages we fall back on standard library functionality, but C++ puts us at a hard impasse; the C++ standard library hardly concerns itself with application-level functionality like SMTP standards. Anyway, I channeled my frustration at the lack of a consistent validation scheme for e-mail, which culminated into bufbuild/protovalidate#236. For the new failing test cases, we needed to improve the validation of localpart in C++. Lacking any specific reference point, I decided it would be acceptable if the C++ version started adopting ideas from WHATWG HTML email validation. It doesn't move the `localpart` validation to _entirely_ work like WHATWG HTML email validation, as our version still has our specific checks, but now we are a strict subset in protovalidate-cc, so we can remove our additional checks later if we can greenlight adopting the WHATWG HTML standard. - The remaining test failures are all related to ignoring validation rules and presence. The following changes were made: - The algorithm for ignoring empty fields is improved to match the specified behavior closer. - The `ignore` option is now taken into account in addition to the legacy `skipped` and `ignore_empty` options. - Support is added for `IGNORE_IF_DEFAULT_VALUE` - An edge case is added to ignore field presence on synthetic `Map` types. I haven't traced down why, but `has_presence` seems to always be true for fields of synthetic `Map` types in the C++ implementation. (Except in proto3?) And with that I think we will have working Editions support.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Feature description:
Email address validation is underspecified and underdocumented, and protovalidate implementations in different languages use very different e-mail parsing codepaths leading to different validation results in edge cases. E-mail validation should be rigorously specified and implemented consistently across languages, as the results of validation should be consistent across programming languages.
Furthermore, the e-mail validation should be as minimally surprising as possible, so we should leverage existing industry standards as much as possible, particularly ones that reflect the real world and don't hinder e.g. internationalization.
Also, the conformance test suite should be expanded to ensure that the edge cases are consistent across implementations.
Proposed implementation or solution:
I suggest we use the e-mail validation specified in the WHATWG HTML standard, for the following reasons:
<input type="email">
I did some exploration into what it would look like to implement RFC 5322-based e-mail address validation, which I will provide here:
Exploring RFC 5322 for e-mail address validation
RFC 5322 rules
Here is a summary of the grammar productions relevant to the local-part of an e-mail address, according to RFC 5322. Per our current validation, productions beginning with 'obs-' should probably be disallowed, as well as productions allowing folding whitespace within e-mail addresses.
We'll ignore the address part, since protovalidate already has an approach to validating hostnames anyways.
Simplified RFC 5322 Rules
Here's a version of the above rules with whitespace disallowed outside of quotes and escapes and with obsolete productions removed.
Regular expression translation
It is possible to express this entire grammar using regular expressions, since it doesn't need backtracking or recursion.
Pseudo-code form
The above regular expression is unreadable and probably pretty slow. Here is the same grammar parsed with Go-like pseudo-code.
matchLocalPart
returns the email address after the '@' if the local-part is valid, or an empty string if it is not.Note that RFC 5322 does not allow for localpart to contain non-US ASCII characters yet. RFC 6531 proposes allowing non-ASCII characters, but it is still in the proposal stage. Either way, we can work on the byte level since we do not care about codepoints above 0x7F. (If we want to adopt the RFC 6531 behavior at any point, I believe we just want to allow >= 0x80 in
qtext
andatext
.)Here is a similar implementation in Python. This is written to work on a
memoryview
since it is more efficient to slice amemoryview
than astr
. Unlike the Go version, this version uses exception handling for errors.Summary
Implementing RFC 5322 rules in a readable fashion is doable in most target languages using a hand-written parser. It can be done in under 100 lines.
However, while this parser is strict enough to adhere to RFC 5322, it has the caveat that it may be both more strict and more lenient than some real world mail servers in some situations, so it is far from ideal.
An implementation of the WHATWG HTML would be very trivial. The local-part of the HTML version is a strict subset of the RFC 5322 version; specifically, it is almost identical to the
dot-atom-text
production, and thematchDotAtom
/_match_dot_atom
psuedo-code examples should be a near match (after allowing codepoints above0x7f
in atext.) Meanwhile, the hostname portion of the e-mail in the WHATWG HTML standard seems to also be a near-exact match for our existing hostname validation that we already also use for e-mail.The text was updated successfully, but these errors were encountered: