-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make METHRE RFC-7230 compliant #3235
Conversation
After review of #3233 I found this inconsistentcy between our regex and the spec. If fix is ok, will finish the rest bits. |
24a60db
to
881a7a0
Compare
@kxepal could you look on failed tests? |
Oh, I didn't run full test suite, just test_http_parser. Now should be fixed. |
928df82
to
9272c20
Compare
Codecov Report
@@ Coverage Diff @@
## master #3235 +/- ##
=======================================
Coverage 98.07% 98.07%
=======================================
Files 43 43
Lines 7856 7856
Branches 1354 1354
=======================================
Hits 7705 7705
Misses 59 59
Partials 92 92
Continue to review full report at Codecov.
|
aiohttp/http_parser.py
Outdated
@@ -29,7 +29,7 @@ | |||
'RawRequestMessage', 'RawResponseMessage') | |||
|
|||
ASCIISET = set(string.printable) | |||
METHRE = re.compile('[A-Z0-9$-_.]+') | |||
METHRE = re.compile("[!#$%&'*+\-.^_`|~0-9A-Za-z]+") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think It might make sense to put a comment with that rule somewhere nearby:
# method = token
# tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
# "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
# token = 1*tchar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is git blame with that information, but having a copy in source code wouldn't hurt anyone. Added that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I know about blame. But it's visually more convenient to see it next to the code. Like it's done in yarl, for example.
aiohttp/http_parser.py
Outdated
# tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." / | ||
# "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA | ||
# token = 1*tchar | ||
METHRE = re.compile("[!#$%&'*+\-.^_`|~0-9A-Za-z]+") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use a raw-string literal r''
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, +
and *
might also need escaping and |
needs escaping (otherwise it divides groups in []
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and .
might stand for any single character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably they have restricted meanings within []
, but IMHO better safe than sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All true, but not for []
sets.
>>> import re
>>> re.match('[.]', 't')
>>> re.match('[.]', 't') is None
True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
digits can be replaced with \d
(to match style below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they cannot. \d
means all the unicode numbers, not just ascii ones:
>>> from hypothesis.strategies import from_regex
>>> thing = from_regex('^\d$').example()
>>> re.match('\d', thing)
<_sre.SRE_Match object; span=(0, 1), match='𝟙'>
>>> thing
'𝟙'
>>> unicodedata.category(thing)
'Nd'
>>> unicodedata.name(thing)
'MATHEMATICAL DOUBLE-STRUCK DIGIT ONE'
@@ -540,7 +540,7 @@ def test_http_request_parser_two_slashes(parser): | |||
|
|||
def test_http_request_parser_bad_method(parser): | |||
with pytest.raises(http_exceptions.BadStatusLine): | |||
parser.feed_data(b'!12%()+=~$ /get HTTP/1.1\r\n\r\n') | |||
parser.feed_data(b'=":<G>(e),[T];?" /get HTTP/1.1\r\n\r\n') |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? These characters are valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's a negative test. This reads confusing. Let's change it to parametrized test, smth like:
@pytest.mark.parametrize(
'invalid_byte',
list(b'=":<>(),[];?"'),
)
@pytest.raises(http_exceptions.BadStatusLine)
def test_http_request_parser_bad_method(parser, invalid_byte):
valid_method = b'Get'
invalid_method = valid_method + invalid_byte
invalid_request_line = invalid_method + b' /get HTTP/1.1\r\n\r\n'
parser.feed_data(invalid_request_line)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises
at least.
Personally I dont insist on pytest parametrization
69b5734
to
7351a1f
Compare
Definition of method is: ``` method = token tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA token = 1*tchar ``` So we had two issues: 1. Not all the characters were allowed. 2. Actually, we did allowed too much characters since `$-_` parsed as: "all characters in range between code 36 ($) and code 95 (_)" instead of "characters with codes 36, 45 and 95". So we did match methods like `[GET]` which are malformed according the spec.
@kxepal please merge when ready |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs. |
Definition of method is:
So he had two issues:
$-_
parsed as:"all characters in range between code 36 ($) and code 95 (_)" instead of
"characters with codes 36, 45 and 95". So we did match methods like
[GET]
which are malformed according the spec.