-
-
Notifications
You must be signed in to change notification settings - Fork 32.1k
bpo-30500: Fix urllib.parse.splithost() to parse correctly fragments #1849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. Thanks again to your contribution and we look forward to looking at it! |
@postmasters, thanks for your PR! By analyzing the history of the files in this pull request, we identified @orsenthil, @ncoghlan and @Yhg1s to be potential reviewers. |
@postmasters Can you create an issue for this at https://bugs.python.org and then reference the issue number here in the title? Thanks |
This looks like this will change the behaviour in corner cases. E.g., what happens with empty URL components like splithost("///path;?#")? See https://bugs.python.org/issue22852 about this problem with urlparse and urlsplit. A dramatic change in implementation like this is not good for a bug fix. Also, keep in mind that splithost is barely documented, and borderline unsupported, especially in Python 3. See my summary at https://bugs.python.org/issue27485#msg270215. |
What I meant was that splithost is basically deprecated as an external API. Bug fixes are okay, especially where they fix problems elsewhere in Python’s library, but unnecessary changes in behaviour or features could be a problem. A quick search in my computer’s 2.7 directory shows splithost is used in various external packages. Many (bittorrent, m2crypto, feedparser, . . .) look like forks of internal Python libraries (urllib, xmlrpclib), and a few (subvertpy, twisted) may be original code. |
Lib/urllib/parse.py
Outdated
rest = [fields.params, fields.query, fields.fragment] | ||
if not path and any(rest): | ||
path = '/' | ||
return (None if fields.netloc == '' else fields.netloc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please split this long line into a more regular if/else to make the code more readable.
Lib/test/test_urlparse.py
Outdated
@@ -982,6 +982,8 @@ def test_splithost(self): | |||
('www.example.org:80', '')) | |||
self.assertEqual(splithost('/foo/bar/baz.html'), | |||
(None, '/foo/bar/baz.html')) | |||
self.assertEqual(splithost('//127.0.0.1#@host.com'), | |||
('127.0.0.1', '/#@host.com')) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see this corner case tested in other functions like urlparse().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test a comment which contains ":"?
Lib/test/test_urlparse.py
Outdated
@@ -982,6 +982,8 @@ def test_splithost(self): | |||
('www.example.org:80', '')) | |||
self.assertEqual(splithost('/foo/bar/baz.html'), | |||
(None, '/foo/bar/baz.html')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a reference to bpo-30500:
# bpo-30500: comment which contains @ must not be parsed as a username:password
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You misunderstood the issue here. The issue is the precedence of fragments and basic authentication info. I don't think there is a concept of a comment in URL.
@vadmium I'm unsure of deprecation policy. If a function is deprecated, do we care if its behavior changes? After all, it's going to be removed (the purpose of deprecation), right? Others should normally not depend on deprecated functions. Besides, do we want to keep the broken behavior (as compared to regular web browsers) of a deprecated function? Also, where does it say that these functions are deprecated? My plan is to base all these |
@Haypo any further comment? |
The documentation doesn’t say much about these functions. The best I know of is at the bottom of https://docs.python.org/2.7/library/urllib.html#utility-functions, which “recommends” against directly using the functions. But judging by https://bugs.python.org/issue27485 I think there is a consensus to formally deprecate them. I think there are policies about waiting one or more versions between deprecation and removal, and maintaining compatibility with 2.7 until it is dead. I think it is best to avoid changing behaviour of public APIs, including deprecated APIs, unless the behaviour is incorrect. So it may be okay to fix the splithost("//abc#@def") behaviour, but avoid spurious changes to corner cases if practical. Although I haven’t actually tried your code to see if it changes corner cases, or thought if there are other practical ways fix the bug. |
@vadmium so it looks like you're half-okay with this, pending on corner cases. The best I know is all existing test cases, including a few new ones, pass. That kinda gives some sort of assurance. Do let me know if you find out otherwise. Other reviewers, could I have a pair of eyes on this again please? Thanks so much. |
Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation: >>> urllib.parse.splithost("///file") # Is the host the empty string or None?
('', '/file')
>>> urllib.parse.splithost("//example.net/file;") # Is the semicolon retained?
('example.net', '/file;')
>>> urllib.parse.splithost("//example.net/file?") # Question mark retained?
('example.net', '/file?')
>>> urllib.parse.splithost("//example.net/file#") # Hash symbol retained?
('example.net', '/file#') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change lacks an entry in Misc/NEWS. You may add a [Security] prefix in your entry.
[ First-time contributor ] Oh, it's your first contrib to CPython: in that case, please also add yourself to Misc/ACKS. |
Since this change fixes a major security vulnerability, I took the liberty of pushing directly to your repository @postmasters. I hope that it is ok with you. I would like to fix this issue quickly. |
@vadmium: "Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation:" I added your splithost() test cases to test_urlparse, but the results are different from Python 3.5. Is it a backward incompatible change? Do you see a way to fix it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment against the existing regular expression for my (untested) solution
Misc/NEWS
Outdated
@@ -362,6 +362,11 @@ Extension Modules | |||
Library | |||
------- | |||
|
|||
- [Security] bpo-30500: Fix urllib.parse.splithost() to parse correctly | |||
fragments. For example, ``splithost('http://127.0.0.1#@evil.com/')`` now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. . . to correctly parse fragments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Lib/urllib/parse.py
Outdated
def splithost(url): | ||
"""splithost('//host[:port]/path') --> 'host[:port]', '/path'.""" | ||
global _hostprog | ||
if _hostprog is None: | ||
_hostprog = re.compile('//([^/?]*)(.*)', re.DOTALL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW my first stop for a minimal change with the desired behaviour would be adjusting this line: change [^/?] to [^/?#]. I haven’t actually tested this though. It would be similar to how https://bugs.python.org/issue1457264 was fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done by @postmasters
Lib/urllib/parse.py
Outdated
path = '/' + path | ||
return host_port, path | ||
return None, url | ||
fields = urlparse(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behaviour change for a semicolon (;) with no parameters could be avoided by using urlsplit and urlunsplit rather than urlparse and urlunparse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@postmasters rewrote this change to only modify the regex, so it doesn't change the behaviour anymore.
If someone was relying on splithost to faithfully retain the original URL’s path + parameters + query + fragment, dropping empty components may not be compatible. I don’t know of a specific use case, but it seems plausable that this could e.g. break code that maintains a cache based on the URL, if it cannot find the cache because part of the URL was stripped. Also, changing the data type for empty hosts from str to None seems problematic. Also, being unable to distinguish an empty host (e.g. ////evil.com with four slashes) from a missing host (/path) has been reported as a security problem: https://bugs.python.org/issue23505. It would be good to avoid opening one hole as we close another hole. |
Martin, it seems like you understand well the issue. I like your idea of
fixing the regex to reduce the risk of backward compatibility regressions.
Would you mind to create a new PR based on this one?
|
The regex fix is the quickest and easiest to pull. But I'm afraid there might be further corner cases that we are not aware of. Besides, having different versions of utilities like this (split* and urlsplit/urlparse) clashes with "there should be one obvious way". I really do think that split* functions should preferably be refactored to use urlparse rather than continuing down the regex path. Then again, we could also fix the regex first, then work on refactoring later too. |
The regex fix is the quickest and easiest to pull. But I'm afraid there
might be further corner cases that we are not aware of.
Security is a thing, backward compatibilty is another thing. I would feel
bad to have to make a choice between the two. I prefer to be pragmatic: fix
exactly the reported security issue and keep backward compatibilty.
A security fix that can break applications is risky or even not acceptable.
|
@Haypo, I probably won’t be making a pull request in the near future (limited time & priorities). But I am happy if you or someone else makes the change, as long as it works. |
I've updated the code to do what @vadmium suggested. Please take another look. Thanks. |
The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``.
Oh nice @postmasters, you rewrite your urllib change to only modify the regex: I like the new version! I took the liberty of fixing @vadmium requested change on the NEWS entry (parse correctly => correctly parse), since I wrote the NEWS entry and that I would like to get this change merged as soon as possible. I will take care of backports. |
GH-2289 is a backport of this pull request to the 3.6 branch. |
GH-2290 is a backport of this pull request to the 3.5 branch. |
GH-2294 is a backport of this pull request to the 2.7 branch. |
…on#1849) (python#2289) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5) (cherry picked from commit 536c1f1)
…#1849) (#2292) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5) (cherry picked from commit cc54c1c)
The current regex based splitting produces a wrong result. For example::
http://abc#@def
Web browsers parse that URL as
http://abc/#@def
, that is, the hostis
abc
, the path is/
, and the fragment is#@def
.