Skip to content

bpo-30500: Fix urllib.parse.splithost() to parse correctly fragments #1849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 20, 2017
Merged

bpo-30500: Fix urllib.parse.splithost() to parse correctly fragments #1849

merged 1 commit into from
Jun 20, 2017

Conversation

postmasters
Copy link
Contributor

The current regex based splitting produces a wrong result. For example::

http://abc#@def

Web browsers parse that URL as http://abc/#@def, that is, the host
is abc, the path is /, and the fragment is #@def.

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

Thanks again to your contribution and we look forward to looking at it!

@mention-bot
Copy link

@postmasters, thanks for your PR! By analyzing the history of the files in this pull request, we identified @orsenthil, @ncoghlan and @Yhg1s to be potential reviewers.

@Mariatta
Copy link
Member

@postmasters Can you create an issue for this at https://bugs.python.org and then reference the issue number here in the title? Thanks

@vadmium
Copy link
Member

vadmium commented May 29, 2017

This looks like this will change the behaviour in corner cases. E.g., what happens with empty URL components like splithost("///path;?#")? See https://bugs.python.org/issue22852 about this problem with urlparse and urlsplit. A dramatic change in implementation like this is not good for a bug fix.

Also, keep in mind that splithost is barely documented, and borderline unsupported, especially in Python 3. See my summary at https://bugs.python.org/issue27485#msg270215.

@postmasters postmasters changed the title urllib: Simplify splithost by calling into urlparse. [b.p.o 30500] urllib: Simplify splithost by calling into urlparse. May 29, 2017
@postmasters
Copy link
Contributor Author

@Mariatta Done.

@vadmium You rightly mentioned that splithost is not well documented. Does that not contradict your first statement that this is a "dramatic change"? Do you know of any other user of splithost than urllib itself? Thanks.

@Mariatta Mariatta changed the title [b.p.o 30500] urllib: Simplify splithost by calling into urlparse. bpo-30500: urllib: Simplify splithost by calling into urlparse. May 29, 2017
@vadmium
Copy link
Member

vadmium commented May 29, 2017

What I meant was that splithost is basically deprecated as an external API. Bug fixes are okay, especially where they fix problems elsewhere in Python’s library, but unnecessary changes in behaviour or features could be a problem.

A quick search in my computer’s 2.7 directory shows splithost is used in various external packages. Many (bittorrent, m2crypto, feedparser, . . .) look like forks of internal Python libraries (urllib, xmlrpclib), and a few (subvertpy, twisted) may be original code.

rest = [fields.params, fields.query, fields.fragment]
if not path and any(rest):
path = '/'
return (None if fields.netloc == '' else fields.netloc,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split this long line into a more regular if/else to make the code more readable.

@vstinner vstinner added the type-security A security issue label May 29, 2017
@@ -982,6 +982,8 @@ def test_splithost(self):
('www.example.org:80', ''))
self.assertEqual(splithost('/foo/bar/baz.html'),
(None, '/foo/bar/baz.html'))
self.assertEqual(splithost('//127.0.0.1#@host.com'),
('127.0.0.1', '/#@host.com'))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see this corner case tested in other functions like urlparse().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also test a comment which contains ":"?

@@ -982,6 +982,8 @@ def test_splithost(self):
('www.example.org:80', ''))
self.assertEqual(splithost('/foo/bar/baz.html'),
(None, '/foo/bar/baz.html'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a reference to bpo-30500:

# bpo-30500: comment which contains @ must not be parsed as a username:password

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You misunderstood the issue here. The issue is the precedence of fragments and basic authentication info. I don't think there is a concept of a comment in URL.

@postmasters
Copy link
Contributor Author

@vadmium I'm unsure of deprecation policy. If a function is deprecated, do we care if its behavior changes? After all, it's going to be removed (the purpose of deprecation), right? Others should normally not depend on deprecated functions. Besides, do we want to keep the broken behavior (as compared to regular web browsers) of a deprecated function? Also, where does it say that these functions are deprecated?

My plan is to base all these split* on urlparse. Then any fix to urlparse is automatically enjoyed by these functions. But I'm sure this hasty fix might break some other code. I'd love to hear your suggestion, Martin.

@postmasters
Copy link
Contributor Author

@Haypo any further comment?

@vadmium
Copy link
Member

vadmium commented Jun 3, 2017

The documentation doesn’t say much about these functions. The best I know of is at the bottom of https://docs.python.org/2.7/library/urllib.html#utility-functions, which “recommends” against directly using the functions. But judging by https://bugs.python.org/issue27485 I think there is a consensus to formally deprecate them. I think there are policies about waiting one or more versions between deprecation and removal, and maintaining compatibility with 2.7 until it is dead.

I think it is best to avoid changing behaviour of public APIs, including deprecated APIs, unless the behaviour is incorrect. So it may be okay to fix the splithost("//abc#@def") behaviour, but avoid spurious changes to corner cases if practical. Although I haven’t actually tried your code to see if it changes corner cases, or thought if there are other practical ways fix the bug.

@postmasters
Copy link
Contributor Author

@vadmium so it looks like you're half-okay with this, pending on corner cases. The best I know is all existing test cases, including a few new ones, pass. That kinda gives some sort of assurance. Do let me know if you find out otherwise.

Other reviewers, could I have a pair of eyes on this again please? Thanks so much.

@vadmium
Copy link
Member

vadmium commented Jun 5, 2017

Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation:

>>> urllib.parse.splithost("///file")  # Is the host the empty string or None?
('', '/file')
>>> urllib.parse.splithost("//example.net/file;")  # Is the semicolon retained?
('example.net', '/file;')
>>> urllib.parse.splithost("//example.net/file?")  # Question mark retained?
('example.net', '/file?')
>>> urllib.parse.splithost("//example.net/file#")  # Hash symbol retained?
('example.net', '/file#')

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change lacks an entry in Misc/NEWS. You may add a [Security] prefix in your entry.

@vstinner
Copy link
Member

[ First-time contributor ]

Oh, it's your first contrib to CPython: in that case, please also add yourself to Misc/ACKS.

@vstinner
Copy link
Member

Since this change fixes a major security vulnerability, I took the liberty of pushing directly to your repository @postmasters. I hope that it is ok with you. I would like to fix this issue quickly.

@vstinner vstinner dismissed their stale review June 14, 2017 16:20

I made the changes that I requested.

@vstinner vstinner changed the title bpo-30500: urllib: Simplify splithost by calling into urlparse. bpo-30500: Fix urllib.parse.splithost() to parse correctly fragments Jun 14, 2017
@postmasters
Copy link
Contributor Author

Not at all. Thanks for doing that, @Haypo. I still don't know how best to address @vadmium test cases. They seem to be related, yet unrelated. I don't really know if the difference in behavior is important enough.

@vstinner
Copy link
Member

@vadmium: "Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation:"

I added your splithost() test cases to test_urlparse, but the results are different from Python 3.5. Is it a backward incompatible change? Do you see a way to fix it?

Copy link
Member

@vadmium vadmium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment against the existing regular expression for my (untested) solution

Misc/NEWS Outdated
@@ -362,6 +362,11 @@ Extension Modules
Library
-------

- [Security] bpo-30500: Fix urllib.parse.splithost() to parse correctly
fragments. For example, ``splithost('http://127.0.0.1#@evil.com/')`` now
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. . . to correctly parse fragments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

def splithost(url):
"""splithost('//host[:port]/path') --> 'host[:port]', '/path'."""
global _hostprog
if _hostprog is None:
_hostprog = re.compile('//([^/?]*)(.*)', re.DOTALL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW my first stop for a minimal change with the desired behaviour would be adjusting this line: change [^/?] to [^/?#]. I haven’t actually tested this though. It would be similar to how https://bugs.python.org/issue1457264 was fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done by @postmasters

path = '/' + path
return host_port, path
return None, url
fields = urlparse(url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour change for a semicolon (;) with no parameters could be avoided by using urlsplit and urlunsplit rather than urlparse and urlunparse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@postmasters rewrote this change to only modify the regex, so it doesn't change the behaviour anymore.

@vadmium
Copy link
Member

vadmium commented Jun 17, 2017

If someone was relying on splithost to faithfully retain the original URL’s path + parameters + query + fragment, dropping empty components may not be compatible. I don’t know of a specific use case, but it seems plausable that this could e.g. break code that maintains a cache based on the URL, if it cannot find the cache because part of the URL was stripped. Also, changing the data type for empty hosts from str to None seems problematic.

Also, being unable to distinguish an empty host (e.g. ////evil.com with four slashes) from a missing host (/path) has been reported as a security problem: https://bugs.python.org/issue23505. It would be good to avoid opening one hole as we close another hole.

@vstinner
Copy link
Member

vstinner commented Jun 17, 2017 via email

@postmasters
Copy link
Contributor Author

The regex fix is the quickest and easiest to pull. But I'm afraid there might be further corner cases that we are not aware of. Besides, having different versions of utilities like this (split* and urlsplit/urlparse) clashes with "there should be one obvious way".

I really do think that split* functions should preferably be refactored to use urlparse rather than continuing down the regex path.

Then again, we could also fix the regex first, then work on refactoring later too.

@vstinner
Copy link
Member

vstinner commented Jun 17, 2017 via email

@vadmium
Copy link
Member

vadmium commented Jun 17, 2017

@Haypo, I probably won’t be making a pull request in the near future (limited time & priorities). But I am happy if you or someone else makes the change, as long as it works.

@postmasters
Copy link
Contributor Author

I've updated the code to do what @vadmium suggested. Please take another look. Thanks.

The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
@vstinner
Copy link
Member

Oh nice @postmasters, you rewrite your urllib change to only modify the regex: I like the new version! I took the liberty of fixing @vadmium requested change on the NEWS entry (parse correctly => correctly parse), since I wrote the NEWS entry and that I would like to get this change merged as soon as possible.

I will take care of backports.

@vstinner vstinner merged commit 90e01e5 into python:master Jun 20, 2017
@bedevere-bot
Copy link

GH-2289 is a backport of this pull request to the 3.6 branch.

@bedevere-bot
Copy link

GH-2290 is a backport of this pull request to the 3.5 branch.

vstinner added a commit that referenced this pull request Jun 20, 2017
… (#2290)

The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
(cherry picked from commit 90e01e5)
vstinner added a commit that referenced this pull request Jun 20, 2017
… (#2289)

The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
(cherry picked from commit 90e01e5)
@bedevere-bot
Copy link

GH-2294 is a backport of this pull request to the 2.7 branch.

vstinner added a commit that referenced this pull request Jun 20, 2017
… (#2294)

The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
(cherry picked from commit 90e01e5)
ned-deily pushed a commit to ned-deily/cpython that referenced this pull request Jul 7, 2017
…on#1849) (python#2289)

The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
(cherry picked from commit 90e01e5)
(cherry picked from commit 536c1f1)
larryhastings pushed a commit that referenced this pull request Jul 12, 2017
… (#2291)

The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
(cherry picked from commit 90e01e5)
ned-deily pushed a commit that referenced this pull request Jul 26, 2017
…#1849) (#2292)

The current regex based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
(cherry picked from commit 90e01e5)
(cherry picked from commit cc54c1c)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-security A security issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants