-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use html5lib to parse HTML instead of using regexs #973
Conversation
This is broken into 2 commits to make it easier to review. The first commit deals only with actually vendoring html5lib (and six because it's a dep of html5lib) and the second commit is the actual changes to pip. |
HTML is not a regular language and thus cannot be accurately parsed with a regex. The regexs that pip had were gnarly, and required multiple levels of them in order to "work". Using html5lib instead of regexs makes the parsing step simpler, better tested, and much more robust. There is still some use of regex to parse html. Namely the hack that looks for a <th> tag with a "rel" after it. This hack is required to work inside of HTML comments so it cannot be parsed as HTML.
@mitsuhiko on IRC noted that this may break some peoples internal deploys. For what it's worth, html5lib is designed to implement parsing the same way as the browsers do, and I've yet to find something it choked on. As far as I can tell something would need to have html so malformed that a browser is unlikely to be able to render it. The XPath queries are very simple (Give me anything with the a tag) so there should not be any assumption of structure. I do not believe there will be any major issue by merging this, however I wanted to note @mitsuhiko's concern for other reviewers. |
from __future__ import absolute_import | ||
|
||
# Monkeypatch pip.vendor.six into just six | ||
# This is kind of terrible, however it is the least bad of 3 bad options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think (2), not (3), is the least bad of these options. Changing imports of six within html5lib is a simple, mechanical, easily-automated modification to upstream. It is not a significant barrier to upgrading the vendored html5lib in the future (just copy in the new html5lib and run a simple search-and-replace to change the imports), and certainly does not "pave the way for us to be in charge of maintaining it." Messing with sys.modules
like this is much, much worse; people do import and use pip as an API, and so all the issues with (1) will still be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be a bad idea to actually include a script that can do the search and replace correctly on the html5lib imports of six, for the benefit of future upgraders of html5lib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opted not to include a script simply because I plan to write a utility to handle vendoring things better and I'll just include this in that rather then attempt to make a one off script.
* It was determined that simply rewriting the imports was less invasive than patching sys.modules.
Use html5lib to parse HTML instead of using regexs
HTML is not a regular language and thus cannot be accurately parsed
with a regex. The regexs that pip had were gnarly, and required
multiple levels of them in order to "work".
Using html5lib instead of regexs makes the parsing step simpler,
better tested, and much more robust.
There is still some "HTML" being parsed by regex, but there's not
a whole lot that can be done about that bit except remove it
completely. It depends on applying significance to html inside
of an html comment and is really horrible.