Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use html5lib to parse HTML instead of using regexs #973

Merged
merged 4 commits into from
Jun 5, 2013
Merged

Conversation

dstufft
Copy link
Member

@dstufft dstufft commented Jun 2, 2013

HTML is not a regular language and thus cannot be accurately parsed
with a regex. The regexs that pip had were gnarly, and required
multiple levels of them in order to "work".

Using html5lib instead of regexs makes the parsing step simpler,
better tested, and much more robust.

There is still some "HTML" being parsed by regex, but there's not
a whole lot that can be done about that bit except remove it
completely. It depends on applying significance to html inside
of an html comment and is really horrible.

@dstufft
Copy link
Member Author

dstufft commented Jun 2, 2013

This is broken into 2 commits to make it easier to review. The first commit deals only with actually vendoring html5lib (and six because it's a dep of html5lib) and the second commit is the actual changes to pip.

HTML is not a regular language and thus cannot be accurately parsed
with a regex. The regexs that pip had were gnarly, and required
multiple levels of them in order to "work".

Using html5lib instead of regexs makes the parsing step simpler,
better tested, and much more robust.

There is still some use of regex to parse html. Namely the hack
that looks for a <th> tag with a "rel" after it. This hack is
required to work inside of HTML comments so it cannot be parsed
as HTML.
@dstufft
Copy link
Member Author

dstufft commented Jun 2, 2013

@mitsuhiko on IRC noted that this may break some peoples internal deploys.

For what it's worth, html5lib is designed to implement parsing the same way as the browsers do, and I've yet to find something it choked on. As far as I can tell something would need to have html so malformed that a browser is unlikely to be able to render it.

The XPath queries are very simple (Give me anything with the a tag) so there should not be any assumption of structure.

I do not believe there will be any major issue by merging this, however I wanted to note @mitsuhiko's concern for other reviewers.

from __future__ import absolute_import

# Monkeypatch pip.vendor.six into just six
# This is kind of terrible, however it is the least bad of 3 bad options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (2), not (3), is the least bad of these options. Changing imports of six within html5lib is a simple, mechanical, easily-automated modification to upstream. It is not a significant barrier to upgrading the vendored html5lib in the future (just copy in the new html5lib and run a simple search-and-replace to change the imports), and certainly does not "pave the way for us to be in charge of maintaining it." Messing with sys.modules like this is much, much worse; people do import and use pip as an API, and so all the issues with (1) will still be a problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be a bad idea to actually include a script that can do the search and replace correctly on the html5lib imports of six, for the benefit of future upgraders of html5lib.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opted not to include a script simply because I plan to write a utility to handle vendoring things better and I'll just include this in that rather then attempt to make a one off script.

* It was determined that simply rewriting the imports was less
  invasive than patching sys.modules.
dstufft added a commit that referenced this pull request Jun 5, 2013
Use html5lib to parse HTML instead of using regexs
@dstufft dstufft merged commit 8344fb8 into develop Jun 5, 2013
@dstufft dstufft deleted the parse-html branch June 5, 2013 18:51
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 5, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants