It's not a good idea to parse HTML text using regular expressions #70

Closed
@starrify

Description

In w3lib.html, regular expressions are used to parse HTML text:

import re
import six

_ent_re = re.compile(r'&((?P<named>[a-z\d]+)|#(?P<dec>\d+)|#x(?P<hex>[a-f\d]+))(?P<semicolon>;?)', re.IGNORECASE)
_tag_re = re.compile(r'<[a-zA-Z\/!].*?>', re.DOTALL)
_baseurl_re = re.compile(six.u(r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']'), re.I)
_meta_refresh_re = re.compile(six.u(r'<meta\s[^>]*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P<quote>["\'])(?P<int>(\d*\.)?\d+)\s*;\s*url=\s*(?P<url>.*?)(?P=quote)'), re.DOTALL | re.IGNORECASE)
_cdata_re = re.compile(r'((?P<cdata_s><!\[CDATA\[)(?P<cdata_d>.*?)(?P<cdata_e>\]\]>))', re.DOTALL)

However, this approach gives incorrect results when the input contains commented-out content, e.g.:

>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'
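
For comparison, an actual HTML parser ignores the comment. A quick check with lxml (assuming it is available; this is not what w3lib does today):

>>> import lxml.html
>>> tree = lxml.html.fromstring(
...     '<html><head><!-- <base href="http://example.com/" /> --></head></html>')
>>> tree.xpath('//base/@href')
[]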

Introducing "heavier" utilities like lxml would solve this easily, but that might be a bad idea, since w3lib aims to be lightweight and fast.
Or maybe we could implement some quick parser that merely strips out the commented parts, as sketched below.
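
A minimal sketch of that second option (the regex and helper name below are assumptions for illustration, not existing w3lib code):

import re

# Hypothetical pre-processing step: strip <!-- ... --> blocks before the
# existing regular expressions run. DOTALL lets a comment span multiple lines.
_comment_re = re.compile(r'<!--.*?-->', re.DOTALL)

def remove_comments(text):
    """Return `text` with all HTML comments removed."""
    return _comment_re.sub('', text)

With that in place, the example above has nothing left to match:

>>> remove_comments('<!-- <base href="http://example.com/" /> -->')
''

Even this regex is only an approximation (it would mishandle comment-like text inside <script> or CDATA sections, for instance), so a tiny hand-written scanner might still be preferable.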

Any ideas?
