Closed
Description
In w3lib.html
regular expressions are used to parse HTML texts:
_ent_re = re.compile(r'&((?P<named>[a-z\d]+)|#(?P<dec>\d+)|#x(?P<hex>[a-f\d]+))(?P<semicolon>;?)', re.IGNORECASE)
_tag_re = re.compile(r'<[a-zA-Z\/!].*?>', re.DOTALL)
_baseurl_re = re.compile(six.u(r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']'), re.I)
_meta_refresh_re = re.compile(six.u(r'<meta\s[^>]*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P<quote>["\'])(?P<int>(\d*\.)?\d+)\s*;\s*url=\s*(?P<url>.*?)(?P=quote)'), re.DOTALL | re.IGNORECASE)
_cdata_re = re.compile(r'((?P<cdata_s><!\[CDATA\[)(?P<cdata_d>.*?)(?P<cdata_e>\]\]>))', re.DOTALL)
However this is definitely incorrect when it involves commented contents, e.g.
>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'
Introducing "heavier" utilities like lxml
would solve this issue easily, but that might be an awful idea as w3lib
aims to be lightweight & fast.
Or maybe we could implement some quick parser merely for eliminating the commented parts.
Any ideas?