It's not a good idea to parse HTML text using regular expressions #70

Closed
@starrify

Description

In w3lib.html, regular expressions are used to parse HTML text:

import re
import six

_ent_re = re.compile(r'&((?P<named>[a-z\d]+)|#(?P<dec>\d+)|#x(?P<hex>[a-f\d]+))(?P<semicolon>;?)', re.IGNORECASE)
_tag_re = re.compile(r'<[a-zA-Z\/!].*?>', re.DOTALL)
_baseurl_re = re.compile(six.u(r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']'), re.I)
_meta_refresh_re = re.compile(six.u(r'<meta\s[^>]*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P<quote>["\'])(?P<int>(\d*\.)?\d+)\s*;\s*url=\s*(?P<url>.*?)(?P=quote)'), re.DOTALL | re.IGNORECASE)
_cdata_re = re.compile(r'((?P<cdata_s><!\[CDATA\[)(?P<cdata_d>.*?)(?P<cdata_e>\]\]>))', re.DOTALL)

However, this approach gives incorrect results when the input contains commented-out content, e.g.:

>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'
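
For comparison, an actual HTML parser ignores the comment. A quick check with lxml (assuming it is available; this is not what w3lib does today):

>>> import lxml.html
>>> tree = lxml.html.fromstring(
...     '<html><head><!-- <base href="http://example.com/" /> --></head></html>')
>>> tree.xpath('//base/@href')
[]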

Introducing "heavier" utilities like lxml would solve this easily, but that might be a bad idea, since w3lib aims to be lightweight and fast.
Or maybe we could implement some quick parser that merely strips out the commented parts, as sketched below.
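
A minimal sketch of that second option (the regex and helper name below are assumptions for illustration, not existing w3lib code):

import re

# Hypothetical pre-processing step: strip <!-- ... --> blocks before the
# existing regular expressions run. DOTALL lets a comment span multiple lines.
_comment_re = re.compile(r'<!--.*?-->', re.DOTALL)

def remove_comments(text):
    """Return `text` with all HTML comments removed."""
    return _comment_re.sub('', text)

With that in place, the example above has nothing left to match:

>>> remove_comments('<!-- <base href="http://example.com/" /> -->')
''

Even this regex is only an approximation (it would mishandle comment-like text inside <script> or CDATA sections, for instance), so a tiny hand-written scanner might still be preferable.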

Any ideas?
