HTML Entities and Numeric character references in URL #5

@stav

URLs on some sites contain otherwise valid "safe" characters used in an invalid way, and the standard Python library cannot deal with this, so it would be nice if w3lib could. For example, the hash character # normally marks the beginning of the fragment, but a URL may also contain HTML entities such as &reg; or Numeric Character References (NCRs) such as &#174;, which carry a # of their own.

w3lib.url.safe_url_string() uses urllib.quote() with the following safe chars:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()

and the following (invalid) url does not get altered:

>>> url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
>>> assert url == safe_url_string(url)
>>> 
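The URL passes through unchanged because every character in it is either alphanumeric or in the safe set listed above. A minimal sketch that checks this, assuming Python 3's `urllib.parse.quote` as a stand-in for `urllib.quote`; `SAFE_CHARS` and `chars_outside_safe_set` are illustrative names, not part of w3lib:

```python
from urllib.parse import quote

# Non-alphanumeric part of the safe set quoted in this issue
# (letters and digits are never escaped by quote()).
SAFE_CHARS = "_.-%;/?:@&=+$|,#!~*'()"

url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"


def chars_outside_safe_set(text):
    """Return the sorted characters of `text` that quote() would have to escape."""
    return sorted(
        {c for c in text
         if not (c.isascii() and c.isalnum()) and c not in SAFE_CHARS}
    )


print(chars_outside_safe_set(url))         # []
print(quote(url, safe=SAFE_CHARS) == url)  # True
```

Since `&`, `#` and `;` are all in the safe set, the entity and the NCR survive quoting untouched.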

urlparse.urldefrag() is confused:

>>> urlparse.urldefrag(url)
('/Pioneer_Speakers_with_iPod&reg;~iPhone&', '174;_Dock?id=123#ipad')

Since safe_url_string() is used in SgmlLinkExtractor (for example, with canonicalization turned on), the fragment is misinterpreted: the first hash triggers the split, leaving only:

/Pioneer_Speakers_with_iPod&reg;~iPhone&
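One way to avoid the early split is to ignore any # that belongs to an NCR when looking for the fragment separator. The following is only a sketch of that idea, written for Python 3; `urldefrag_ncr_aware` is a hypothetical helper, not an existing w3lib or standard library function:

```python
import re

# Decimal (&#174;) and hexadecimal (&#xAE;) numeric character references.
_NCR_RE = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")


def urldefrag_ncr_aware(url):
    """Split `url` at the first '#' that is not part of an NCR."""
    # Mask each NCR with a same-length placeholder so indices still line up
    # with the original string.
    masked = _NCR_RE.sub(lambda m: "_" * len(m.group(0)), url)
    index = masked.find("#")
    if index == -1:
        return url, ""
    return url[:index], url[index + 1:]


url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
print(urldefrag_ncr_aware(url))
# ('/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123', 'ipad')
```

This only handles the fragment split; the references themselves are still left unescaped in the returned URL.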

Using urllib.quote() directly does not work since it encodes all hashes, including the fragment hash:

>>> print urllib.quote(url)
/Pioneer_Speakers_with_iPod%26reg%3B%7EiPhone%26%23174%3B_Dock%3Fid%3D123%23ipad

What is needed is perhaps an entity/NCR regex that first percent-encodes the references and then applies the safe encoding, so that in the end we get:

/Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
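The two-step approach described above could look roughly like this. It is a sketch under assumptions, not a proposed patch: it targets Python 3 (`urllib.parse.quote` in place of `urllib.quote`), and `_REFERENCE_RE`, `escape_references` and `safe_url_string_with_references` are hypothetical names.

```python
import re
from urllib.parse import quote

# Named entities (&reg;), decimal NCRs (&#174;) and hexadecimal NCRs (&#xAE;).
_REFERENCE_RE = re.compile(
    r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
)

# Non-alphanumeric part of the safe set quoted in this issue.
_SAFE_CHARS = "_.-%;/?:@&=+$|,#!~*'()"


def escape_references(url):
    """Percent-encode every character of each entity/NCR found in `url`."""
    return _REFERENCE_RE.sub(lambda m: quote(m.group(0), safe=""), url)


def safe_url_string_with_references(url):
    """Escape references first, then apply the usual safe-character quoting."""
    # '%' is in the safe set, so the escapes from step one are not re-encoded.
    return quote(escape_references(url), safe=_SAFE_CHARS)


url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
print(safe_url_string_with_references(url))
# /Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
```

One caveat with this pattern: a query string whose parameter happens to look like an entity, such as `?a=1&reg;`, would be escaped as well, so a real implementation might need to restrict the substitution to the path component.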
