Description
URLs on some sites contain otherwise valid "safe" characters in invalid positions, and the standard Python library cannot deal with this, so it would be nice if w3lib could. For example, the hash character # normally marks the beginning of the fragment, but a URL may also contain Numeric Character References (NCRs) such as &#174;, which carry a hash of their own.
w3lib.url.safe_url_string() uses urllib.quote() with the following safe chars:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()
and the following (invalid) url does not get altered:
>>> url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
>>> assert url == safe_url_string(url)
>>>
urlparse.urldefrag() is confused:
>>> urlparse.urldefrag(url)
('/Pioneer_Speakers_with_iPod&reg;~iPhone&', '174;_Dock?id=123#ipad')
Since safe_url_string() is used in SgmlLinkExtractor, the fragment gets misinterpreted there too. With canonicalization turned on, for example, the first hash triggers the slice and the URL is cut down to:
/Pioneer_Speakers_with_iPod&reg;~iPhone&
Using urllib.quote() directly does not work since it encodes all hashes, including the fragment hash:
>>> print urllib.quote(url)
/Pioneer_Speakers_with_iPod%26reg%3B%7EiPhone%26%23174%3B_Dock%3Fid%3D123%23ipad
What is needed is perhaps an entity/NCR regex that first percent-encodes the references, after which the usual safe encoding is applied. In the end we would get:
/Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
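A minimal sketch of that idea follows. It is not w3lib code: the helper name quote_references and the regex are hypothetical, and it is written for Python 3 (urllib.parse.quote) rather than the Python 2 urllib.quote used above. It only shows the first step, percent-encoding each entity or NCR so that its "&", "#" and ";" no longer carry URL meaning.

```python
import re
from urllib.parse import quote

# Hypothetical pattern: named entities (&reg;), decimal NCRs (&#174;)
# and hexadecimal NCRs (&#xAE;).
_REFERENCE_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def quote_references(url):
    """Percent-encode every entity/NCR found in url, leaving the rest untouched."""
    # safe="" forces '&', '#' and ';' inside the match to be encoded as well.
    return _REFERENCE_RE.sub(lambda m: quote(m.group(0), safe=""), url)


url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
fixed = quote_references(url)
print(fixed)
```

After this step the only remaining hash is the real fragment separator, so safe_url_string() and urldefrag() should see "ipad" as the fragment. One caveat with this assumption: a legitimate query string in which a parameter name is directly followed by ";" (for example "?a=1&b;c=2") would also match the pattern, so a real implementation might need to restrict the substitution to the path.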