HTML Entities and Numeric character references in URL

URLs on some [sites](http://www.pcrichard.com/catalog/category.jsp?categoryId=1191&parentCategoryId=7) contain otherwise valid "safe" characters in invalid positions, and the standard Python library cannot deal with this, so it would be nice if w3lib could. For example, the hash character `#` normally marks the beginning of the _fragment_, but a URL may also contain HTML entities or Numeric Character References (NCRs) such as `&#174;`, where the `#` is not a fragment delimiter.

`w3lib.url.safe_url_string()` uses `urllib.quote()` with the following safe chars:

```
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()
```

and the following (invalid) url does not get altered:

```
>>> from w3lib.url import safe_url_string
>>> url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
>>> assert url == safe_url_string(url)
```
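The same result can be reproduced with the standard library alone. The following is a minimal Python 3 sketch, assuming `safe_url_string()` behaves like a plain `quote()` call with the safe characters listed above (letters and digits are omitted from `SAFE_CHARS` because `quote()` never escapes them). It shows that every character of the URL, including both hashes, already counts as safe:

```python
from urllib.parse import quote

# Safe characters as quoted in this issue; assumed to match what
# safe_url_string() passes to quote().
SAFE_CHARS = "%;/?:@&=+$|,#-_.!~*'()"

url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"

# Characters that quote() would escape: anything that is neither an
# ASCII letter/digit nor part of the safe set.
unsafe = sorted(
    {c for c in url if not (c.isascii() and c.isalnum()) and c not in SAFE_CHARS}
)

print(unsafe)                              # nothing left to escape
print(quote(url, safe=SAFE_CHARS) == url)  # the URL passes through unchanged
```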

`urlparse.urldefrag()` is confused:

```
>>> import urlparse
>>> urlparse.urldefrag(url)
('/Pioneer_Speakers_with_iPod&reg;~iPhone&', '174;_Dock?id=123#ipad')
```

`safe_url_string()` is used in `SgmlLinkExtractor`, so with canonicalization turned on, for example, the fragment is misinterpreted: the first hash triggers the slice and the URL is truncated to:

```
/Pioneer_Speakers_with_iPod&reg;~iPhone&
```

Using `urllib.quote()` directly does not work either, since it encodes every hash, including the real fragment hash (as well as `?` and `=`):

```
>>> import urllib
>>> print urllib.quote(url)
/Pioneer_Speakers_with_iPod%26reg%3B%7EiPhone%26%23174%3B_Dock%3Fid%3D123%23ipad
```

What is needed is perhaps an entity/NCR regex that first converts (percent-encodes) the references and then applies the safe encoding, so that in the end we get:

```
/Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
```
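One way this could look is sketched below in Python 3. It is only an illustration of the idea, not existing w3lib code: the names `REFERENCE_RE`, `escape_references()` and `safe_url_string_with_references()` are hypothetical, and `SAFE_CHARS` is assumed to be the safe set quoted earlier. The regex matches named entities, decimal NCRs and hexadecimal NCRs, and each match is fully percent-encoded before the usual safe quoting runs.

```python
import re
from urllib.parse import quote, urldefrag

# Safe set from this issue (assumption: same as in safe_url_string()).
SAFE_CHARS = "%;/?:@&=+$|,#-_.!~*'()"

# Hypothetical pattern: &name; or &#123; or &#x7B;
REFERENCE_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def escape_references(url):
    """Percent-encode each entity/NCR so its '&', '#' and ';' lose their URL meaning."""
    return REFERENCE_RE.sub(lambda m: quote(m.group(0), safe=""), url)


def safe_url_string_with_references(url):
    """Escape references first, then apply the regular safe quoting.

    '%' is in SAFE_CHARS, so the escapes produced above are not encoded twice.
    """
    return quote(escape_references(url), safe=SAFE_CHARS)


url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
fixed = safe_url_string_with_references(url)
print(fixed)
print(urldefrag(fixed).fragment)
```

With the example URL this yields the target string shown above, and `urldefrag()` then finds only the real fragment, `ipad`. One caveat of this sketch: a query string that legitimately contains something shaped like `&name;` would be escaped as well, so a real implementation might want to restrict the match to known entity names.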

