`canonicalize_url` use of `safe_url_string` breaks when an encoded hash character is encountered

`canonicalize_url` will decode all percent encoded elements in a string.

If a hash (`#`) is present as a percent-encoded entity (`%23`), it is decoded as well. It should not be: `#` is a delimiter under RFC 3986, and decoding it fundamentally changes the URL structure by turning the subsequent characters into a fragment. One effect is that the canonical form of a URL with a safely encoded hash points to a different URL; another is that running `canonicalize_url` on its own output returns yet another URL.
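The structural change can be seen with the standard library alone. This is a minimal sketch using `urllib.parse` (Python 3) rather than w3lib; the example URL is made up for illustration, and `unquote` stands in for any routine that decodes every escape.

```python
from urllib.parse import urlsplit, unquote

# Illustrative URL (not from the report): "%23" is an encoded "#" inside the path.
encoded = "https://example.com/path/to/issue%2376"
decoded = unquote(encoded)  # naive decoding of every percent-escape

before = urlsplit(encoded)
after = urlsplit(decoded)

print(before.path, repr(before.fragment))  # /path/to/issue%2376 ''
print(after.path, repr(after.fragment))    # /path/to/issue '76'
```

After decoding, the `76` that belonged to the path is parsed as a fragment, so the two strings no longer identify the same resource.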

example:

	>>> import w3lib.url
	>>> url = "https://example.com/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017#bash"
	>>> canonical = w3lib.url.canonicalize_url(url)
	>>> print(canonical)
	https://example.com/path/to/foo%20bar:%20biz%20#76,%20bang%202017
	>>> canonical2 = w3lib.url.canonicalize_url(canonical)
	>>> print(canonical2)
	https://example.com/path/to/foo%20bar:%20biz%20

What `canonical` presents as a fragment, `#76,%20bang%202017`, is actually part of the original URL's path, not a fragment, and it is discarded when `canonicalize_url` is run again.
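One possible direction, sketched under stated assumptions and not taken from w3lib's code: decode escapes selectively and leave structure-changing delimiters encoded. The helper name `unquote_preserving_delimiters` and the `KEEP_ENCODED` set are hypothetical, and which characters belong in that set is a judgment call.

```python
import re

# Assumption: characters that must stay percent-encoded in a path because
# decoding them would change how the URL is split.
KEEP_ENCODED = frozenset("#?/%")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unquote_preserving_delimiters(text, keep=KEEP_ENCODED):
    """Hypothetical helper: decode visible-ASCII escapes except those in `keep`."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Spaces, control bytes and non-ASCII bytes stay encoded, as do delimiters.
        if char in keep or not (0x21 <= ord(char) <= 0x7E):
            return "%" + match.group(1).upper()
        return char
    return _ESCAPE.sub(replace, text)


path = "/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017"
once = unquote_preserving_delimiters(path)
twice = unquote_preserving_delimiters(once)

print(once)           # /path/to/foo%20bar:%20biz%20%2376,%20bang%202017
print(once == twice)  # True
```

With `%23` left intact, a second pass returns the same string, which is the idempotence the report expects from a canonical form.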

references:
* https://github.com/scrapy/w3lib/issues/5

