Should be possible to use Ruby's URI implementation instead of Addressable::URI #27

Open
gh2k opened this issue May 13, 2014 · 3 comments

Comments

@gh2k
Contributor

gh2k commented May 13, 2014

Sometimes Addressable::URI mangles URLs into something incorrect. See: sporkmonger/addressable#160

When cobweb crawls one of these, the correct URL is put into Redis, but the normalized URL returns a 404 when requested. An example would be any URI containing "%e2%80%b3".

Ruby's URI implementation doesn't do this. It would be nice to have an option of using this class instead.
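
As a rough illustration of the claim above, here is a minimal sketch using only Ruby's standard `uri` library. The URL is a made-up example (`example.com` is not from the original report), and the behaviour shown is for the stdlib `URI#normalize`, which only lowercases the scheme and host and fills in an empty path; it is not a statement about what Addressable does.

```ruby
require "uri"

# Hypothetical URL containing the percent-encoded sequence from the report.
url = "http://example.com/page%e2%80%b3"

# Stdlib normalization leaves a non-empty path, including its
# percent-escapes, untouched.
normalized = URI.parse(url).normalize.to_s

puts normalized
```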

@stewartmckee
Owner

Thanks. Hopefully they've fixed Ruby's URI implementation; it was badly broken some years ago, and since then I've used Addressable as it's been the most reliable. It would be good to add some options, or even switch to Ruby's URI if it is now true to the RFC.

@gh2k
Contributor Author

gh2k commented May 14, 2014

Do you know a good way to test the URI implementation in 2.1 against the RFC?

@sporkmonger

Just closed sporkmonger/addressable#160 as "won't fix", but wanted to comment here because I suspect cobweb is actually misusing Addressable, given that this issue came up.

The uri.normalize method's output should not generally be used directly to query a web server. Instead, you want to use it as more of a lookup key for caches, previously crawled URLs, etc. So you'd use uri to query your web server and uri.normalize as the key under which you record what the web server gives you back. But you'd never want to make a request to the web server with the output of uri.normalize, because it's an intentionally lossy operation that's primarily meant for lookups and equality testing (where both sides of the URI equality being tested get normalized and then compared). Probably 95+% of the time it'll work fine to do it the wrong way, but then every once in a while it'll bite you.
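
The pattern described above could be sketched roughly as follows. This is only an illustration, not cobweb's actual code: `SeenUrls`, `key_for` and `claim` are hypothetical names, and the sketch uses Ruby's standard `uri` library so that it runs without any gems. The same idea should carry over to `Addressable::URI#normalize`, with the normalized form used purely as a key.

```ruby
require "uri"
require "set"

# Hypothetical bookkeeping for a crawler: the normalized URL is only
# ever used as a lookup key, never as the URL that gets requested.
class SeenUrls
  def initialize
    @seen = Set.new
  end

  # Lossy key used for equality testing and lookups.
  def key_for(url)
    URI.parse(url).normalize.to_s
  end

  # Returns the ORIGINAL url if no equivalent URL has been recorded yet,
  # or nil if an equivalent one was already claimed.
  def claim(url)
    @seen.add?(key_for(url)) ? url : nil
  end
end

seen = SeenUrls.new
first  = seen.claim("http://Example.com")   # new, so request this exact string
second = seen.claim("http://example.com/")  # same key after normalization
```

Here the crawler would request whatever `claim` returns (the URL as originally found) and skip the fetch when it returns `nil`.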

Sorry about getting to this so late after the issue was opened.

3 participants