Redirect + normalization problem #10

Closed
goodsign opened this issue Jan 13, 2013 · 1 comment


@goodsign

Hi Martin!

I'm running into a problem, but I'm not sure whether it is a single issue or a combination of several. I'll explain what I'm seeing, and I'd be grateful for your comments, because it may not even be a bug.

For example, let's crawl 'http://golang.org'. If you look at the golang.org page source, you'll see links like /pkg/, /doc/, etc.

gocrawl resolves these links to absolute URLs and normalizes them, so for /pkg/ I get 'http://golang.org/pkg' (the default purell flag set is 'all greedy', so I lose the trailing slash).

If you visit 'http://golang.org/pkg' (even just in your browser), you'll see that it redirects you to '/pkg/' (exactly where the initial link pointed).

First problem

Here is the first problem, shown by an excerpt of the gocrawl log (I removed the unnecessary parts):

enqueue: http://golang.org/pkg
...
worker 1 - popped: http://golang.org/pkg
...
worker 1 - redirect to /pkg/
...
receive url /pkg/
ignore on absolute policy: /pkg

So it seems that the redirect URL doesn't get resolved to an absolute one the way the original link was. I checked your code and, if I'm not mistaken, the only resolving logic is in worker.processLinks. So the resolving step appears to be missing somewhere in the redirect logic.
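
To illustrate the missing step, here is a minimal sketch that uses only the standard net/url package. The helper name resolveRedirect is made up for this example and is not part of gocrawl; it resolves the redirect location against the URL that was originally requested.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveRedirect is a hypothetical helper (not a gocrawl function).
// It turns the Location value of a redirect into an absolute URL,
// using the URL of the redirected request as the base.
func resolveRedirect(requested, location string) (string, error) {
	base, err := url.Parse(requested)
	if err != nil {
		return "", err
	}
	loc, err := url.Parse(location)
	if err != nil {
		return "", err
	}
	// ResolveReference leaves an already absolute location untouched.
	return base.ResolveReference(loc).String(), nil
}

func main() {
	abs, err := resolveRedirect("http://golang.org/pkg", "/pkg/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(abs) // prints "http://golang.org/pkg/"
}
```

With a step like this, the '/pkg/' from the log above would arrive as an absolute URL and would no longer be rejected by the absolute-URL check.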

Second problem

Even if the redirect URL were resolved to an absolute one, it would still be normalized unless I change URLNormalizationFlags. So the trailing slash would always be removed (in the log we 'receive /pkg/' and 'ignore /pkg'), and we would be redirected endlessly: golang.org/pkg redirects to golang.org/pkg/, which is normalized back to golang.org/pkg, which redirects again, and so on.
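
The cycle can be reproduced in a small simulation. This is only a sketch under assumptions: stripTrailingSlash stands in for the trailing-slash removal of the greedy flags (it is not purell's implementation), and the redirects map imitates the server behavior described above. No real HTTP request is made.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTrailingSlash is a toy stand-in for trailing-slash normalization.
func stripTrailingSlash(u string) string {
	return strings.TrimSuffix(u, "/")
}

// redirects imitates the server: the slash-less URL redirects to the
// slashed one (assumed from the behavior described in this issue).
var redirects = map[string]string{
	"http://golang.org/pkg": "http://golang.org/pkg/",
}

// follow requests start and normalizes every redirect target before
// requesting it. It reports the URLs requested and whether the next
// URL to request had already been requested (a cycle).
func follow(start string, normalize func(string) string, maxHops int) ([]string, bool) {
	seen := make(map[string]bool)
	var hops []string
	cur := start
	for i := 0; i < maxHops; i++ {
		if seen[cur] {
			return hops, true
		}
		seen[cur] = true
		hops = append(hops, cur)
		target, ok := redirects[cur]
		if !ok {
			return hops, false // no redirect: the page was served
		}
		cur = normalize(target)
	}
	return hops, false
}

func main() {
	_, looped := follow("http://golang.org/pkg", stripTrailingSlash, 5)
	fmt.Println("with slash removal, cycle:", looped)

	keep := func(u string) string { return u }
	hops, looped := follow("http://golang.org/pkg", keep, 5)
	fmt.Println("without slash removal, cycle:", looped, "requests:", len(hops))
}
```

With the slash removed, the normalized redirect target is the URL that was just requested, so the crawl never reaches the page; when the target is left as-is, the second request succeeds.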

My temporary solution

I've temporarily worked around this by turning off the trailing-slash normalization, so I've set

opts.URLNormalizationFlags = purell.FlagsAllGreedy & (^purell.FlagRemoveTrailingSlash)

and everything went fine.

Fix proposal and discussion

Maybe some other normalization flag should be chosen as the default?

Or maybe it is even better to change the strategy a bit:

  • Pass the normalized URL to Filter,
  • After Filter returns 'true', fetch the original URL as-is.

Personally I prefer the latter, because it rules out the case where normalization changes the URL and the website responds differently to the modified one (for example, with a redirect back to the original).
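
A rough sketch of that second strategy, to make the idea concrete. All names here (normalizeKey, planFetches) are hypothetical and not gocrawl's API, and the normalizer is a toy that only strips a trailing slash. The normalized form is used for filtering and de-duplication, while the URL that is fetched stays exactly as it was found.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeKey is a toy normalizer used only for this example.
func normalizeKey(raw string) string {
	return strings.TrimSuffix(raw, "/")
}

// planFetches returns, in input order, the original URLs to fetch.
// The filter and the duplicate check see only the normalized form;
// the returned URLs are untouched.
func planFetches(found []string, filter func(normalized string) bool) []string {
	seen := make(map[string]bool)
	var toFetch []string
	for _, orig := range found {
		key := normalizeKey(orig)
		if seen[key] || !filter(key) {
			continue
		}
		seen[key] = true
		toFetch = append(toFetch, orig)
	}
	return toFetch
}

func main() {
	found := []string{
		"http://golang.org/pkg/",
		"http://golang.org/pkg", // same page after normalization
		"http://golang.org/doc/",
	}
	allowAll := func(string) bool { return true }
	for _, u := range planFetches(found, allowAll) {
		fmt.Println("fetch:", u)
	}
}
```

Here the two /pkg variants collapse into one visit, but the request still goes to 'http://golang.org/pkg/' with its trailing slash, so the server has no reason to redirect.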

What do you think? Tell me if I'm missing something here.

@mna
Member

mna commented Jan 13, 2013

Hi,

Thanks for the detailed information. I did run into something similar (normalization made the request fail because the website did not allow non-www), and I used a different normalization to make it work. That website did not redirect to the original (non-normalized) URL, though, so I didn't think about this possible circular problem. Your proposal makes sense as far as I'm concerned.

As for the redirect, I wrongly assumed that the new location was always absolute.

Let me check this all in context in the coming days, but this feels right.

Thanks!
