I'm currently having a problem, but I'm not sure what I should focus on, or whether it is actually a combination of problems. I'll try to explain what I'm encountering, and I'd be thankful if you could leave some comments, because maybe it's not even a bug.
Okay, for example, let's crawl 'http://golang.org': if you look at the golang.org source code, you'll see links like /pkg/, /doc/, etc.
These links get resolved to absolute URLs and normalized by gocrawl, so for /pkg/ I get 'http://golang.org/pkg' (the default purell flag is 'all greedy', so I lose the trailing slash).
If you visit 'http://golang.org/pkg' (even just in your browser), you'll see that it redirects you to '/pkg/' (exactly where the initial link points).
First problem
And here is the first problem, which is shown by a piece of the gocrawl log (I removed the unnecessary parts):
So it seems that the redirect URL doesn't get resolved to an absolute one the way the original link was. I checked your code and, if I'm not mistaken, the resolving logic exists only in worker.processLinks. So it seems that resolving is missing somewhere in the redirect logic.
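To illustrate the missing step, here is a minimal sketch of resolving a relative redirect target against the URL that was requested, using only the standard net/url package. The function name resolveRedirect is hypothetical and is not part of gocrawl; it only shows the kind of resolution that appears to be applied to page links but not to redirects.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveRedirect (hypothetical helper, not gocrawl API) turns the value of
// a redirect target into an absolute URL by resolving it against the URL of
// the request that produced the redirect.
func resolveRedirect(requestURL, location string) (string, error) {
	base, err := url.Parse(requestURL)
	if err != nil {
		return "", err
	}
	loc, err := url.Parse(location)
	if err != nil {
		return "", err
	}
	// ResolveReference leaves already-absolute URLs unchanged.
	return base.ResolveReference(loc).String(), nil
}

func main() {
	abs, err := resolveRedirect("http://golang.org/pkg", "/pkg/")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs) // http://golang.org/pkg/
}
```

Because absolute targets pass through unchanged, the same call could be applied to every redirect without first checking whether the target is relative.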
Second problem
Even if the redirect URL were resolved to an absolute one, it would still be normalized unless I change URLNormalizationFlags. So the trailing slash would always be removed (in the log we see 'receive /pkg/' and then 'ignore /pkg'), and we would be redirected indefinitely: golang.org/pkg redirects to golang.org/pkg/, which is normalized back to golang.org/pkg, which redirects again, and so on.
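The cycle can be reproduced without any network access. The following is only a simulation under stated assumptions: stripTrailingSlash is a hypothetical stand-in for the trailing-slash part of the normalization, the redirects map is a fake server that sends /pkg to /pkg/, and follow is an illustrative helper, not gocrawl code.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTrailingSlash is a hypothetical stand-in for a normalizer that
// removes the trailing slash from a URL.
func stripTrailingSlash(u string) string { return strings.TrimSuffix(u, "/") }

// identity leaves the URL untouched.
func identity(u string) string { return u }

// follow simulates a crawler that normalizes every URL before requesting it.
// redirects maps a requested URL to its redirect target. It returns the last
// URL considered and whether a non-redirecting URL was reached within maxHops.
func follow(start string, redirects map[string]string, normalize func(string) string, maxHops int) (string, bool) {
	current := normalize(start)
	for hop := 0; hop < maxHops; hop++ {
		target, redirected := redirects[current]
		if !redirected {
			return current, true
		}
		current = normalize(target)
	}
	return current, false
}

func main() {
	// Fake server behaviour: the slash-less URL redirects to the slashed one.
	redirects := map[string]string{"http://golang.org/pkg": "http://golang.org/pkg/"}

	_, ok := follow("http://golang.org/pkg/", redirects, stripTrailingSlash, 10)
	fmt.Println("with slash removal, settled:", ok)

	final, ok := follow("http://golang.org/pkg", redirects, identity, 10)
	fmt.Println("without normalization, settled:", ok, final)
}
```

With slash removal the simulated crawler never settles within the hop limit; with the URL left as-is it settles on the slashed URL after one redirect.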
My temporary solution
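I've temporarily worked around this by avoiding any slash-related normalization: I set URLNormalizationFlags so that no slash-related flag is applied, and everything went fine.
Fix proposal and discussion
Maybe some other normalization flag should be chosen as the default?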
Or maybe it would be even better to change the strategy a bit:
Pass the normalized URL to Filter,
After Filter returns 'true', fetch the original URL as-is.
Personally I like the latter, because it rules out the situation where normalization changes the URL and the website responds differently to the modified one (for example, with a redirect back to the original).
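The proposed strategy could be sketched roughly as follows. Everything here is an assumption made for illustration: normalizeURL is a simplified stand-in for the real normalization, and the scheduler type with its accept method plays the role of Filter by rejecting URLs whose normalized form was already seen. None of these names come from gocrawl.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeURL is a hypothetical, simplified normalizer that only strips a
// trailing slash.
func normalizeURL(u string) string { return strings.TrimSuffix(u, "/") }

// scheduler remembers the normalized form of every URL it has accepted.
type scheduler struct {
	seen map[string]bool
}

func newScheduler() *scheduler { return &scheduler{seen: make(map[string]bool)} }

// accept decides on the normalized URL (the role Filter plays in the
// proposal) but returns the original URL, untouched, as the one to fetch.
func (s *scheduler) accept(original string) (string, bool) {
	key := normalizeURL(original)
	if s.seen[key] {
		return "", false
	}
	s.seen[key] = true
	return original, true
}

func main() {
	s := newScheduler()
	links := []string{"http://golang.org/pkg/", "http://golang.org/pkg", "http://golang.org/doc/"}
	for _, link := range links {
		if u, ok := s.accept(link); ok {
			fmt.Println("fetch", u)
		} else {
			fmt.Println("skip", link)
		}
	}
}
```

In this sketch the normalized form is used only as a deduplication key, so /pkg/ is requested with its trailing slash intact and the slash-less variant is skipped as already seen.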
What do you think? Tell me if I'm missing something here.
Thanks for the detailed information. I did run into something similar (normalization made the request fail, the website did not allow non-www), and I used a different normalization to make it work, but the website did not redirect to the original (non-normalized) URL, so I didn't think about this possible circular problem. Your proposal makes sense as far as I'm concerned.
As for the redirect, I wrongly assumed that the new location was always absolute.
Let me check this all in context in the coming days, but this feels right.