Should it be possible to add "depth" in the data hash ? #28

Open
ABrisset opened this issue May 13, 2014 · 3 comments

Comments

@ABrisset

Hello,

As far as I can see, the generated hash for each page doesn't include "depth" information, that is, how many clicks each page is from the homepage.
Do you think it would be possible to add this to the hash?
By the way, I really appreciate your gem. Good work, Stewart!

Thanks.

@ABrisset changed the title from "Should it be possible to add "depth" in the date available" to "Should it be possible to add "depth" in the data hash ?" on May 14, 2014
@stewartmckee
Owner

I'm assuming you mean minimum depth. One of the misconceptions about navigation is that there is only one way to reach a page. The depth of a page can differ depending on the route you take to get to it. Also, where is the homepage? Is it the page you started the crawl from, or the page with the shortest URL?

If we took it to be the first page crawled and passed a depth number down with the crawl, the results would not be guaranteed to be accurate, because each page is only processed once. If a page was linked to from the homepage (depth 1) but was actually crawled via a sub page of the homepage, it would be given a depth of 2.
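To make that concrete, here is a minimal sketch of the problem. It does not use Cobweb's API; the `toy_links` hash and the `crawl_order_depths` helper are hypothetical, and the depth-first order simply stands in for "a sub page happened to be processed first".

```ruby
# Hypothetical link graph: page => pages it links to.
toy_links = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b"],
  "/b" => []
}

# Record the depth at which each page is first processed. Links are
# followed depth-first, so a sub page's links are handled before the
# remaining links of its parent.
def crawl_order_depths(links, url, depth = 0, seen = {})
  return seen if seen.key?(url) # each page is only processed once

  seen[url] = depth
  links.fetch(url, []).each do |child|
    crawl_order_depths(links, child, depth + 1, seen)
  end
  seen
end

depths = crawl_order_depths(toy_links, "/")
# "/b" is linked directly from "/" (one click away), but it was first
# reached through "/a", so it is recorded with depth 2.
puts depths.inspect
```

With a different processing order the same page could be recorded at depth 1, which is why a depth passed down during the crawl is not reliable.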

It's something to think about. I suppose if you specified a page as the root and then, after the crawl completed, processed all crawled pages to find the shortest route (we have the data for that), that would give the most accurate results. But again, HTML navigation is not a tree structure; it's a node graph with multiple parents and interconnections.
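The post-crawl approach described above could be sketched as a breadth-first search from the chosen root. This is only an illustration under assumptions: the `site_links` hash (page => outgoing links) and the `minimum_depths` helper are hypothetical and are not Cobweb's actual data format or API.

```ruby
# Breadth-first search from a chosen root page. Because pages are
# visited in order of increasing distance, the first time a page is
# reached is via a shortest route.
def minimum_depths(links, root)
  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |child|
      next if depths.key?(child) # already reached by a route at least as short

      depths[child] = depths[url] + 1
      queue << child
    end
  end
  depths
end

# Hypothetical crawl result with a cycle ("/b" links back to "/") and a
# page with two parents ("/b" is linked from both "/" and "/a").
site_links = {
  "/"       => ["/a", "/b"],
  "/a"      => ["/b", "/a/deep"],
  "/b"      => ["/"],
  "/a/deep" => []
}

min_depths = minimum_depths(site_links, "/")
puts min_depths.inspect
```

Pages that cannot be reached from the chosen root simply do not appear in the result, so the answer still depends on which page is treated as the homepage.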

@nikhgupta

That's correct, and it would be inaccurate to report depth while processing the content. However, is there a way we can limit the crawl to a certain depth?

Let's say we start from the seed URL and only want to go 2 pages deep within the navigation. Is that possible with CobWeb? It is certainly possible with the Anemone crawler, but that is an old gem now. I love the way CobWeb uses Sidekiq/Resque jobs, and would really like to be able to limit the crawl depth.

By the way, thanks again for the awesome gem. Really useful.

@colnpanic

I agree on both points: this is a really cool gem 👍 and I would like to have a "max_depth" option. I totally understand that we're not dealing with tree data and that "depth" is relative, but it would still be useful. The nice thing is that it would give you a chance to do a quick test of the "core" links from a page, following just a couple, so you can preview some results without waiting for the whole site to process.
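For illustration, a depth-limited crawl of the kind requested here might look roughly like the sketch below. It is not Cobweb code, and nothing in this thread confirms that Cobweb has such an option: `crawl_with_max_depth`, the `max_depth` parameter, and the `fetch_links` callable are all hypothetical names, and a hash stands in for real HTTP fetching.

```ruby
# Breadth-first crawl that stops following links once a page sits at
# max_depth clicks from the seed.
# fetch_links is a hypothetical callable: url -> array of linked urls.
def crawl_with_max_depth(seed, max_depth, fetch_links)
  depths = { seed => 0 }
  queue = [seed]
  until queue.empty?
    url = queue.shift
    next if depths[url] >= max_depth # pages at the limit are not expanded

    fetch_links.call(url).each do |child|
      next if depths.key?(child) # each page is only queued once

      depths[child] = depths[url] + 1
      queue << child
    end
  end
  depths
end

# Stand-in for a real site: a chain of pages three clicks long.
fake_site = {
  "/"  => ["/a"],
  "/a" => ["/b"],
  "/b" => ["/c"],
  "/c" => []
}
fetch = ->(url) { fake_site.fetch(url, []) }

visited = crawl_with_max_depth("/", 2, fetch)
puts visited.keys.inspect # "/c" is three clicks away and is never visited
```

Because the queue is processed in first-in, first-out order by a single worker, the depth here is the minimum depth from the seed. With jobs running concurrently on Sidekiq or Resque, processing order is no longer guaranteed, so the recorded depth would only be an upper bound.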

4 participants