-
Hi! I have some questions about
The problem, I think, is that I don't understand the "trans" and "speculative" hops very well, even though I've read the wiki post about them. Thank you!
Replies: 6 comments
-
The first important thing to understand is that TransclusionDecideRule, as used in the default config, is an ACCEPT rule, not a REJECT rule. This means it allows URIs that would otherwise be rejected to be accepted. In other words, it strictly only widens the scope. If a URI is already accepted by another rule, such as by being in SURT scope, it will have no effect on it.

For the purposes of the maxTransHops setting, a transclusion hop is any hop that is not a regular navigation link ('L'), a form submission ('S') or a site-map ('M') link.

A speculative hop ('X') is where Heritrix finds something that looks like a URL in JavaScript source. Heritrix is not able to understand JavaScript code, so it's speculating that it might be a URL based on some simple heuristics.
In the default configuration, if maxTransHops is set to 0 then any URIs outside the SURT scope will be excluded. For example, imagine a page at http://site1/pages/home.html containing:
<img src=/pages/1.jpg>
<img src=/images/2.jpg>
<img src=http://site2/3.jpg>
<a href=/pages/4.html>
<a href=/other/5.html>
<a href=http://site2/6.html>
If maxTransHops is the default value of 2 then the following URIs will be visited:
If maxTransHops is 0 then the following URIs will be visited:
Setting maxSpeculativeHops to 0 will mean the TransclusionDecideRule will have no effect on URIs with X (speculative JavaScript) hops anywhere in their hop path (as defined above). In the example I showed above it will have no impact at all as there's no JavaScript in the page.
I don't really understand this question. The purpose of the TransclusionDecideRule in the default config is to capture transcluded content, such as the images and stylesheets needed to render an HTML page that's in scope, even if the transcluded content itself would otherwise be out of scope. My assumption is that the purpose of the maxSpeculativeHops setting is to prevent it from doing this when too much speculative JavaScript extraction is involved, as speculative hops can often themselves be HTML pages (error pages etc.), which can result in the inclusion of a whole chain of irrelevant junk. But I'm just guessing. I don't think I would have designed it like that myself.
Setting maxTransHops to 0 is effectively the same as disabling the TransclusionDecideRule entirely. It means scope will be determined strictly by the other decide rules which in the default config means only the acceptSurts rule and PrerequisiteAcceptDecideRule (robots.txt and DNS fetches).
I'm not surprised, as that wiki page seems wrong or at least very misleading when it comes to speculative hops and the maxSpeculativeHops setting. (I know it has my name on it, but that's because I migrated it from an older wiki. I didn't write most of the wiki pages.)
-
It's hard to explain the relationship between the two settings in prose. If you can read Java, I recommend looking at the code:
// too many speculative hops disqualify from transclusion
if (specCount > getMaxSpeculativeHops()) {
return false;
}
// transclusion applies as long as non-ref hops less than max
return nonrefCount <= getMaxTransHops();
maxSpeculativeHops is not independent of maxTransHops. If maxTransHops is 0 the rule will never match even if maxSpeculativeHops is positive.
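To make the interaction between the two settings easier to follow, here is a minimal, self-contained sketch of how the counts in the snippet above could be derived from a hop path. It is not the Heritrix source: the class and method names are hypothetical, the counting loop is reconstructed from the description in this thread (walk back from the end of the hop path until an 'L', 'S' or 'M' hop), and treating 'R' as the hop that the "non-ref" comment refers to is an assumption.

```java
public class TransclusionSketch {

    /**
     * Hypothetical stand-in for the rule's check. hopPath is the hop path
     * from the seed, for example "LE" or "LLX".
     */
    public static boolean matches(String hopPath, int maxTransHops, int maxSpeculativeHops) {
        if (hopPath == null || hopPath.isEmpty()) {
            return false; // an empty path has no trailing hops to count
        }
        int allCount = 0;    // trailing hops that are not L, S or M
        int nonrefCount = 0; // trailing hops other than 'R' (assumption, see lead-in)
        int specCount = 0;   // trailing speculative ('X') hops
        for (int i = hopPath.length() - 1; i >= 0; i--) {
            char c = hopPath.charAt(i);
            if (c == 'L' || c == 'S' || c == 'M') {
                break; // a navigation, submit or site-map hop ends the run
            }
            allCount++;
            if (c != 'R') {
                nonrefCount++;
            }
            if (c == 'X') {
                specCount++;
            }
        }
        if (allCount == 0) {
            return false; // the last hop was a plain link, so nothing is transcluded
        }
        // too many speculative hops disqualify from transclusion
        if (specCount > maxSpeculativeHops) {
            return false;
        }
        // transclusion applies as long as non-ref hops do not exceed the max
        return nonrefCount <= maxTransHops;
    }

    public static void main(String[] args) {
        System.out.println(matches("LE", 2, 1));   // true: one embed hop
        System.out.println(matches("LL", 2, 1));   // false: ends in a navigation link
        System.out.println(matches("LEEE", 2, 1)); // false: three hops exceed maxTransHops
        System.out.println(matches("LX", 2, 0));   // false: speculative hop not allowed
        System.out.println(matches("LE", 0, 1));   // false: maxTransHops of 0
    }
}
```

In this sketch the speculative limit can only reject a path; it never makes a path match that the maxTransHops check would refuse.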
-
First things first: thank you for the detailed explanation! :) I've been analyzing the code, and I think I don't fully understand the example you proposed. If I understand correctly, if we set "http://site1/pages/home.html" as the seed, the hop path would be "L", the links that are images would be "LE" (link + embed), and the regular links would be "LL" (link + link). Then,
I understand why you said that It is very likely that I didn't understand the explanation, even though it was very good! Now I understand the 2 different types of hops which are used in this module. Thank you! Now I understand that some of the questions I asked made no sense at all. Sorry for that. I thought that this decide module was intended for something else, but now I get that it is intended to fetch URIs from embedded elements or similar, and the speculative limit is just there to avoid going too far with speculation.
-
I've noticed that the default configuration contains:
<!-- ...but REJECT those more than a configured link-hop-count from start... -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<!-- <property name="maxHops" value="20" /> -->
</bean>
<!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
Since the crawl scope rules are applied sequentially, couldn't it happen that
-
My apologies. I neglected to mention the SURT scope in my example. If the seed URL was
Huh. That's a good point. So you're suggesting a hop path like
My guess is that this is intentional. It's not uncommon to do shallow crawls by setting maxHops to a small value like 1 or 2, and in the typical web archiving use case Heritrix was designed for, you'd not want to capture HTML pages without the embedded resources needed to render them.
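The effect of the rule order can be illustrated with a small sketch. It assumes that rules are evaluated in sequence and that the last rule with an opinion decides the outcome; the class, the `Candidate` type, and the simplified rules below are hypothetical stand-ins, not Heritrix classes. Hop counting is reduced to the hop path length and speculative hops are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class DecideSequenceSketch {

    public enum Decision { ACCEPT, REJECT, NONE }

    /** Hypothetical candidate URI: its hop path and whether it is in SURT scope. */
    public static final class Candidate {
        final String hopPath;
        final boolean inSurtScope;

        public Candidate(String hopPath, boolean inSurtScope) {
            this.hopPath = hopPath;
            this.inSurtScope = inSurtScope;
        }
    }

    public interface Rule {
        Decision decide(Candidate c);
    }

    /** Counts trailing hops that are not L, S or M (simplified, no speculative limit). */
    static int trailingTransHops(String hopPath) {
        int n = 0;
        for (int i = hopPath.length() - 1; i >= 0; i--) {
            char c = hopPath.charAt(i);
            if (c == 'L' || c == 'S' || c == 'M') {
                break;
            }
            n++;
        }
        return n;
    }

    /** Builds a simplified rule sequence in the order discussed in this thread. */
    public static List<Rule> rules(int maxHops, int maxTransHops) {
        List<Rule> rules = new ArrayList<>();
        // start by rejecting everything
        rules.add(c -> Decision.REJECT);
        // ACCEPT anything in SURT scope
        rules.add(c -> c.inSurtScope ? Decision.ACCEPT : Decision.NONE);
        // REJECT anything too many hops from the seed
        rules.add(c -> c.hopPath.length() > maxHops ? Decision.REJECT : Decision.NONE);
        // ACCEPT transcluded content within maxTransHops
        rules.add(c -> {
            int t = trailingTransHops(c.hopPath);
            return (t > 0 && t <= maxTransHops) ? Decision.ACCEPT : Decision.NONE;
        });
        return rules;
    }

    /** The last rule that returns something other than NONE wins. */
    public static Decision decide(List<Rule> rules, Candidate c) {
        Decision result = Decision.NONE;
        for (Rule r : rules) {
            Decision d = r.decide(c);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> shallow = rules(1, 2); // maxHops = 1, maxTransHops = 2
        System.out.println(decide(shallow, new Candidate("L", true)));   // ACCEPT
        System.out.println(decide(shallow, new Candidate("LL", true)));  // REJECT: too many hops
        System.out.println(decide(shallow, new Candidate("LE", true)));  // ACCEPT: embed re-accepted
    }
}
```

Because the transclusion rule comes after the hop limit in this sketch, an embedded resource of a page at the hop limit is rejected by the hop-limit rule and then accepted again, which matches the shallow-crawl behaviour described above.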
-
Ohh, ok, I think I got it! So,
Now everything makes sense!
Oh, OK, I hadn't thought of it that way. Since the intention of the Internet Archive is to render the crawled content, it makes complete sense for the rules to be in the exact order they are. I was thinking of the case where all you need is text, which is my use case, and I don't think I need to download extra content. Thank you for all the support! It has helped me a lot to better understand how URIs are accepted or rejected in Heritrix!