-
I am aware of the "Common Heritrix Use Cases" documentation in the wiki on mirroring only HTML files or excluding rich media. Still, I can't get my job to work: it should simply not download and/or write PDF files (and the few ZIPs) to the WARC. The site I am crawling has tons of PDF files in databases (meeting notes, government decisions, policy reports, etc.), and I want to safely exclude them. What usually works is this bean:
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="listLogicalOr" value="true" />
<property name="regexList"> <!-- Adjust the list after log analysis; possibly move it to an external file -->
<list>
<value>.*\.[Pp][Dd][Ff]$</value>
</list>
</property>
</bean>
But this only excludes downloads based on file endings, so I added another rule:
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>
This has no effect, no entries in In
My last idea was to reject them on write, i.e. to add the following property to the warcWriter bean:
<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>
</property>
This has no effect. Help would be very much appreciated.
Replies: 8 comments
-
Hi @oschihin ,
With this configuration the REJECTs should be logged in warcWriterScope.log. You would still see a 200 for these URLs in the crawl.log because they will be requested to determine what the content type is. This is also why putting that rule in the initial "scope" DecideRuleSequence doesn't prevent crawling of the URLs: the content type isn't known at that point.
-
As an optimization to avoid downloading the full PDFs, it seems you can also configure Heritrix to do a mid-fetch abort after receiving the response header, using the FetchHTTP shouldFetchBodyRule property. I haven't tried this, so I'm uncertain whether the partial record still gets written to the WARC. If it does, this would need to be used in conjunction with the WarcWriter shouldProcessRule, as in ldko's example above.
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
<!-- Avoid downloading the response body for resources we're not going to keep. -->
<property name="shouldFetchBodyRule">
<ref bean="warcWriterScope"/>
</property>
</bean>
-
Thanks for your hints; they sounded very promising. But I spent a few hours testing and it simply does not work. I configured the following, with different regex options (see this gist for the full config). At the top level:
<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="logToFile" value="true" />
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.AcceptDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regex" value="html" />
<!-- <property name="regex" value="^application/[pz][di][fp]$"/> -->
</bean>
</list>
</property>
</bean>
...
...
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
<!-- Avoid downloading the response body for resources we're not going to keep. -->
<property name="shouldFetchBodyRule">
<ref bean="warcWriterScope"/>
</property>
</bean>
...
...
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterChainProcessor">
<property name="compress" value="true" />
<property name="prefix" value="StABS-WA" />
<property name="maxFileSizeBytes" value="1073741824" /> <!-- 1 GB -->
<!-- <property name="poolMaxActive" value="1" /> -->
<!-- <property name="MaxWaitForIdleMs" value="500" /> -->
<property name="skipIdenticalDigests" value="true" />
<property name="maxTotalBytesToWrite" value="107374182400" /> <!-- 100 GB as a safety measure -->
<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
<!-- ...Add scope to limit what is written to WARCs... -->
<ref bean="warcWriterScope"/>
</property>
</bean>
Questions
-
I think I worked out a real URL, and I checked the content type in the response:
i.e. responses are using type parameters to indicate character set, and the module just uses the whole Content-Type string, so the RegEx has to account for that. I'm guessing these RegExen should work (if I'm remembering my syntax correctly):
I think this is all consistent with what I subsequently found here: https://stackoverflow.com/questions/3493786/how-do-i-exclude-everything-but-text-html-from-a-heritrix-crawl However, the EDIT hmm, the
So maybe the |
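To illustrate the point about type parameters, here is a minimal sketch of a rule whose regex tolerates an optional parameter suffix such as `; charset=UTF-8`. It uses only the class and property names already shown in this thread; the regex itself is an assumption and has not been tested against a live crawl.

```xml
<!-- Sketch, not a tested config: reject PDF and ZIP responses even when the
     server appends parameters (e.g. "application/pdf;charset=UTF-8"). -->
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <!-- (?i) makes the match case-insensitive; the optional trailing group
       absorbs any "; param=value" suffix after the media type. -->
  <property name="regex" value="(?i)^application/(pdf|zip)(\s*;.*)?$"/>
</bean>
```

Unlike the character-class form `[pz][di][fp]`, which also matches strings such as `pip` or `zdf`, the alternation `(pdf|zip)` matches only those two subtypes.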
-
@anjackson this is what I stumbled upon yesterday before falling asleep. Now I ran a test on a simpler website and excluded jpeg. You are absolutely right:
Exclude
-
Thanks everybody, this worked, with some follow-up questions that I will ask in another ticket. I'll close here.
-
Hi! Should this be considered the "right" method for avoiding a specific type of content, @ato? Is there no easier or more intuitive method? Is there a more generic way to download only text that doesn't involve identifying every text-related content type (e.g. text/html, text/plain)?
-
I have had no need for this in my own work with Heritrix so it's not something I've thought a lot about but it seems like the most reasonable approach to strictly blocking PDFs.
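For the text-only case asked about above, one option may be to invert the rule sequence: reject by default, then accept anything whose content type starts with `text/`. This is a hedged sketch rather than a tested setup. It reuses the `warcWriterScope` bean id from earlier in the thread and assumes that `RejectDecideRule` is available alongside the `AcceptDecideRule` shown there.

```xml
<!-- Sketch under assumptions: keep only text/* responses. -->
<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="logToFile" value="true"/>
  <property name="rules">
    <list>
      <!-- Start from REJECT (RejectDecideRule is assumed to exist here). -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
      <!-- Then ACCEPT any text/* type, with or without parameters. -->
      <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="ACCEPT"/>
        <property name="regex" value="(?i)^text/.*$"/>
      </bean>
    </list>
  </property>
</bean>
```

As with the PDF case, the URLs would still be requested, because the content type is only known once the response header arrives.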
You could prevent the following of embed links, i.e. those discovered via
This should exclude a lot of it, but obviously with this rule it's still possible for non-text URIs to be visited if they're linked via a regular navigation link such as
Hi @oschihin ,
I think you are on the right track. You should be able to reject the mimetypes in the warcWriter bean. This works for me to reject image/jpeg types: