How to exclude PDF files #528

oschihin · 2021-12-14T10:48:09Z

oschihin
Dec 14, 2021

I am aware of the documentation on "Common Heritrix Use Cases" in the wiki on mirroring only HTML files or excluding rich media. Still, I can't get my job to work: it should simply not download and/or write PDF files (and the few ZIPs) to the WARC. The site I am crawling has tons of PDF files in databases (meeting notes, government decisions, policy reports, etc.), and I want to safely exclude them.

What usually works is this bean:

<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="listLogicalOr" value="true" />
    <property name="regexList"> <!-- Adjust the list after log analysis; possibly move it to an external file -->
        <list>
            <value>.*\.[Pp][Dd][Ff]$</value>
         </list>
  </property>
</bean>

But this only excludes downloads based on file endings. So I added another Rule:

<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>

This has no effect, no entries in scope.log.

In crawl.log I have these entries:

2021-12-07T12:02:26.669Z   200     171746 https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?previousAction1=geschaeft&previousAction2=&previousAction3=&previousAction4=&action=download&dokumentId=79e664176005402cabea26e8b591cf77-332&dokumentVersion=5&dokumentAnsicht=Dokument&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 LLLRL https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?action=geschaeft&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 application/pdf #010 20211207120226169+466 sha1:5UZWSGMUDEYGYZDENDJZFGTUVJ3BGJFS https://www.government.example.com -

My last idea was to reject them on write, i.e. add the following property to the warcWriter bean:

<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="REJECT"/>
	    <property name="regex" value="^application\/[pz][di][fp]$"/>
	</bean>
</property>

This has no effect.

Help would be very much appreciated.

Answered by ldko

ldko · 2021-12-14T17:48:49Z

ldko
Dec 14, 2021

Hi @oschihin ,
I think you are on the right track. You should be able to reject the mimetypes in the warcWriter bean. This works for me to reject image/jpeg types:

 <!-- Define WARC scope at top-level, to enable logging -->
 <bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
       <property name="logToFile" value="true" />
       <property name="rules">
         <list>
           <bean class="org.archive.modules.deciderules.AcceptDecideRule">
           </bean>
           <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
             <property name="decision" value="REJECT"/>
             <property name="regex" value="^image/jpeg$"/>
           </bean>
         </list>
      </property>
  </bean>

 <!-- DISPOSITION CHAIN -->
 <!-- first, processors are declared as top-level named beans  -->
 <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
       <property name="shouldProcessRule">
         <!-- ...Add scope to limit what is written to WARCs... -->
         <ref bean="warcWriterScope"/>
       </property>
       <!-- ...other warcWriter properties... -->
 </bean>

With this configuration, the REJECTs should be logged in warcWriterScope.log. You would still see a 200 for these URLs in crawl.log, because they have to be requested to determine the content type. This is also why putting that rule in the initial "scope" DecideRuleSequence doesn't prevent crawling of the URLs: the content type isn't known at that point.

ato · 2021-12-15T05:43:39Z

ato
Dec 15, 2021
Maintainer

As an optimization to save downloading the full PDFs, it seems you can also configure Heritrix to do a mid-fetch abort after receiving the response header, using the FetchHTTP shouldFetchBodyRule property. I haven't tried this, so I'm uncertain whether the partial record still gets written to the WARC. If so, it would need to be used in conjunction with the WarcWriter shouldProcessRule, as in ldko's example above.

<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
    <!-- Avoid downloading the response body for resources we're not going to keep. -->
    <property name="shouldFetchBodyRule"> 
       <ref bean="warcWriterScope"/>
    </property>
</bean>

oschihin · 2021-12-15T15:53:21Z

oschihin
Dec 15, 2021
Author

Thanks for your hints; they sounded very promising. But I spent a few hours testing and it simply does not work. I configured the following, trying different regex options (see this gist for the full config).

At the top level:

<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
	<property name="logToFile" value="true" />
	<property name="rules">
		<list>
    		<bean class="org.archive.modules.deciderules.AcceptDecideRule">
    		</bean>
			<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
				<property name="decision" value="REJECT"/>
     			<property name="regex" value="html" />
				<!-- <property name="regex" value="^application/[pz][di][fp]$"/> -->
			</bean>
		</list>
	</property>
</bean>
...
...
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
    <!-- Avoid downloading the response body for resources we're not going to keep. -->
    <property name="shouldFetchBodyRule"> 
        <ref bean="warcWriterScope"/>
    </property>
</bean>
...
...
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterChainProcessor">
	<property name="compress" value="true" />
	<property name="prefix" value="StABS-WA" />
	<property name="maxFileSizeBytes" value="1073741824" /> <!-- 1 GB -->
	<!-- <property name="poolMaxActive" value="1" /> -->
	<!-- <property name="MaxWaitForIdleMs" value="500" /> -->
	<property name="skipIdenticalDigests" value="true" />
	<property name="maxTotalBytesToWrite" value="107374182400" /> <!-- 100 GB as a safety measure -->
	<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
	<property name="shouldProcessRule">
		<!-- ...Add scope to limit what is written to WARCs... -->
		<ref bean="warcWriterScope"/>
	</property>
</bean>

I tried regex values like ^application\/[pz][di][fp]$, ^application/pdf$, and application/pdf, and, to block most content, also text/html or simply html. None seem to have an effect.
warcWriterScope.log is written, but it only has ACCEPT messages.

Questions

Is there anything wrong in my configuration setup?
Could this be some regex problem?

anjackson · 2021-12-15T22:06:32Z

anjackson
Dec 15, 2021
Maintainer

I think I worked out a real URL, and I checked the content type in the response:

Content-Type: application/pdf;charset=UTF-8

i.e. responses are using type parameters to indicate character set, and the module just uses the whole Content-Type string, so the RegEx has to account for that. I'm guessing these RegExen should work (if I'm remembering my syntax correctly):

^application\/[pz][di][fp].*$
^application\/pdf(|\;+.*)$ (this one forces the ;)

I think this is all consistent with what I subsequently found here: https://stackoverflow.com/questions/3493786/how-do-i-exclude-everything-but-text-html-from-a-heritrix-crawl

However, the html example should really have blocked any content types with 'html' so I guess there is something else wrong. The only way I can think that would happen is if the server was returning weird content types like TEXT/HTML!? Is it possible for Heritrix to interpret these responses with a character set that does not align with the response, to the degree that ASCII characters don't match?!

EDIT hmm, the matches() JavaDoc does say

returns true if, and only if, the entire region sequence matches this matcher's pattern

So maybe the html example needs to be ^.*html.*$ ?

oschihin · 2021-12-16T11:02:51Z

oschihin
Dec 16, 2021
Author

@anjackson this is what I stumbled upon yesterday before falling asleep. Now I ran a test on a simpler website and excluded jpeg. You are absolutely right:

Content-Types come with charset indicators and, in the case of HTTP, msgtype. There can also be a boundary directive; see the documentation.
The regex pattern used must account for the whole sequence.

Exclude `^image\/jpeg.*$`

The resulting WARC-file contains the following content-types, with jpeg missing (first column is count):

 380 Content-Type: application/warc-fields
 379 Content-Type: application/http; msgtype=response
 379 Content-Type: application/http; msgtype=request
 254 Content-Type: text/html;charset=UTF-8
  29 Content-Type: text/css;charset=UTF-8
  28 Content-Type: image/png;charset=UTF-8
  17 Content-Type: application/javascript;charset=UTF-8
  11 Content-Type: image/svg+xml;charset=UTF-8
   8 Content-Type: text/html; charset=iso-8859-1
   5 Content-Type: image/gif;charset=UTF-8
   4 Content-Type: audio/mpeg;charset=UTF-8
   4 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
   3 Content-Type: application/vnd.openxmlformats-officedocument.presentationml.presentation;charset=UTF-8
   2 Content-Type: text/dns
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/xml;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: audio/x-ms-wma;charset=UTF-8
   1 Content-Type: audio/mp4;charset=UTF-8
   1 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
   1 Content-Type: application/msword;charset=UTF-8

Exclude `jpeg`

Here, jpegs are included in the WARC

 516 Content-Type: application/warc-fields
 515 Content-Type: application/http; msgtype=response
 515 Content-Type: application/http; msgtype=request
 254 Content-Type: text/html;charset=UTF-8
 136 Content-Type: image/jpeg;charset=UTF-8
  29 Content-Type: text/css;charset=UTF-8
  28 Content-Type: image/png;charset=UTF-8
  17 Content-Type: application/javascript;charset=UTF-8
  11 Content-Type: image/svg+xml;charset=UTF-8
   8 Content-Type: text/html; charset=iso-8859-1
   5 Content-Type: image/gif;charset=UTF-8
   4 Content-Type: audio/mpeg;charset=UTF-8
   4 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
   3 Content-Type: application/vnd.openxmlformats-officedocument.presentationml.presentation;charset=UTF-8
   2 Content-Type: text/dns
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/xml;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: audio/x-ms-wma;charset=UTF-8
   1 Content-Type: audio/mp4;charset=UTF-8
   1 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
   1 Content-Type: application/msword;charset=UTF-8

So: Use that .* in your regex, or be more precise, if you must.
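
Putting the findings above together for the original PDF/ZIP case, the WARC scope might look like the sketch below. It reuses only the beans and properties already shown in this thread. The regex itself is an assumption and has not been run against a live crawl. It spells out the two subtypes instead of using [pz][di][fp], which also matches unintended combinations such as application/pif.

```xml
<!-- Sketch only: the regex is an assumption, not tested against a live crawl. -->
<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <!-- Start by accepting everything... -->
            <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
            <!-- ...then reject PDF and ZIP responses. The pattern has to match the
                 whole Content-Type value, so the optional group covers trailing
                 parameters such as ";charset=UTF-8". -->
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="REJECT" />
                <property name="regex" value="^application/(pdf|zip)(\s*;.*)?$" />
            </bean>
        </list>
    </property>
</bean>
```

Referenced from the warcWriter shouldProcessRule as in ldko's example, the REJECT decisions should then show up in warcWriterScope.log, which is a quick way to check that the pattern really matches.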

oschihin · 2021-12-20T14:17:46Z

oschihin
Dec 20, 2021
Author

Thanks everybody, this worked. I have some follow-up questions that I will ask in another ticket, so I'll close here.

cgr71ii · 2022-08-31T19:54:20Z

cgr71ii
Aug 31, 2022

Hi! Should this be considered the "right" method for avoiding a specific content type, @ato? Is there no easier or more intuitive method? Is there a more generic way to download only text that doesn't involve identifying all the text-related content types (e.g. text/html, text/plain)?

ato · 2022-09-01T04:33:15Z

ato
Sep 1, 2022
Maintainer

Should this be considered the "right" method for avoiding a specific content type, @ato?

I have had no need for this in my own work with Heritrix so it's not something I've thought a lot about but it seems like the most reasonable approach to strictly blocking PDFs.

Is there a more generic way to download only text that doesn't involve identifying all the text-related content types

You could prevent the following of embed links, i.e. those discovered via <img> and <script> tags by adding a rule to the end of the scope like this:

<bean id="rejectEmbeds" class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
    <property name="regex" value=".*E.*"/>
    <property name="decision" value="REJECT"/>
</bean>

This should exclude a lot of it, but obviously with this rule it's still possible for non-text URIs to be visited if they're linked via a regular navigation link such as <a href=foo.jpg>. So if you need to be strict about it, this would need to be used in combination with a shouldFetchBodyRule and WarcWriter shouldProcessRule, as discussed above, to select the specific content types you want to keep.
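
For the strict "text only" variant, one possible shape of that content-type rule is sketched below. It is untested: the bean id textOnlyScope is hypothetical, and the regex relies on a negative lookahead, which Java's regex engine supports. How responses without a Content-Type header are treated is not covered here, so the scope log is worth checking.

```xml
<!-- Sketch only: bean id "textOnlyScope" and the regex are assumptions. -->
<bean id="textOnlyScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
            <!-- Reject every Content-Type value that does not start with "text/".
                 The lookahead fails on "text/...", so those are left accepted. -->
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="REJECT" />
                <property name="regex" value="^(?!text/).*$" />
            </bean>
        </list>
    </property>
</bean>
```

The same bean could then be referenced from both shouldFetchBodyRule and shouldProcessRule, in the same way warcWriterScope is referenced in the examples above.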
