I'm trying to add the Chrome Extractor to the default crawl config according to the docs, but it fails. Relevant parts of the config:
<!-- Chrome Extractor -->
<bean id="extractorChrome" class="org.archive.modules.extractor.ExtractorChrome">
<!-- Set your desired properties for the Chrome extractor -->
<property name="executable" value="chromium-browser" />
<property name="commandLineOptions">
<list>
<value>--headless=new</value>
<value>--remote-allow-origins=*</value>
</list>
</property>
</bean>
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
<property name="processors">
<list>
<!-- re-check scope, if so enabled... -->
<ref bean="preselector"/>
<!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->
<ref bean="preconditions"/>
<!-- ...fetch if DNS URI... -->
<ref bean="fetchDns"/>
<!-- <ref bean="fetchWhois"/> -->
<!-- ...fetch if HTTP URI... -->
<ref bean="fetchHttp"/>
<ref bean="extractorChrome" />
<!-- ...extract outlinks from HTTP headers... -->
<ref bean="extractorHttp"/>
<!-- ...extract sitemap urls from robots.txt... -->
<ref bean="extractorRobotsTxt"/>
<!-- ...extract links from sitemaps... -->
<ref bean="extractorSitemap"/>
<!-- ...extract outlinks from HTML content... -->
<ref bean="extractorHtml"/>
<!-- ...extract outlinks from CSS content... -->
<ref bean="extractorCss"/>
<!-- ...extract outlinks from Javascript content... -->
<ref bean="extractorJs"/>
<!-- ...extract outlinks from Flash content... -->
<ref bean="extractorSwf"/>
</list>
</property>
</bean>
<!-- CRAWLCONTROLLER: Control interface, unifying context -->
<bean id="crawlController"
class="org.archive.crawler.framework.CrawlController">
<!-- <property name="maxToeThreads" value="25" /> -->
<!-- <property name="pauseAtStart" value="true" /> -->
<!-- <property name="runWhileEmpty" value="false" /> -->
<!-- <property name="recorderInBufferBytes" value="524288" /> -->
<!-- <property name="recorderOutBufferBytes" value="16384" /> -->
<!-- <property name="scratchDir" value="scratch" /> -->
</bean>
There should be a larger stack trace in one of the log files with more details about why it failed to create the bean. Note that ExtractorChrome is part of heritrix-contrib, not the main distribution of Heritrix, so if you're not using contrib, that could be why. It's also a bit half-baked and not really suitable for production use yet. I should probably add a warning to the docs about that.
I solved it by setting lazy-init="true" on the bean!
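For anyone landing here from a search, the change would look roughly like the sketch below. It reuses the bean definition from the question and only adds Spring's standard lazy-init attribute, which asks the container to defer creating the bean until it is first needed. The executable and commandLineOptions values are simply the ones posted above, not recommendations, and the thread does not establish why deferring creation avoids the failure, so treat this as a workaround rather than a confirmed fix.

```xml
<!-- Chrome Extractor: same bean as in the question, with lazy-init added -->
<bean id="extractorChrome"
      class="org.archive.modules.extractor.ExtractorChrome"
      lazy-init="true">
  <!-- Values copied from the original post; adjust for your system -->
  <property name="executable" value="chromium-browser" />
  <property name="commandLineOptions">
    <list>
      <value>--headless=new</value>
      <value>--remote-allow-origins=*</value>
    </list>
  </property>
</bean>
```

The reference to extractorChrome in the fetchProcessors list stays as it was. As noted in the reply, the class comes from heritrix-contrib, so it still has to be on the classpath for the bean to be created at all.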
on the bean! 🎉Here are parts of the stack trace, so future google searches might find this thread