Crawl Job Settings Documentation for 3.4.0 #513

mutlurasit · 2020-07-30T14:00:36Z

Hello,

I am trying to understand how to use crawler-beans.cxml in Heritrix but can't find any best-practice articles or a detailed manual anywhere. There are a few pieces of documentation floating around the net and some explanations inside the .cxml itself, but they are either outdated or incomplete.

The wiki has a good page for basic usage but, as the name suggests, it is basic. The rest seems to refer to an older version of Heritrix.

Is there any source I am missing or a place where I can examine good practice examples?

Sorry if this is not the appropriate platform for the request.

hennekey · 2020-07-30T15:30:41Z

Hey @mutlurasit, can you elaborate a little on what you are trying to do?

The cxml file is essentially the configuration for a "type" of crawl. It describes how a crawl is set up to run.
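
As a rough illustration of what that configuration looks like: the file is a Spring beans document, and each crawl component is declared as a bean whose properties can be set. The fragment below is a sketch written from memory of the stock profile, not a copy of the 3.4.0 file; the bean id, class names, and the example seed URL are assumptions to check against your own crawler-beans.cxml. It belongs inside the file's top-level `<beans>` element.

```xml
<!-- Sketch only: seed list declared inline as a bean property.
     Bean id and class names are assumed from the stock profile. -->
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
# one seed URL per line (example URL is a placeholder)
https://example.org/
        </value>
      </property>
    </bean>
  </property>
</bean>
```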

mutlurasit (Author) · 2020-07-31T10:37:10Z

Thanks for your reply, @hennekey. I am trying to do things like:

Feeding a list of seeds to Heritrix but getting individual WARC files per site instead of one for the whole crawl. My current workaround is to create a separate job for each domain, but I think that will become a problem as I crawl more and more sites.
Related to that, I want to set up regular crawls for certain sites (for example, an annual crawl), but I am not sure how to configure this.

hennekey · 2020-07-31T13:44:57Z

@mutlurasit Are you code-savvy? You can implement custom processors to replace the ones you see in the cxml file to achieve custom behavior, like creating separate WARC files per domain. I do not believe that the existing code contains that logic.

Annual recrawling would be best accomplished with a cron schedule, I think, unless you want to keep the process running (and expect it to do so without issue) for the whole duration.

mutlurasit (Author) · 2020-07-31T14:53:30Z

Thanks again, @hennekey. I can't say I am that competent with coding, but I would appreciate it if you could suggest any source I can dig around in. Do you know which part of the cxml (if any) deals with WARC creation?

Regarding scheduling, you are right, that sounds reasonable. I will look into that as well.

hennekey · 2020-07-31T15:06:54Z

This is where the code finds the writer (and consequently the file) to use to persist data to a WARC: https://github.com/internetarchive/heritrix3/blob/master/modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java#L155
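
On the cxml side of the question, WARC output is typically configured on a writer bean in the processing chain. The fragment below is a sketch from memory, not a copy of the 3.4.0 profile: the bean id, the class (taken from the file linked above; your profile may declare a different writer class), and the example values are all assumptions to verify against your own crawler-beans.cxml.

```xml
<!-- Sketch only: writer bean with a few commonly tuned properties.
     Values are illustrative, not recommendations. -->
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
  <!-- filename prefix shared by every WARC this writer produces -->
  <property name="prefix" value="MYCRAWL" />
  <!-- gzip each record -->
  <property name="compress" value="true" />
  <!-- roll over to a new file once this size is reached -->
  <property name="maxFileSizeBytes" value="1000000000" />
</bean>
```

Note that a prefix set this way applies to all files the writer produces, so on its own it does not split output per domain.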

mutlurasit (Author) · 2020-07-31T15:09:23Z

Thank you very much one more time!

cgr71ii · 2022-09-06T08:21:17Z

Hi! Regarding the original question, is there any resource where the documentation is up to date and complete? I'm facing the same problem: the resources I have found are ReadTheDocs and the wiki, and they are either incomplete or basic (e.g. the logToFile property is not documented) or out of date. Is this a problem that will be solved, or are users expected to dig into the code in order to understand the advanced options?

ato (Maintainer) · 2022-09-06T09:28:39Z

> e.g. logToFile property not documented

Looks like the logToFile property exists (1) on DecideRuleSequence and (2) on everything inheriting from Scoper.

1. I've added DecideRuleSequence to the bean reference in f736bf2.
2. I've filed the bug "Bean reference missing inherited properties" (#497) but probably won't work on this myself right now.
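
For anyone landing here looking for how the property is set, a minimal sketch follows. It assumes the scope is declared as a DecideRuleSequence bean with the id `scope`, as I recall from the stock profile; the two rules shown are only there to make the fragment readable, and the rest of the rule list is elided.

```xml
<!-- Sketch only: enabling logToFile on the scope's DecideRuleSequence.
     Bean id and rule classes are assumed from the stock profile. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="logToFile" value="true" />
  <property name="rules">
    <list>
      <!-- start by rejecting everything, then accept by seed-derived SURT prefixes -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="seedsAsSurtPrefixes" value="true" />
      </bean>
      <!-- remaining rules from your own profile go here -->
    </list>
  </property>
</bean>
```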

> is there any resource where documentation is updated and complete?

The Java API documentation is complete in the sense of listing every class and property.

> Is this a problem which will be solved or is expected to dig into the code in order to understand the advanced options?

Digging into the code is sometimes a practical necessity to fully understand some of the options and behavior. Heritrix has no dedicated developers and problems are generally solved by affected users contributing fixes. :-)
