
Releases: internetarchive/heritrix3


03 Feb 05:26
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New Features

  • Groovy crawl configs (experimental): Groovy Bean Definition DSL can now be used as an experimental alternative to Spring XML. This enables more terse and human-readable job configuration with inline scripting capabilities. There is no user interface for it in this release. For now, you must manually create a crawler-beans.groovy file in your job directory. #632

  • ExtractorHTML obeyRelNofollow: This option skips extraction of links marked rel=nofollow. This is useful for avoiding crawler traps on some sites. #638
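The idea behind obeyRelNofollow can be sketched in a few lines. This is an illustration in Python, not Heritrix's Java implementation: the names `LinkCollector`, `extract_links`, and `obey_rel_nofollow` are hypothetical, and it only looks at `<a>` tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, optionally skipping rel=nofollow links."""

    def __init__(self, obey_rel_nofollow):
        super().__init__()
        self.obey_rel_nofollow = obey_rel_nofollow  # hypothetical stand-in for the option
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel holds space-separated tokens, e.g. rel="nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if self.obey_rel_nofollow and "nofollow" in rel_tokens:
            return  # skip extraction of this link
        self.links.append(href)


def extract_links(html, obey_rel_nofollow):
    """Return the hrefs found in html, honouring the nofollow switch."""
    collector = LinkCollector(obey_rel_nofollow)
    collector.feed(html)
    collector.close()
    return collector.links
```

With the switch off every href is returned. With it on, any link whose rel attribute contains the nofollow token is dropped, which is how calendar-style or faceted-search traps marked by the site owner would be avoided.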


  • Cookie rejected warning: The slf4j change in 3.6.0 inadvertently caused a previously hidden warning to be logged to job.log when a server sends a Set-Cookie header with a disallowed domain value. This warning is now suppressed since it occurs frequently and does not require any action from the crawl operator. #640


  • Removed fastutil: A small number of usages of fastutil were replaced with standard library equivalents in webarchive-commons and Heritrix. This reduced the Heritrix distribution size from 51 MB to 34 MB. iipc/webarchive-commons#101

Dependency Upgrades

  • amqp-client 5.24.0
  • commons-codec 1.17.2
  • ftpserver-core 1.2.1
  • freemarker 2.3.34
  • jetty 9.4.57.v20241219
  • jsch 0.2.22
  • restlet 2.5.0
  • spring 6.1.16
  • webarchive-commons 1.3.0


29 Nov 12:08
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Java Compatibility Notice

This release of Heritrix requires Java 17 or later.

New Features

  • Automatic Checkpoints on Shutdown: Added checkpointOnShutdown option to CheckpointService to enable automatic checkpoints if Heritrix is gracefully terminated. #626
  • Command-Line Checkpoint Selection: The --checkpoint command-line option restarts from a named checkpoint when using the --run-job option. #626
  • ConfigurableExtractorJS forceStrictIfUrlMatchingRegexList: URLs matching the regular expressions on this list will be processed in strict mode, with only absolute URLs extracted, not relative ones. #624
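As a rough illustration of the forceStrictIfUrlMatchingRegexList behaviour, the Python sketch below filters candidate links depending on whether the script's own URL matches a pattern list. It is not the Heritrix code: the function names, the example pattern, and the choice of full-string matching are all assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern list; a real job would configure its own.
FORCE_STRICT_PATTERNS = [re.compile(r"https?://cdn\.example\.com/.*\.js")]


def is_absolute(candidate):
    """True for http(s) URLs that carry both a scheme and a host."""
    parts = urlsplit(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def use_strict_mode(source_url, patterns):
    """Strict mode applies when the script URL matches any pattern (full match assumed)."""
    return any(pattern.fullmatch(source_url) for pattern in patterns)


def filter_candidates(source_url, candidates, patterns):
    """Keep only absolute URLs in strict mode; keep everything otherwise."""
    if not use_strict_mode(source_url, patterns):
        return list(candidates)
    return [candidate for candidate in candidates if is_absolute(candidate)]
```

For a script served from the hypothetical `cdn.example.com`, relative strings such as `lib/app.js` are discarded, while scripts from other hosts keep their full candidate list.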


  • Upgraded to Spring Framework 6.1: Spring 6 removed the @Required annotation, so Heritrix now provides a custom implementation to maintain backward compatibility with existing crawl configurations. Spring 6 requires Java 17, so Heritrix now does too. #625


  • Manifest Hop Priority: Links from sitemaps are now given the same priority as normal navigation links. They were incorrectly being prioritized as transitive hops (embeds). #623
  • SLF4J Logging: Heritrix now includes slf4j-jdk14 to eliminate a startup warning message and fix logging for dependencies (such as crawler-commons) that use SLF4J. Heritrix doesn't use SLF4J itself. #628

Dependency Upgrades

  • amqp-client 5.23.0
  • commons-cli 1.9.0
  • commons-codec 1.17.1
  • commons-io 2.18.0
  • commons-net 3.11.1
  • crawler-commons 1.4
  • dnsjava 3.6.2
  • easymock 5.5.0
  • freemarker 2.3.33
  • groovy 4.0.24
  • gson 2.11.0
  • httpcomponents 4.5.14
  • java-socks-proxy-server 4.1.2
  • java-websocket removed
  • jaxb-runtime 4.0.5
  • jsch switched to mwiede fork 0.2.21
  • junit 4.13.2
  • kafka-clients 3.9.0
  • kryo 5.6.2
  • pdfbox 3.0.3
  • slf4j 2.0.16
  • spring-framework 6.1.15
  • webarchive-commons 1.2.0


29 Oct 06:58
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

End of interim releases

This release drops the term "interim release", which distinguished releases made temporarily by the community in the absence of releases made by Internet Archive. The community releases have effectively become the official releases.

In conjunction with this, the version numbers, which were paused at 3.4.0 for the interim releases, have now resumed incrementing. They follow the scheme major.minor.patch, with the minor release number incremented when features are added or removed.
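For tooling that needs to compare release numbers under this scheme, a minimal Python sketch follows. The helper names are hypothetical, and the version strings other than 3.4.0 are purely illustrative.

```python
def parse_version(text):
    """Split 'major.minor.patch' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def changed_component(old, new):
    """Name the most significant component that differs, or None if equal."""
    names = ("major", "minor", "patch")
    for name, before, after in zip(names, parse_version(old), parse_version(new)):
        if before != after:
            return name
    return None
```

Comparing the parsed tuples rather than the raw strings matters once a component reaches two digits: as strings, "3.10.0" sorts before "3.9.2", but as tuples it sorts after.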

Java compatibility notice

This will likely be the last release of Heritrix compatible with Java 8. The next release is expected to require Java 17 or later.

Changes in this release


  • Removed HBase modules from contrib. #621


  • ConfigurableExtractorJS: Set default value (false) for strict property. #612
  • ExtractorHTML: Treat cite attribute as a navlink instead of embed. #608
  • Building no longer requires the or Cloudera repositories. #614
  • Updated to new URL of the restlet repository.

Dependency Upgrades

  • Removed hbase, joda-time, log4j
  • commons-io 2.14.0
  • kafka-clients 3.8.0
  • ftpserver-core 1.2.0
  • jetty 9.4.56.v20240826
  • webarchive-commons 1.1.10

2024-09-09 Interim Release

09 Sep 13:05
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Compatibility Note

Checkpoints and crawl state created with older versions of Heritrix will not be loadable as kryo has been significantly updated. Replaying the recovery log may be an alternative in some cases.

New Features

  • JDK 22 support
  • Added ConfigurableExtractorJS for more flexible JavaScript extraction. (#602)
  • Added HostnameQueueAssignmentPolicyWithLimits with optional name length limits. (#598)
  • ExtractorHTML can now extract more variants of alternative resolution image URLs. (#605)
    • Attributes are now matched case-insensitively (previously src and SRC worked but not Src)
    • New <img> attributes: data-full-src, data-lazy-srcset, data-src-small, data-src-medium
    • New <link> attribute: imagesrcset
  • ExtractorHTTP can now be configured with extra inferred paths (#597)
  • ExtractorYoutubeDL metadata records can now be optionally logged to crawl.log (#593)
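The case-insensitive attribute matching described for ExtractorHTML can be illustrated with a short Python sketch. It is not the Heritrix implementation: the class and function names are hypothetical, only the attributes named above (plus src) are checked, and srcset-style values are returned unparsed.

```python
from html.parser import HTMLParser

# Attribute names taken from the list above, stored in lower case.
URL_ATTRIBUTES = {
    "img": {"src", "data-full-src", "data-lazy-srcset",
            "data-src-small", "data-src-medium"},
    "link": {"imagesrcset"},
}


class ImageAttributeCollector(HTMLParser):
    """Records (tag, attribute, value) for image-bearing attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRIBUTES.get(tag, set())
        for name, value in attrs:
            # Normalise the name so src, SRC and Src are all treated alike.
            # (Python's HTMLParser already lower-cases names; the explicit
            # call just makes the comparison rule visible.)
            if value and name.lower() in wanted:
                self.found.append((tag, name.lower(), value))


def collect_image_attributes(html):
    collector = ImageAttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.found
```

Splitting srcset-style values (data-lazy-srcset, imagesrcset) into individual URLs is left out here, since each comma-separated candidate also carries a width or density descriptor.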


  • Removed ExtractorChrome from contrib (#601)


  • Reduced false positive speculative URLs from meta tags (#595)
  • Fixed BdbModule resource leak on job teardown (f428001)
  • Corrected function name in ScriptedProcessor Javadoc. (#599)
  • Updated Maven builds to use HTTPS for resolving dependencies.
  • Reset CrawlURI status for hasPrerequisite() so that it isn't preserved between attempts (#600)
  • Fixed older junit3 tests not being run (#592)
  • Increased DiskSpaceMonitor default pause threshold to 8 GiB to avoid BDB issue (#499)
  • Stopped logging authentication failures when auth header is missing (#539)
  • Fixed console still showing job running after crash (#549)
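To make the DiskSpaceMonitor threshold concrete, here is a small Python sketch of a free-space check against 8 GiB. It only illustrates the arithmetic. The function names are hypothetical, and treating "pause" as "free space strictly below the threshold" is an assumption, not a statement about how Heritrix compares the values.

```python
import shutil

GIB = 1024 ** 3
PAUSE_THRESHOLD_BYTES = 8 * GIB  # the default named in the note above


def should_pause(free_bytes, threshold_bytes=PAUSE_THRESHOLD_BYTES):
    """Assumed rule: pause once free space drops below the threshold."""
    return free_bytes < threshold_bytes


def path_needs_pause(path):
    """Apply the rule to the filesystem that holds the given path."""
    return should_pause(shutil.disk_usage(path).free)
```

Note that 8 GiB is 8,589,934,592 bytes, slightly more than 8 GB (8,000,000,000 bytes), which matters when comparing against tools that report decimal units.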

Dependency Upgrades

  • Transitioned PDFParser and ExtractorPDF to pdfbox (#575)
  • Transitioned ExtractorYoutubeDL to yt-dlp
  • commons-net 3.9.0
  • com.rabbitmq:amqp-client 5.18.0
  • dnsjava 3.6.0
  • groovy 4.0.21
  • kryo 5.6.0
  • spring-expression 5.3.39

2022-07-27 Interim Release

28 Jul 08:21

This is the 2022-07-27 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here

2021-09-23 Interim Release

30 Sep 12:56

This is the 2021-09-23 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here

2021-08-03 Interim Release

03 Aug 09:44

This is the 2021-08-03 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

This release includes:

  • Upgrades http-client to version 4.5, including improved cookie handling and expiration.
  • A new browser-based extraction module, ExtractorChrome.
  • JDK 16 compatibility improvements.
  • Many more smaller fixes and improvements (see changelog).

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here

2021-06-17 Interim Release

17 Jun 15:48

This is the 2021-06-17 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

IMPORTANT: This release was accidentally built with Java 15 and, due to changes in the run-time libraries, it is not compatible with Java 8 (Java 9 or later should work fine).

This release improves sitemap extraction, and fixes a bug that can interfere with checkpoint creation.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here

2021-05-27 Interim Release

27 May 15:36

This is the 2021-05-27 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

Notably, this release includes new modules for finding and using sitemaps. See: Support for extracting URLs in sitemaps #262

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here

2020-05-18 Interim Release

18 May 14:30

This is the 2020-05-18 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.

This release features new modules to support archiving over SFTP. Captures are currently stored as response records rather than the resource records that have been more widely used in the past. The next release will resolve this as per this pull request.

The binaries are available via Maven Central here.

If you want the distribution package, you can download it from here