Alex Osborne edited this page Jul 4, 2018 · 3 revisions

WARC File Format

The WARC file format is the successor to the ARC format, which has been used for many years to store the Internet Archive's web captures. Small example ARC and WARC (v0.17) files from a shallow (~2 hops) Heritrix crawl of the www.archive.org website are attached to this wiki page. Larger, more representative ARC and WARC files are easy to create with any recent release of Heritrix.

Compared to ARC, WARC adds:

  1. an extensible set of header fields per record
  2. optional new record types for data and metadata other than HTTP responses (which were all that ARC recorded)
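
To make the two points above concrete, here is a minimal sketch of reading WARC records in Python using only the standard library. It assumes the record layout of the ISO-standard format (a `WARC/<version>` line, `Name: value` header fields, a blank line, then `Content-Length` bytes of content followed by two CRLFs); the attached v0.17 sample may differ in detail. The names `read_warc_record` and `make_record`, and all sample field values, are hypothetical and exist only for illustration. This is not how Heritrix itself reads or writes WARC files.

```python
import io


def read_warc_record(stream):
    """Read one record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    Simplifications (assumptions of this sketch): folded header lines are
    not handled, and a repeated header name overwrites the earlier value.
    """
    line = stream.readline()
    # Skip the blank lines that separate consecutive records.
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC version line: {version!r}")

    # The header block is open-ended: read fields until the blank line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The content block is read by length, so it may contain blank lines.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


def make_record(warc_type, content_type, body, record_number):
    """Build a synthetic record for the demo (all values are made up)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-{record_number:012d}>\r\n"
        "WARC-Date: 2008-04-30T20:48:25Z\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# One HTTP response record and one non-HTTP metadata record.
sample = make_record(
    "response",
    "application/http; msgtype=response",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>",
    1,
) + make_record("metadata", "text/plain", b"hops: 2", 2)

records = []
stream = io.BytesIO(sample)
while True:
    record = read_warc_record(stream)
    if record is None:
        break
    records.append(record)

for version, headers, payload in records:
    print(version, headers["WARC-Type"], len(payload), "bytes")
```

The parser never needs to know the set of header fields in advance, which is what lets a record carry an open-ended amount of header information, and the `WARC-Type` field is what distinguishes a captured HTTP response from other kinds of record. For a compressed file such as the attached `.warc.gz`, wrapping the file with `gzip.open(path, "rb")` before passing it to the function should work, though that has not been checked against the attachment.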

ISO Standard

In May 2009, the proposed WARC standard was approved as ISO 28500:2009. The latest versions of Heritrix output WARC files that conform to this standard, as described at http://bibnum.bnf.fr/WARC/ (latest draft as of November 2008).

Attachments:

IAH-20080430204825-00000-blackbook.arc.gz (application/gzip)
IAH-20080430204825-00000-blackbook.warc.gz (application/gzip)
