Archiving Rich Media Content
This section of the Heritrix user guide examines the challenges facing archivists of rich-media content and offers specific suggestions for configuring the Heritrix crawler so that it is optimized to handle these challenges.
Rich-media includes kinds of advanced Web content that give users a greater range of experience than normal text/html Web pages. Rich-media encompasses video, animation, images, and audio. It also includes interactive technologies that go beyond basic FORM data entry. Examples include wikis, such as Wikipedia; technologies like AJAX that give users fine-grained control over Web content; and Flash technology, which exposes a vast array of user interface widgets that allow spatial arrangements of data well beyond the capabilities of first-generation Web sites. Rich-media is also characterized by larger file sizes than normal text/html pages.
Rich-media content, such as Flash and video, is usually much larger than standard text/html pages, so crawling it requires large investments in storage and bandwidth. To mitigate these costs, deduplication is recommended for rich-media crawls. Deduplication detects content that was already collected in an earlier crawl and skips downloading it again. Pointers to the previously stored copy allow the content to appear in subsequent crawls. For details see Configuring Heritrix for Deduplication.
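The idea behind deduplication can be illustrated with a minimal shell sketch that compares content digests. This is not Heritrix's implementation: the index file, the `dedup_check` function, and the `DECISION` variable are hypothetical names chosen for illustration, and the sketch assumes the `sha1sum` utility is available. Comparing digests requires fetching the payload first, so this variant saves storage rather than bandwidth.

```shell
# Hypothetical index file: one "digest URI" line per payload already stored.
DEDUP_INDEX=${DEDUP_INDEX:-./dedup-index.txt}

# dedup_check URI FILE
# Sets DECISION to "store" when the payload in FILE has not been seen before,
# or to "duplicate-of <URI>" when an identical payload (same SHA-1 digest)
# was already recorded under another URI.
dedup_check() {
  uri=$1
  file=$2
  touch "$DEDUP_INDEX"
  digest=$(sha1sum "$file" | cut -d' ' -f1)
  # Look up the first URI recorded for this digest (string prefix match).
  first=$(awk -v d="$digest" 'index($0, d " ") == 1 { print $2; exit }' "$DEDUP_INDEX")
  if [ -n "$first" ]; then
    DECISION="duplicate-of $first"
  else
    printf '%s %s\n' "$digest" "$uri" >> "$DEDUP_INDEX"
    DECISION="store"
  fi
}
```

Calling `dedup_check` for two URIs that deliver byte-identical files records the first and reports the second as a duplicate of it, which is the "pointer" role described above.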
Many rich-media technologies allow links to be embedded in file formats that are not conducive to link extraction. When crawling a rich-media site, it is therefore important to determine whether the site has a site-map. A site-map is an HTML page that contains links to all the important pages on the site. Adding the URI of the site-map as a seed to a rich-media crawl lets the crawler discover and archive pages whose links would not otherwise be extracted.
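As a rough sketch, a site-map URI could be added to a crawl's seeds as follows. This assumes the job reads its seeds from a plain-text file with one URI per line; the `add_seed` helper, the file name, and the example URI are placeholders, not part of Heritrix.

```shell
# add_seed SEEDS_FILE URI
# Appends URI to the seeds file unless an identical line is already present,
# so running it repeatedly does not create duplicate seeds.
add_seed() {
  seeds_file=$1
  uri=$2
  touch "$seeds_file"
  # -x: match the whole line, -F: treat the URI as a fixed string, -q: quiet
  if ! grep -qxF -- "$uri" "$seeds_file"; then
    printf '%s\n' "$uri" >> "$seeds_file"
  fi
}
```

For example, `add_seed seeds.txt "http://example.com/sitemap.html"` would append the site-map page to a hypothetical `seeds.txt` alongside the site's home page.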
Downloading rich-media content can place excessive load on the crawling computer's memory and CPU. For example, extracting links from Flash and other rich-media resources requires extensive data parsing, which is CPU intensive. Atypical input patterns can also cause excessive CPU usage when the regular expressions used by Heritrix are run. It is therefore recommended that rich-media crawls be allocated more memory and CPU than "normal" crawls. The memory allocated to Heritrix is set through the JAVA_OPTS environment variable before Heritrix is launched. The following example sets the maximum Java heap size to 1 GB, which should be sufficient for most rich-media crawls.
export JAVA_OPTS=-Xmx1024M
Multi-core processors are also recommended for rich-media crawls.
Streaming media is media content delivered sequentially over time from a media producer to a media consumer. Examples of streaming media include Internet radio and TV. The term refers to the delivery mechanism of the media, not to the media format itself. Heritrix can capture media streamed over HTTP or FTP, but does not recognize other streaming protocols such as Real Time Streaming Protocol (RTSP). This limitation has generated interest in embedding a media player in Heritrix that does recognize most streaming formats. For more information on embedding a media player in Heritrix, see the paper "Archiving Streaming Media on the Web Proof of Concept and First Results" in the proceedings of the International Web Archiving Workshop 2006 at http://iwaw.europarchive.org/06/PDF/iwaw06-proceedings.pdf.
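Before starting a crawl, it can help to check which media URIs use a protocol Heritrix can capture. The following sketch classifies a URI by its scheme only; the `can_capture` function is a hypothetical helper, it treats HTTPS as HTTP, and it assumes the scheme is written in lowercase.

```shell
# can_capture URI
# Returns 0 when the URI uses a scheme that can be captured (HTTP or FTP,
# with HTTPS treated as HTTP), and 1 for anything else, such as rtsp://.
can_capture() {
  case "$1" in
    http://*|https://*|ftp://*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A seed list could be filtered with this check so that URIs such as `rtsp://example.com/live` are set aside for another capture tool instead of failing silently in the crawl.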
Many social networking sites use rich-media to enhance their user experience. For specific guidelines on archiving social media sites, see Archiving Social Networking Sites with Archive-It. These instructions apply to the Archive-It application, which is built on top of Heritrix.