Background Reading

Alex Osborne edited this page Jul 4, 2018 · 3 revisions

Must reads

Heydon, A; Najork, M. Mercator: A Scalable, Extensible Web Crawler (wayback: http://web.archive.org/web/*/http://research.compaq.com/SRC/mercator/papers/www/paper.html), 1999
Heydon, A; Najork, M. High-performance web crawling, 2001
Kimpton, Stata, Mohr. Internet Archive Crawler Requirements Analysis for library consortium, 2003
Lee, H; Leonard, D; Wang, X; Loguinov, D. IRLbot: Scaling to 6 Billion Pages and Beyond (new from WWW2008)

Nice to reads

Najork, M.; Wiener, J. Breadth-First Search Crawling Yields High-Quality Pages, 2001
Cho, J.; Garcia-Molina, H.; Page, L. Efficient Crawling Through URL Ordering, 1998
Abiteboul, S.; Preda, M.; Cobena, G. Computing web page importance without storing the graph of the web (extended abstract), 2001
Olston, C.; Pandey, S. Recrawl Scheduling Based on Information Longevity (new from WWW2008)

Information on Java with respect to Heritrix/crawling

Heydon, A; Najork, M. Performance Limitations of the Java Core Libraries (may not reflect the latest Java issues; Heritrix uses a high-performance DNS package)

The following studies (which may also be outdated with respect to current Java and our implementation choices) are available on the archive-crawler Yahoo Group files page:

Reddy, G. B. Study of synch vs. asynch IO in Java
Reddy, G. B. Study of multi-threaded DNS performance in Java

Others

Archive-crawler group files
Cho, J.; Garcia-Molina, H. The Evolution of the Web and Implications for an Incremental Crawler, Conf. on Very Large Data Bases, 2000
Focused Crawling: The Quest for Topic-specific Portals
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, 1999, WWW8
Intelligent Crawling on the World Wide Web with Arbitrary Predicates, 2001, WWW10
Web Crawling High-Quality Metadata using RDF and Dublin Core, 2002, WWW11
Stanford WebBase Project
An Introduction to Heritrix - Mohr et al, 4th International Web Archiving Workshop 2004

Relevant specifications

RFC 2616: Hypertext Transfer Protocol - HTTP/1.1
- Clarifying the fundamentals of HTTP, by Jeffrey Mogul, an author of RFC 2616.
RFC 3986: Uniform Resource Identifier (URI): Generic Syntax.
HTML 4.01 specification (from W3C).
Although robots.txt is important for crawling, it has never been officially ratified as an RFC. The de facto minimal spec lives at robotstxt.org. Search engines have made a number of ad hoc extensions; Google recently shared some info about how Googlebot implements the Robots Exclusion Protocol.
RFC 1034: Domain Names - Concepts and Facilities
RFC 1035: Domain Names - Implementation and Specification

Attachments

crawler-requirements-2003-03.htm (text/html)
Mohr-et-al-2004.pdf (application/pdf)
1998-Cho-efficient.pdf (application/pdf)
1999-Heydon-javalimits.pdf (application/pdf)
1999-Hirai-webbase.pdf (application/pdf)
1999-Mercator.pdf (application/pdf)
2000-Broder-webgraph.pdf (application/pdf)
2000-Cho-incremental.pdf (application/pdf)
2001-Abiteboul-crawlorder.pdf (application/pdf)
2001-Arasu-search.pdf (application/pdf)
2001-Najork-breadthfirst.pdf (application/pdf)
2001-Najork-highperf.pdf (application/pdf)
2002-Guillaume-webgraph.pdf (application/pdf)
2008-IRLBot.pdf (application/pdf)
2008-Olston-recrawl.pdf (application/pdf)
2002-Shkapenyuk-polybot.pdf (application/pdf)
