
Chapter 1: Anatomy of a Crawler


Definition of a "Web Crawler"

The definition of a "Web Crawler" may vary, but the one I am going to adopt here is "an automated piece of software that is capable of retrieving data from a specific website, used as its source". Other than capturing data, Web-Crawlers may also be used for indexing web pages (just like search engines do), looking for specific content and even simulating actions within a website (filling forms, clicking buttons, submitting data and so forth). The latter doesn't appear as often, but it is really powerful.

Usual Crawling Behavior

Studying the behavior of Web-Crawlers is really simple, mainly because the "actions" they perform happen as part of a cycle that repeats itself until the process is done fetching the data it needs. Generally speaking, a crawler goes like this: Execute an HTTP request for a page -> Scrape / parse its content -> Reach the next links -> Repeat.
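
Just to make that cycle a bit more concrete, here is a minimal sketch of the loop in Python. This is only an illustration of the idea, not the code from this project's "Chapter 1" sub-project; it assumes the third-party requests and beautifulsoup4 packages, and the starting URL is a placeholder.

```python
# Minimal sketch of the crawl cycle: request -> parse -> collect links -> repeat.
from collections import deque
from urllib.parse import urljoin

import requests                      # third-party, assumed installed
from bs4 import BeautifulSoup        # third-party, assumed installed

def crawl(start_url, max_pages=10):
    to_visit = deque([start_url])    # frontier of links still to be visited
    visited = set()                  # pages we have already requested

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue

        response = requests.get(url, timeout=10)             # 1. Execute HTTP request
        visited.add(url)

        soup = BeautifulSoup(response.text, "html.parser")   # 2. Scrape / parse content
        print(url, "->", soup.title.string if soup.title else "no title")

        for anchor in soup.find_all("a", href=True):         # 3. Reach next links
            link = urljoin(url, anchor["href"])
            if link.startswith("http"):
                to_visit.append(link)                         # 4. Repeat

crawl("https://example.com")   # placeholder starting point
```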

The usual web-crawler starts from a single point within the web (or within a certain website), finds the links that it is interested in visiting (based on its own "Selection Policy") and then moves forward to visit those links. Once the program reaches the "links of interest" (which may even be 100% of the links it finds, depending on the specifics of the crawler), it runs its "processing logic" (scraping / parsing) over each page's content and keeps moving forward to the next links.

So, for instance, a specific crawler may start on the login page of an online dating site, perform the "login" action, and from that point on follow only the links that lead to a "profile page", parse each profile's data and move on to the next one.

Another example could be a web-crawler that starts from the home page of a famous retail store and, without the need to log in, visits the links of the top-selling products to scrape their data.

The applications and sources of data for web-crawlers are pretty much endless, given the "size" of the World Wide Web today.

Crawling Policy

The behavior of a Web Crawler is the outcome of a combination of policies:

  • a "Selection Policy" that states which pages to download / visit
  • a "Re-visit Policy" that states when to check for changes to the pages
  • a "Politeness Policy" that states how to avoid overloading Websites
  • a "Parallelization Policy" that states how to coordinate distributed web-crawlers

"Selection Policy"

More often than not, people write web-crawlers for a specific domain, and this is actually a simple yet representative example of a clear "Selection Policy".

For the sake of this example, let's say that you are writing a crawler for a mainstream media website, one that will scrape the news out of it. A good starting point for your crawler would be the "Home" page, since it's the URL that's least likely to either expire or change, so let's proceed accordingly.

Given that you have the HTML of the "Home" page, how do you pick which pages will be visited next? (Remember that this crawler is for this specific domain.) In this case, the "Selection Policy" will be as simple as figuring out which links on this page belong to this domain, discarding the ones that lead to other domains. Voilà, you've written your own logic that tells your crawler when it should visit a certain page and when it shouldn't.
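
Here is a small sketch of that same-domain check in Python. The helper name and the example URLs are my own illustration, not part of this project's code; it only uses the standard library's urllib.parse.

```python
# Sketch of a same-domain "Selection Policy": keep only the links that stay on
# the domain we started from, discard everything that points elsewhere.
from urllib.parse import urljoin, urlparse

def select_links(base_url, hrefs):
    base_domain = urlparse(base_url).netloc
    selected = []
    for href in hrefs:
        absolute = urljoin(base_url, href)             # resolve relative links
        if urlparse(absolute).netloc == base_domain:   # same domain? keep it
            selected.append(absolute)
    return selected

# Example: only the first two links survive the policy.
links = ["/news/economy",
         "https://news.example.com/sports",
         "https://ads.partner.com/banner"]
print(select_links("https://news.example.com/", links))
```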

"Re-visit Policy"

The Web has a very dynamic nature, and crawling even a fraction of it can take weeks or months. By the time a web crawler has finished its crawl, many events may have happened, including creations, updates and deletions. From a search engine's point of view, there is a cost associated with not detecting an event, and thus holding an outdated copy of a resource. The most-used cost functions are freshness and age. On the other hand, on a more "static" website (one that doesn't change that often), there is almost no cost in not detecting a change, or in not capturing a change right after it happens.
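
For reference, these two cost functions are usually defined along the following lines (the notation below is mine, not from this project):

```latex
% Freshness of page p at time t: 1 if the local copy is up to date, 0 otherwise.
F_p(t) =
\begin{cases}
1 & \text{if } p \text{ is up to date at time } t \\
0 & \text{otherwise}
\end{cases}

% Age of page p at time t: how long the local copy has been outdated.
A_p(t) =
\begin{cases}
0 & \text{if } p \text{ is up to date at time } t \\
t - \text{modification time of } p & \text{otherwise}
\end{cases}
```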

Three simple revisiting policies are: "Uniform", "Proportional" and "No Revisiting".

"Uniform" revisiting means that, regardless of the rate of changes of each page, all of them will be re-visited with the same frequency.

"Proportional" revisiting involves visiting more often the pages that change more frequently. The frequency that each page is visited should be proportional to it's change rating.

"No Revisiting" is self explanatory. No page will be visited twice. Period.

"Politeness Policy"

Crawlers can retrieve data much more quickly and in greater depth than human users, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server will have a hard time keeping up, especially when requests come from multiple crawlers at once. So yes, your crawler may impact someone else's bills / pockets, please keep that in mind.

Keep in mind that for each request the server is "serving" to your crawler, it may be failing to serve a legitimate user / client. Obviously there are studies and good practices that must be applied to your own logic, and implemented by you, that will get you "invited back" (or at least won't get you kicked out of the party), and I am going to cover a few of them in this course.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol, a standard that lets administrators indicate which parts of their web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Commercial search engines like Google, Ask Jeeves, MSN and Yahoo! Search also honor an extra "Crawl-delay:" parameter in the robots.txt file, which indicates the number of seconds to wait between requests.

There are lots of studies on the "ideal" Crawl-Delay that should be used for each site, but in the end, being polite is what really matters. What I use as a starting point is a one-second delay between requests, and if that doesn't prove to be enough, I start increasing the delay by powers of 2.
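
Here is a sketch of both ideas together (honoring robots.txt and backing off by powers of 2), using Python's standard urllib.robotparser. The user agent string and URLs are placeholders, and requests is a third-party package; none of this is the project's own code.

```python
# Sketch: a polite request helper that honors robots.txt and backs off on failure.
import time
import urllib.robotparser

import requests  # third-party, assumed installed

USER_AGENT = "MyLearningCrawler/0.1"  # placeholder user agent

# Read the site's robots.txt once; use its Crawl-delay if one is declared.
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()
crawl_delay = robots.crawl_delay(USER_AGENT) or 1  # default: 1 second between requests

def polite_get(url, max_retries=5):
    if not robots.can_fetch(USER_AGENT, url):
        return None                    # the site asked crawlers to stay away from this path

    delay = crawl_delay
    for _ in range(max_retries):
        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        if response.status_code == 200:
            time.sleep(crawl_delay)    # polite pause before the next request
            return response
        time.sleep(delay)              # server struggling or refusing: back off
        delay *= 2                     # ... by powers of 2
    return None
```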

"Parallelization Policy"

This will be covered in its own chapter: "Scaling up your Crawler".


Main Concepts I hope you grasped

  • Possible actions a "Web-Crawler" can perform
  • Usual "Cycle" (set of actions) of a Crawler
  • The difference between each "Policy"

Hands On

Check the sub-project of this project called "Chapter 1" for a simple demonstration of these concepts. At this point, you can go through the code as you wish, reading the comments and running it.

I wouldn't recommend modifying the code just yet, because the next chapters will hopefully give you a better overview of the "Tooling" and "Frameworks" we are going to use, which will give you a clearer idea of where exactly each piece falls into place.