felipecsl edited this page May 25, 2012 · 41 revisions

Wombat

Getting Started

Wombat is a simple ruby DSL to crawl webpages, built on top of the cool Mechanize and Nokogiri gems. It aims to be a higher-level abstraction for when you don't want to dig into the specifics of fetching a page and parsing it into your own data structure, which can be a decent amount of work, depending on what you need.

With Wombat, you can simply call Wombat.crawl to get started. Alternatively, you can create a class that includes the Wombat::Crawler module and declares which information to retrieve and where to find it. Basically, you can name your properties however you want and specify how to locate each one in the page. For example:

class MyCrawler
  include Wombat::Crawler

  base_url "http://domain_to_be_crawled.com"
  list_page "/path-to/page-with/the-data"
  
  some_data "css=div.elemClass .anchor"
  another_info "xpath=//my/xpath[@style='selector']"
end

Or, equivalently:

Wombat.crawl do
  base_url "http://domain_to_be_crawled.com"
  list_page "/path-to/page-with/the-data"
  
  some_data "css=div.elemClass .anchor"
  another_info "xpath=//my/xpath[@style='selector']"
end

Both examples above are equivalent and produce the same results. You can either use the class-based approach (the first) or simply call Wombat.crawl (the second). Behind the scenes, Wombat.crawl is just a shortcut so you don't need to create a class if that doesn't make sense for your use case.

Let's see what is going on here. First, you create your class and include the Wombat::Crawler module. Simple. Then you start naming the properties you want to retrieve, in a free format. The naming restrictions are the same as for any ruby method, because each property is in fact a ruby method call :)

The lines that say base_url and list_page are reserved words that tell Wombat where to fetch the information from. The base_url field should include only the scheme (http), host and port. The list_page field should include only the path portion of the url to be crawled, always starting with a forward slash.
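As a quick illustration, using the placeholder values from the example above, the two settings conceptually combine into the full URL that gets requested:

```ruby
# base_url carries the scheme, host and (optional) port;
# list_page carries only the path, starting with a forward slash.
base_url  = "http://domain_to_be_crawled.com"
list_page = "/path-to/page-with/the-data"

# Conceptually, Wombat requests their concatenation:
full_url = base_url + list_page
# => "http://domain_to_be_crawled.com/path-to/page-with/the-data"
```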

To run this crawler, just instantiate it and call the crawl method:

my_cool_crawler = MyCrawler.new
my_cool_crawler.crawl

This will request and parse the page according to the filters you specified and return a hash with that same structure. We'll get into the details of that hash later. Another important point is that selectors can be specified as either CSS or XPath: CSS selectors start with css= and XPath selectors start with xpath=. By default, a property returns the text of the first element that matches its selector. This is important: each property returns only the first matching element.
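To make the shape of that hash concrete, here is a sketch of what the crawler above would return; the string values are placeholders, since the real values depend on the page being crawled:

```ruby
# Hypothetical return value of my_cool_crawler.crawl. The keys mirror
# the property names declared in MyCrawler; the values are placeholders.
result = {
  "some_data"    => "text of the first element matching div.elemClass .anchor",
  "another_info" => "text of the first node matching the XPath selector"
}

# By default each property holds a single String (the first match):
result["some_data"].class # => String
```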

If you want to retrieve all the matching elements instead, use the :list option.
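A hedged sketch of how that looks (running the crawl itself would require the wombat gem and a reachable page, so the DSL usage is shown in comments): :list is passed as a second argument to the property, and the property then yields an array of all matching texts instead of a single string.

```ruby
# DSL usage (commented out because it needs the wombat gem and a live site):
#
#   Wombat.crawl do
#     base_url "http://domain_to_be_crawled.com"
#     list_page "/path-to/page-with/the-data"
#
#     some_data "css=div.elemClass .anchor", :list
#   end
#
# Assumed shape of the result for a page with three matching anchors
# (the anchor texts below are placeholders):
result = { "some_data" => ["first anchor", "second anchor", "third anchor"] }

# With :list, the property holds an Array of all matches:
result["some_data"].length # => 3
```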
