LazyScraper

LazyScraper is the easy way to define lazy entity-oriented Web scrapers.

Note: This is only a proof-of-concept.

Usage

Let’s say we want to fetch some reviews from the FooBar website (which doesn’t have a public API). Reviews are located at '/review?product_id=something' (we’ll leave out the domain part here).

We start by creating a class which inherits from LazyScraper::Entity:

class FooBarReview < LazyScraper::Entity
end

Then we’ll add some hooks. A hook maps a set of attributes to a URL with a parser. This ensures that a webpage is fetched and parsed only once, and only when one of its attributes is first needed. Here, we’ll assume that each review has a product id we know, a product name, a score, and a text. They are all located on the same page, but LazyScraper also supports hooks on multiple URLs.

class FooBarReview < LazyScraper::Entity
  attr_hook '/review?product_id=:product_id',
    :product_name, :score, :text do |doc, attrs|

    attrs[:product_name] = doc.css('#product .name').text
    attrs[:score]        = doc.css('#score').text.to_i
    attrs[:text]         = doc.css('#text').text
  end
end

Here, attr_hook takes the path to the page, with a :product_id placeholder, which will later be replaced by the actual product_id of a review. Then, we give it the list of attributes which depend on this webpage. This way, the page will be fetched and parsed only the first time we access one of these attributes. The last argument is a block which takes a Nokogiri document and a hash of attributes for the block to populate.
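To make the placeholder step concrete, here is a minimal sketch of how a :product_id placeholder could be substituted into a path. The expand_path helper is hypothetical and is not part of LazyScraper's documented API; it only illustrates the idea, and the library may do this differently.

```ruby
# Hypothetical helper (an assumption for illustration, not LazyScraper's API).
# Replaces each :name placeholder in a path template with the matching
# value from the given hash.
def expand_path(template, values)
  template.gsub(/:([a-z_]+)/) do
    key = Regexp.last_match(1).to_sym
    # Hash#fetch raises KeyError if a placeholder has no value.
    values.fetch(key).to_s
  end
end

expand_path('/review?product_id=:product_id', :product_id => 42)
# => "/review?product_id=42"
```

A real implementation would probably also URL-encode the substituted values; that is left out here to keep the sketch short.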

That’s all, we can now try our class:

# note how we’re giving the product id
lazy_review = FooBarReview.new :product_id => 42

# we haven’t fetched the page yet

lazy_review.text  # this fetches the page and returns the text
lazy_review.score # this returns the score without fetching the page again
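
The fetch-once behaviour shown above can be modelled in a few lines of plain Ruby. The sketch below is a toy model, not LazyScraper's actual implementation: SketchEntity, SketchReview, the fetcher lambda, and the '/product?id=:product_id' path are all assumptions made for illustration, and a plain hash stands in for the Nokogiri document so that no network access is needed.

```ruby
# Toy model of lazy hooks; NOT LazyScraper's actual implementation.
# A "fetcher" callable stands in for the HTTP request and HTML parsing.
class SketchEntity
  # Declares a hook: a path template, the attributes it provides,
  # and a parser block that fills them in.
  def self.attr_hook(path, *names, &parser)
    hook = { :path => path, :parser => parser }
    names.each do |name|
      # Reading an attribute runs its hook first if the value is missing.
      define_method(name) do
        run_hook(hook) unless @attrs.key?(name)
        @attrs[name]
      end
    end
  end

  attr_reader :fetch_count

  def initialize(attrs, fetcher)
    @attrs = attrs.dup
    @fetcher = fetcher
    @fetch_count = 0
  end

  private

  def run_hook(hook)
    # Substitute :placeholders with the attribute values we already know.
    url = hook[:path].gsub(/:([a-z_]+)/) do
      @attrs.fetch(Regexp.last_match(1).to_sym).to_s
    end
    @fetch_count += 1
    hook[:parser].call(@fetcher.call(url), @attrs)
  end
end

class SketchReview < SketchEntity
  attr_hook '/review?product_id=:product_id', :score, :text do |doc, attrs|
    attrs[:score] = doc[:score]
    attrs[:text]  = doc[:text]
  end

  # A second hook on a different (hypothetical) URL.
  attr_hook '/product?id=:product_id', :product_name do |doc, attrs|
    attrs[:product_name] = doc[:name]
  end
end

# Fake fetcher: records each requested URL and returns a canned "document".
requested = []
fetcher = lambda do |url|
  requested << url
  { :score => 4, :text => 'Great', :name => 'Widget' }
end

review = SketchReview.new({ :product_id => 42 }, fetcher)
review.fetch_count  # => 0, nothing fetched yet
review.text         # first access runs the review hook
review.score        # already populated, no second fetch
review.fetch_count  # => 1
review.product_name # runs the second hook on its own URL
review.fetch_count  # => 2
```

Each hook runs at most once per instance because its parser stores every attribute it provides in the same hash that the readers check before fetching.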

Requirements

Ruby 2.x
