gambl/dscrape

 
 


DScrape: Declarative Web Scraping

Usage:
    dscrape <CTS File> <URL>

Optional Arguments:
    --format fmt    Specify an output format for data. Valid formats are:
                      json
                      pretty (default)

    --verbose       Include helpful status and debugging messages in output.
                    Without this flag, the output contains only the scraped
                    data, so you can redirect it straight into a file.
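
To make the option semantics above concrete, here is a minimal sketch in plain Node.js of how the two flags could be interpreted. It is not DScrape's actual source: parseArgs and render are hypothetical names, the hand-rolled argument loop stands in for the Optimist module, and indented JSON stands in for the Pretty JSON output.

```javascript
// Hypothetical sketch only; DScrape itself relies on Optimist and Pretty JSON.

// Parse "<CTS File> <URL> [--format fmt] [--verbose]" from an argument array.
function parseArgs(argv) {
  const opts = { format: 'pretty', verbose: false, positional: [] };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--format') {
      opts.format = argv[++i]; // the value follows the flag
    } else if (arg === '--verbose') {
      opts.verbose = true;
    } else {
      opts.positional.push(arg); // CTS file, then URL
    }
  }
  if (opts.format !== 'json' && opts.format !== 'pretty') {
    throw new Error('Unknown format: ' + opts.format);
  }
  return opts;
}

// Serialize scraped data. 'json' is compact; 'pretty' is approximated here
// with indented JSON (an assumption, not the real Pretty JSON layout).
function render(data, format) {
  return format === 'json'
    ? JSON.stringify(data)
    : JSON.stringify(data, null, 2);
}
```

With verbose left off, only the result of render would be written to standard output, which is what makes redirecting the output to a file safe.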

Development Setup

You'll need the following tools to run DScrape:

  • Node.js

    Installation instructions at http://nodejs.org/

  • JSDOM (node module), which simulates the DOM API within a Node.js process.

    On a Mac: sudo npm install -g jsdom

  • Contextify (node module)

    NOTE: This is essential for JSDOM to work correctly. JSDOM will suffer run-time errors without this module.

    NOTE 2: Installing it on a Mac requires Xcode.

    On a Mac: sudo npm install -g contextify

  • Pretty JSON (node module), which enables pretty-print of JSON data

    On a Mac: sudo npm install -g prettyjson

  • Optimist (node module), for options parsing

    On a Mac: sudo npm install -g optimist

And the following tools to develop with DScrape:

  • CoffeeScript (node module), whose coffee command compiles the source (see Building the Project)

Building the Project

From the project root, type:

coffee --compile --output lib/ src/

This will create lib/dscrape.js for you, using src/dscrape.coffee as source. The bin/dscrape executable relies on this library.

Included Examples

Run these examples from the project root.

  • Reddit

    ./bin/dscrape examples/reddit.cts http://www.reddit.com

Todo

Several TODOs for this project exist:

  • Enable output in multiple serialization formats (e.g. JSON, YAML), toggled via the command line.
  • Identify (and resolve) issues with JSDOM that cause it to behave differently from the browser
