Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents. Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords. Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
All tests pass on Ruby 1.8.7 (MRI) and Ruby 1.9.1-p378 (MRI).
A basic example of extracting basic metadata from a Web page:
require 'pismo'
# Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
doc.author # => "Peter Cooper"
doc.lede # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:
Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
The current metadata methods are #title, #titles, #author, #authors, #lede, #keywords, #sentences(qty), #body, #html_body, #feed, #feeds, #favicon, #description and #datetime. These are not fully documented here yet, you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
#html_body and #body will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" (like Arc90's Readability or Safari Reader) algorithm. #body returns it as plain-text, #html_body maintains some basic HTML styling.
There are some shortcomings or problems that I'm aware of and am going to pursue:
- I do not know how Pismo fares on JRuby, Rubinius, or others yet.
- The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction.
- The author name extraction is quite poor.
- The image extraction only handles images with absolute URLs.
- The stopword list leaves a bit to be desired. It errs on the side of being too long rather than too short, though (1024 words long!)
A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.
./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
---
:url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
:title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
:lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
:author: Peter Cooper
:datetime: 2010-01-07 12:00:00 +00:00
If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded and assigned to both the constant 'P' and the variable @p.
You can access Pismo's stopword list directly:
Pismo.stopwords # => [.., .., ..]
- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don't break it in a future version unintentionally.
- Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler (which is awesome, btw).
- Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
Apache 2.0 License - See LICENSE for details. Copyright (c) 2009, 2010 Peter Cooper
In short, you can use Pismo for whatever you like commercial or not, but please include a brief credit (as in the NOTICE file - as per the Apache 2.0 License) somewhere deep in your license file or similar, and, if you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.