Statement

Statement parses RSS feeds and HTML pages containing press releases and other official statements from members of Congress, and produces hashes with information about those pages. It has been tested under Ruby 1.9.3, 2.x, and 3.x.

What's New in Version 2.3

🎉 Major improvements for reliability and maintainability!

  • Error handling: No more crashes! Scrapers log errors and continue running
  • Command-line interface: New statement CLI for easy scraper execution
  • Performance testing: Automated testing suite with detailed performance reports
  • Comprehensive logging: Track what's happening with configurable log levels
  • Better organization: Modular code structure for easier maintenance

See CHANGELOG.md for full details and MAINTAINER_GUIDE.md for documentation.

Coverage

Statement currently parses press releases for members of the House and Senate. For members with RSS feeds, you can pass the feed URL into Statement. For members without RSS feeds (or with broken ones), HTML scrapers are provided, as are methods for special groups, such as House Republicans.

Current status: 185 scrapers (129 member, 56 committee). Many congressional websites now block automated access, but the gem handles these gracefully. See PERFORMANCE_REPORT.txt for current scraper health.

Suggestions and contributions are welcome!

Installation

Add this line to your application's Gemfile:

gem 'statement'

And then execute:

$ bundle

Or install it yourself as:

$ gem install statement

Quick Start

Command Line Interface (NEW in 2.3!)

Run scrapers directly from the command line:

# Run all member scrapers for current Congress (119th)
$ statement --type members

# Run all committee scrapers
$ statement --type committees

# Run a specific scraper
$ statement --scraper shaheen

# Save results to file with performance metrics
$ statement --type members --output results.json --performance

# List all available scrapers
$ statement --list-scrapers

# Get help
$ statement --help

See statement --help for all options including output formats (JSON, CSV, pretty print), log levels, and more.

Testing Scrapers

Test all scrapers and generate a performance report:

$ ruby -I lib bin/test_scrapers

This will test all 185 scrapers and generate a comprehensive report showing which are working, broken, or slow.

Usage

Statement provides access to press releases, Facebook status updates and tweets from members of Congress. Most congressional offices have RSS feeds but some require HTML scraping.

To configure Statement to pull from the Twitter and Facebook APIs, you can pass in configuration values via a hash or a config.yml file:

require 'rubygems'
require 'statement'
Statement.configure(:oauth_token => token, :oauth_token_secret => secret, ...) # option 1
Statement.configure_with("config.yml") # option 2

If you don't use the Twitter or Facebook APIs, you don't need to set up any configuration.
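As a sketch of what a config.yml might contain, here is a minimal file combining the key names used elsewhere in this README (oauth_token/oauth_token_secret for Twitter, app_id/app_secret for Facebook); the exact set of supported keys is an assumption:

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical config.yml contents; key names follow the configure
# call above and the Facebook section below.
config_yaml = <<~YAML
  oauth_token: "your-oauth-token"
  oauth_token_secret: "your-oauth-token-secret"
  app_id: "your-facebook-app-id"
  app_secret: "your-facebook-app-secret"
YAML

# Write it to a temp file and load it back, as configure_with would.
file = Tempfile.new(['config', '.yml'])
file.write(config_yaml)
file.close

config = YAML.load_file(file.path)
puts config['oauth_token']  # => "your-oauth-token"
```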

Press Releases

To parse an RSS feed, simply pass the URL to Statement's Feed class:

require 'rubygems'
require 'statement'

results = Statement::Feed.from_rss('http://blumenauer.house.gov/index.php?option=com_bca-rss-syndicator&feed_id=1')
puts results.first
{:source=>"http://blumenauer.house.gov/index.php?option=com_bca-rss-syndicator&feed_id=1", :url=>"http://blumenauer.house.gov/index.php?option=com_content&amp;view=article&amp;id=2203:blumenauer-qwe-need-a-national-system-that-speaks-to-the-transportation-challenges-of-todayq&amp;catid=66:2013-press-releases", :title=>"Blumenauer: &quot;We need a national system that speaks to the transportation challenges of ...", :date=>#<Date: 2013-04-24 ((2456407j,0s,0n),+0s,2299161j)>, :domain=>"blumenauer.house.gov"}

Statement will try to parse a date if an RSS feed contains a pubDate element; if not, the :date attribute will be nil.
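RSS pubDate values are RFC 2822 strings, which Ruby's standard library can parse directly. This standalone sketch shows the kind of parsing involved; it is an illustration, not Statement's actual implementation:

```ruby
require 'time'
require 'date'

pub_date = "Wed, 24 Apr 2013 00:00:00 GMT"

# Time.rfc2822 handles the standard RSS pubDate format;
# Date.parse is a looser fallback for nonstandard feeds.
parsed = begin
  Time.rfc2822(pub_date).to_date
rescue ArgumentError
  Date.parse(pub_date) rescue nil
end

puts parsed  # => 2013-04-24
```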

If you have a batch of RSS URLs, you can pass them to Feed's batch class method, which uses Typhoeus to fetch them in parallel and returns a two-element array of results and failed URLs:

urls = ['http://aderholt.house.gov/common/rss//index.cfm?rss=20', 'http://andrews.house.gov/rss.xml', "http://alexander.house.gov/common/rss/?rss=24", "http://amash.house.gov/rss.xml"]
results, failures = Statement::Feed.batch(urls)

The sites that require HTML scraping are detailed in individual methods, and can be called individually or in bulk:

results = Statement::Scraper.billnelson
members = Statement::Scraper.member_scrapers

Facebook Updates

Using the koala gem, Statement can fetch Facebook status feeds, given a Facebook ID. You'll need to either set environment variables APP_ID and APP_SECRET or create a config.yml file containing app_id and app_secret keys and values.

f = Statement::Facebook.new
results = f.feed('RepFincherTN08')

It can also process IDs in batches: pass an array of IDs and a slice argument indicating how many IDs to include in each batch:

f = Statement::Facebook.new
results = f.batch(facebook_ids, 10)
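The slice argument behaves like Ruby's Enumerable#each_slice. This self-contained sketch shows how IDs would be grouped (the IDs are made up, and whether Statement uses each_slice internally is an assumption):

```ruby
# 23 hypothetical Facebook page IDs.
facebook_ids = (1..23).map { |n| "page_#{n}" }

# Group IDs into batches of 10, as f.batch(facebook_ids, 10) would.
batches = facebook_ids.each_slice(10).to_a

puts batches.size       # => 3
puts batches.last.size  # => 3
```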

In all cases Statement strips out posts that are not by the given ID, and returns hashes containing attributes from the feed:

{:id=>"9307301412_10151632750071413", :body=>"This is Gold Star Mother Larraine McGee whose son, Christopher Everett, Army National Guard, was killed in action September 2005. Precious family.", :link=>"http://www.facebook.com/photo.php?fbid=10151632750021413&set=a.118418671412.133511.9307301412&type=1&relevant_count=1", :title=>nil, :type=>"photo", :status_type=>"added_photos", :created_time=>#<DateTime: 2013-05-28T14:49:08+00:00 ((2456441j,53348s,0n),+0s,2299161j)>, :updated_time=>#<DateTime: 2013-05-28T17:41:37+00:00 ((2456441j,63697s,0n),+0s,2299161j)>, :facebook_id=>"9307301412"}
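The filtering described above amounts to keeping only posts whose author matches the requested ID. A minimal sketch with hypothetical feed items:

```ruby
facebook_id = "9307301412"

# Hypothetical feed items; only posts authored by the target ID are kept.
posts = [
  { id: "9307301412_1", facebook_id: "9307301412", body: "Official statement" },
  { id: "555_2",        facebook_id: "555",        body: "Post by someone else" }
]

own_posts = posts.select { |p| p[:facebook_id] == facebook_id }
puts own_posts.size  # => 1
```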

Tweets

Using the twitter gem, Statement can retrieve individual user timelines or list timelines:

t = Statement::Tweets.new
t.timeline('Robert_Aderholt')
[{:id=>344168849484169216, :body=>"Check out the @GOPLeader's weekly schedule for the House this week. http://t.co/mh3FZnK4a8", :link=>"http://majorityleader.gov/floor/weekly.html", :in_reply_to_screen_name=>nil, :total_tweets=>699, :created_time=>2013-06-10 15:07:02 -0400, :retweets=>0, :favorites=>0, :screen_name=>"Robert_Aderholt"}...]

Note that the created_time attribute is a Ruby Time object, as returned by the twitter gem.

To retrieve a list's timeline, pass in the list slug and the owner (defaults to nil):

t = Statement::Tweets.new
t.bulk_timeline('congress')
[{:id=>343541632844587008, :body=>"On the Skagit river getting a close-up view of the bridge repairs. http://t.co/SMsdwiFaR6", :link=>nil, :in_reply_to_screen_name=>nil, :total_tweets=>226, :created_time=>2013-06-08 21:34:42 -0400, :retweets=>1, :favorites=>2, :screen_name=>"RepDelBene"}...]

Logging and Error Handling (NEW in 2.3!)

Statement now includes comprehensive error handling and logging:

# Configure logging level (default: INFO)
Statement::Logger.setup(log_level: Logger::DEBUG)

# Scrapers automatically log errors instead of crashing
results = Statement::Scraper.member_scrapers
# Even if some scrapers fail, you'll get results from successful ones

# Errors are logged with full context
# [2024-11-17 10:30:45] ERROR: HTTP error fetching https://example.gov: 403 Forbidden
# [2024-11-17 10:30:45] INFO: Retrying https://example.gov (attempt 1/2)

All scrapers now:

  • Return empty arrays instead of raising exceptions on errors
  • Automatically retry failed requests (up to 2 times with exponential backoff)
  • Log all errors with timestamps and details
  • Include timeout protection (30 seconds)
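The retry-and-degrade behavior above can be approximated in plain Ruby. This is a hedged sketch of the pattern (up to 2 retries with exponential backoff, empty array on final failure), not Statement's actual internals:

```ruby
# Sketch: retry a block up to max_retries times, doubling the delay
# each attempt; return [] instead of raising if all attempts fail.
def fetch_with_retry(max_retries: 2, base_delay: 0.01)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    if attempts <= max_retries
      warn "Retrying (attempt #{attempts}/#{max_retries}): #{e.message}"
      sleep(base_delay * (2**(attempts - 1)))  # exponential backoff
      retry
    end
    []  # give up: return an empty array instead of raising
  end
end

# Simulated scraper that fails twice, then succeeds on the third attempt.
calls = 0
result = fetch_with_retry do |attempt|
  calls += 1
  raise "HTTP 403 Forbidden" if attempt < 3
  [{ title: "Example release" }]
end

puts calls        # => 3
puts result.size  # => 1
```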

Tests

Statement uses MiniTest. To run the tests:

$ rake test

For comprehensive scraper testing:

$ ruby -I lib bin/test_scrapers --output=report.txt

Contributing

Statement would not be nearly the library it is without our contributors, and we sincerely thank them for their generosity and interest in making congressional press release data more available.

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

If you write a new scraper, please use Nokogiri for parsing - see some of the existing examples for guidance. The domain attribute represents the URI base domain of the source site.
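The domain attribute can be derived with Ruby's stdlib URI; a standalone sketch (whether Statement computes it exactly this way is an assumption):

```ruby
require 'uri'

url = "http://blumenauer.house.gov/index.php?option=com_content&id=2203"

# The :domain attribute is the host portion of the source URL.
domain = URI.parse(url).host
puts domain  # => "blumenauer.house.gov"
```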
