Bodacc is a scraper application that collects every Bodacc announcement (2008 to present) from the DILA website and stores it in a PostgreSQL database.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Make sure you have Ruby >= 2.3.4 installed.
Clone the repository and install all the necessary gems by running the following command:
$ bundle install
The repository includes a dump of the empty database called "structure.sql". Create the database (if it does not already exist) and load the dump into it by running the following commands:
$ createdb bodacc
$ psql bodacc < structure.sql
The database is called "bodacc" and is organized as follows:
bodacc
├── bilans
├── immatriculations
├── modifications
├── pcls
└── radiations
The scraper is a single Ruby file. Just launch it with Ruby, for example:
$ DATABASE_URL=postgres://localhost:5432/bodacc ruby main.rb
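To illustrate what the DATABASE_URL value carries, the sketch below breaks it down with Ruby's standard `uri` library. The helper name `database_settings` is hypothetical and is not part of main.rb; how the scraper itself reads the variable may differ.

```ruby
require "uri"

# Hypothetical helper (not part of main.rb): split a PostgreSQL
# connection URL into its individual parts.
def database_settings(url)
  uri = URI.parse(url)
  {
    host: uri.host,
    port: uri.port,
    dbname: uri.path.sub(%r{\A/}, ""), # path is "/bodacc"; drop the leading slash
    user: uri.user                     # nil when the URL has no credentials
  }
end

settings = database_settings("postgres://localhost:5432/bodacc")
puts settings.inspect
```

If your PostgreSQL server requires credentials, they would go in the URL as well, for example postgres://user:password@localhost:5432/bodacc.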
The first time you use the scraper, run this command:
$ DATABASE_URL=postgres://localhost:5432/bodacc ruby main.rb
It will download every Bodacc announcement from 2008 to now. After that, if you launch the same command again, it will only download announcements that were posted after the most recent created_at datetime in the database. To download just a specific year, pass the year as an argument:
$ DATABASE_URL=postgres://localhost:5432/bodacc ruby main.rb 2015
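The three behaviours described above (full download, incremental update, single year) can be sketched as follows. This is a minimal illustration and not the actual logic of main.rb: the method name `years_to_download` and its parameters are hypothetical, and it assumes an incremental run restarts from the year of the last created_at.

```ruby
FIRST_YEAR = 2008 # first year of announcements, per this README

# Hypothetical helper: decide which years a run should cover.
# year_arg        - optional command-line argument, e.g. "2015"
# last_created_at - Time of the most recent stored announcement, or nil
# today           - current time (injectable so the method is testable)
def years_to_download(year_arg, last_created_at, today = Time.now)
  return [Integer(year_arg)] if year_arg # a specific year was requested

  # Assumption: resume from the year of the last stored announcement.
  start = last_created_at ? last_created_at.year : FIRST_YEAR
  (start..today.year).to_a
end

p years_to_download("2015", nil, Time.new(2020, 6, 1))               # [2015]
p years_to_download(nil, nil, Time.new(2010, 6, 1))                  # [2008, 2009, 2010]
p years_to_download(nil, Time.new(2019, 3, 4), Time.new(2020, 6, 1)) # [2019, 2020]
```

Within the resumed year, announcements already stored would still need to be filtered out by comparing against created_at, which this sketch leaves out.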
Bodacc uses the Nokogiri and Mechanize gems to scrape the website and download every file. After unzipping the files, the script inserts their contents into the bodacc database.
If you are using this scraper for the first time, be aware that inserting everything from 2008 up to the previous year will take a long time (you'll have time to watch the Star Wars saga with all the bonuses ... twice). The files weigh about 300 MB and contain just over 20 million announcements in total.
- Castres Maxime - Initial work - Mcastres