The scraper extracts information on Austrian development projects since 2010 from the Austrian Development Agency (ADA) website. The automatically extracted information is stored in CSV and JSON files to make further use as easy as possible.
This repository provides the code and documentation and keeps track of bugs as well as feature requests.
- Data Source
- Team: Gute Taten für gute Daten project of Open Knowledge Austria
- Status: Production
- Documentation: English
- Licenses:
- Content: Creative Commons Attribution 4.0
- Software: MIT License
Used software
The source code is written in Python 2. It was created using IPython, BeautifulSoup4 and urllib2.
Description
The scraper fetches the HTML of the overview pages with the project tables, stores it locally and parses out the data with BeautifulSoup4. It then downloads every aid project page and parses out the description. Finally, the data is stored as JSON and CSV files for easy later use.
Run scraper
Go to the root folder of this repository and execute the following commands in your terminal:
cd code
python aid-scraper.py
Original source code
Thanks to Christian Goebel for the original source code, which was used for the final version.
Configure the Scraper
There are two global variables in aid-scraper.py you may want to adapt to your needs.
- DELAY_TIME: To avoid overloading the server or getting blocked because of too many requests, set the delay between fetches to 1-5 seconds, not less.
- TS: The timestamp as a string. It can be set to the timestamp of the last download, so that already downloaded data can be reused instead of being fetched again on every run. On the first run, you can set the value to datetime.now().strftime('%Y-%m-%d-%H-%M'), so that it is the timestamp of when the scraper starts.
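A minimal sketch of how these two settings might be used, written for Python 3. Only DELAY_TIME and TS come from aid-scraper.py; the helpers `raw_filename` and `fetch_politely` and the file-name pattern are hypothetical and only illustrate the idea.

```python
import time
from datetime import datetime

DELAY_TIME = 2  # seconds to wait between two requests (1-5 recommended)
TS = datetime.now().strftime('%Y-%m-%d-%H-%M')  # or a fixed string from an earlier run


def raw_filename(kind, index, ts=TS):
    """Tie a raw HTML file to one download run (hypothetical naming pattern)."""
    return '{}_{}_{}.html'.format(ts, kind, index)


def fetch_politely(urls, fetch, delay=DELAY_TIME, sleep=time.sleep):
    """Call fetch(url) for every URL and pause `delay` seconds between requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)  # no pause before the very first request
        results.append(fetch(url))
    return results
```

Setting TS to a fixed string such as the timestamp of an earlier run makes `raw_filename` point at the files that run already stored, so nothing has to be downloaded again.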
Download raw html
Here all the raw HTML gets downloaded and stored locally, and the basic data gets parsed.
- Download all overview pages with the tables (HTML). The fetching steps through all overview pages by checking for the existence of the "weiter" ("next") anchor and counting up a URL variable.
- Open the downloaded files.
- Parse out the basic information about each project from the overview tables. This is necessary here, because the download of the project page needs the link from the overview table.
- Store the parsed data as JSON file.
- Download all project pages (html).
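The paging logic from the first step can be sketched as follows. This is not the code from aid-scraper.py, which uses BeautifulSoup4 and urllib2. To stay dependency-free, the sketch uses the Python 3 standard library `html.parser` instead, and the actual download is passed in as a `fetch_page` callable. The names `NextLinkFinder`, `has_next_page`, `fetch_overview_pages` and `max_pages`, as well as the page counter starting at 0, are assumptions.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Sets .found to True if the text of any <a> element contains 'weiter'."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor and 'weiter' in data.lower():
            self.found = True


def has_next_page(html):
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.found


def fetch_overview_pages(fetch_page, max_pages=1000):
    """Fetch page 0, 1, 2, ... until a page has no 'weiter' anchor."""
    pages = []
    page_no = 0
    while page_no < max_pages:  # safety limit against endless paging
        html = fetch_page(page_no)
        pages.append(html)
        if not has_next_page(html):
            break
        page_no += 1
    return pages
```

In a real run, `fetch_page` would build the overview URL from the page number, download it and wait DELAY_TIME seconds before returning.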
Parse html
Here the description of the project gets added to the data.
- Open the JSON data.
- Open the project-pages files (html).
- Parse out the additional description information from the project pages.
- Store updated data as JSON file.
Export as CSV
Here the data gets exported as a CSV file.
- Open the data (JSON).
- Save the serialized data as CSV file.
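The export step could look roughly like the following Python 3 sketch. The column order follows the CSV description in this README. The function names `clean` and `export_csv` are hypothetical, and collapsing all line breaks into single spaces is only one possible way to keep carriage return characters from breaking rows, not necessarily what aid-scraper.py does. The description text in the sample record is a placeholder.

```python
import csv
import io
import json

COLUMNS = ['contract-number', 'contract-title', 'OEZA-ADA-contract-volume',
           'contract-partner', 'country-region', 'description', 'url']


def clean(value):
    """Collapse line breaks and repeated whitespace so a project stays on one row."""
    return ' '.join(str(value).split())


def export_csv(records, fileobj):
    """Write one header row and one row per project."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS, lineterminator='\n')
    writer.writeheader()
    for record in records:
        writer.writerow({column: clean(record.get(column, '')) for column in COLUMNS})


# Placeholder record: only contract number and country are taken from the README.
records = json.loads('[{"contract-number": "2325-02/2016", '
                     '"country-region": "Guatemala", '
                     '"description": "line one\\r\\nline two"}]')
buffer = io.StringIO()
export_csv(records, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of `io.StringIO`, open it with `newline=''` as recommended by the `csv` module documentation.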
The original data comes from the project list of the Austrian Development Agency (ADA) published on its website. The data consists of all contracts approved since January 1st, 2010, listed in chronologically descending order. The date of the last update can be found on the first table page as "Datum der letzten Aktualisierung" (date of the last update).
The tables are the basic data source, from which most of the data is parsed. The data is published in the following structure (e.g. the first project).
Vertragsnummer | Vertragstitel | Land/Region | OEZA/ADA-Vertragssumme | Vertragspartner |
---|---|---|---|---|
2325-02/2016 | Programm zum Schutz der MenschenrechtsverteidigerInnen in der westlichen Region Guatemalas | Guatemala | EUR 64.300,00 | HORIZONT3000 - Österreichische Organisation für Entwicklungszusammena |
Attributes
- Vertragsnummer: contract number of the project.
- Vertragstitel: title of the project.
- Land/Region: country or region where the project takes place.
- OEZA/ADA-Vertragssumme: amount of money granted by the contract.
- Vertragspartner: partner(s) in the project.
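The contract sum is published in German number notation, as in "EUR 64.300,00" in the example row above. For calculations it has to be converted first. The following Python 3 sketch shows one way to do that; the function name `parse_contract_volume` is hypothetical, and it assumes every value has the shape "currency code, space, number".

```python
from decimal import Decimal


def parse_contract_volume(text):
    """Split a value like 'EUR 64.300,00' into a currency code and a Decimal amount."""
    currency, _, number = text.strip().partition(' ')
    # German notation: '.' groups thousands, ',' marks the decimals.
    amount = Decimal(number.replace('.', '').replace(',', '.'))
    return currency, amount


print(parse_contract_volume('EUR 64.300,00'))
```

`Decimal` is used instead of `float` so that amounts of money are not affected by binary rounding.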
When you click on the contract title in a table, you get to the project page. It contains the same data as the table view, plus the additional description text (named "Beschreibung").
So far, we cannot say anything about the data quality (completeness, accuracy, etc.), but so far there are also no reasons to doubt it.
Data errors found
- Land/Region missing
- Land/Region and Vertragsnummer missing
- Vertragsnummer missing
- Vertragssumme is in Partner field, Partner is missing
- Vertragssumme is missing
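Errors of this kind can be flagged automatically once the data is parsed. The following Python 3 sketch checks a single record that uses the keys of the JSON output. The function name `find_problems`, the list of required fields and the regular expression for recognising a contract sum are assumptions, not part of aid-scraper.py.

```python
import re

REQUIRED = ['contract-number', 'country-region',
            'OEZA-ADA-contract-volume', 'contract-partner']
# Assumed shape of a contract sum, e.g. 'EUR 64.300,00'.
AMOUNT = re.compile(r'^[A-Z]{3} [\d.]+,\d{2}$')


def find_problems(record):
    """Return a list of human-readable problems found in one project record."""
    problems = ['{} missing'.format(field) for field in REQUIRED
                if not (record.get(field) or '').strip()]
    if AMOUNT.match((record.get('contract-partner') or '').strip()):
        problems.append('contract volume found in contract-partner field')
    return problems


record = {'contract-number': '2325-02/2016', 'country-region': '',
          'OEZA-ADA-contract-volume': 'EUR 64.300,00',
          'contract-partner': 'HORIZONT3000'}
print(find_problems(record))
```

A record without problems yields an empty list, so the check can be run over the whole data set and only the non-empty results need to be reviewed by hand.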
raw html
The scraper downloads all raw html of each table and each project page.
aid data JSON
The parsed data is stored in an easy-to-read JSON file for further usage.
[
{
'contract-number': contract number of the project
'contract-title': title of the project
'country-region': country and/or region, where the project takes place
'OEZA-ADA-contract-volume': amount of funding by the Austrian Development Agency
'contract-partner': partner organisation(s)
'description': description text of the project
'url': url of the project page
},
]
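As a small usage example, the JSON file can be loaded with the standard library and analysed directly. The Python 3 sketch below counts projects per country or region. The function name `projects_per_country` and the label 'unknown' for records without a country are assumptions; the sample string stands in for the content of the real file.

```python
import json
from collections import Counter


def projects_per_country(json_text):
    """Count projects per 'country-region'; empty or missing values count as 'unknown'."""
    records = json.loads(json_text)
    return Counter(record.get('country-region') or 'unknown' for record in records)


# Stand-in for the file content; real records carry all seven keys.
sample = '[{"country-region": "Guatemala"}, {"country-region": ""}]'
print(projects_per_country(sample))
```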
aid data csv
The parsed data is stored in a human-readable CSV file for further usage.
columns (see attribute description above):
- contract-number
- contract-title
- OEZA-ADA-contract-volume
- contract-partner
- country-region
- description
- url
rows: one project per row.
In the spirit of free software, everyone is encouraged to help improve this project.
Here are some ways you can contribute:
- by reporting bugs
- by suggesting new features
- by translating to a new language
- by writing or editing documentation
- by analyzing the data
- by visualizing the data
- by writing code (no pull request is too small: fix typos in the user interface, add code comments, clean up inconsistent whitespace)
- by refactoring code
- by closing issues
- by reviewing pull requests
- by enriching the data with other data sources
When you are ready, submit a pull request.
We use the GitHub issue tracker to track bugs and features. Before submitting a bug report or feature request, check to make sure it hasn't already been submitted. When submitting a bug report, please try to provide a screenshot that demonstrates the problem.
All content is openly licensed under the Creative Commons Attribution 4.0 license, unless otherwise stated.
All source code is free software: you can redistribute it and/or modify it under the terms of the MIT License.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Visit http://opensource.org/licenses/MIT to learn more about the MIT License.
Aid
Documentation
- Österreichische Entwicklungs Zusammenarbeit
- Entwicklungszusammenarbeitsgesetz inklusive EZA-Gesetz-Novelle 2003
- ODA Bericht 2014
- ODA Bericht 2013
- ODA Bericht 2012
- ODA Bericht 2011
- ODA Bericht 2010
- README.md: Overview of repository
- code/aid-scraper.py: scraper
- CHANGELOG.md
- LICENSE
See the whole history. The current version is described next.
extended scraper
- aid-scraper.py: fixed the CSV output bug caused by carriage return characters.
- README.md: updated, added a description of the scraper.