This repository contains a collection of Python scripts for scraping, processing, and analyzing proceedings from the Dutch Parliament (Handelingen). The scraper collects data from the official government website officielebekendmakingen.nl.
The scripts are organized into three main phases:
- Data Collection - Scraping proceedings links and downloading XML files
- Metadata Collection - Gathering metadata for each proceeding
- Processing - Parsing XML files and extracting structured data into CSV files
- Python 3.7+
- Required packages: requests, lxml, pandas, cssselect
Install dependencies:
pip install requests lxml pandas cssselect
Most files define a variable named vergaderjaren (or something similar) that holds a list of parliamentary years, for example ["2019-2020", "2020-2021"]. Adjust it to scrape other years.
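As a rough illustration of how such a year list might drive the rest of a script, here is a minimal sketch. The variable name and example values come from the text above; the helpers `year_dirs` and `is_valid_year` are hypothetical and not taken from the repository.

```python
from pathlib import Path

# Name and example values as described above; edit this list to scrape other years.
vergaderjaren = ["2019-2020", "2020-2021"]


def is_valid_year(year):
    """Check that a label looks like "2019-2020": two consecutive years (hypothetical helper)."""
    start, sep, end = year.partition("-")
    return (
        sep == "-"
        and start.isdigit()
        and end.isdigit()
        and int(end) == int(start) + 1
    )


def year_dirs(years, base="data/handelingen"):
    """Return one download directory per parliamentary year (hypothetical helper)."""
    return [Path(base) / year for year in years]
```

Checking the labels before a long scraping run is cheap and catches typos such as "2019-2021" early.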
Scrapes links to parliamentary proceedings from the official website.
- Creates CSV files in data/links/ containing links for each parliamentary year
- No command line arguments required
Downloads the XML files for each proceeding.
- Downloads files to data/handelingen/<year>/
- Maintains a resume file to continue interrupted downloads
- Logs errors to error_log.txt
- No command line arguments required
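The resume file and error log described above could work roughly as follows. This is a sketch under assumptions, not the repository's actual code: the function names, the one-identifier-per-line resume format, and the tab-separated error log format are all hypothetical, and `fetch` stands in for whatever performs the real download.

```python
from pathlib import Path


def load_done(resume_path):
    """Return the set of identifiers already recorded in the resume file."""
    path = Path(resume_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_done(resume_path, item_id):
    """Append one finished identifier so that a later run can skip it."""
    with open(resume_path, "a", encoding="utf-8") as fh:
        fh.write(item_id + "\n")


def download_all(items, fetch, resume_path, error_log="error_log.txt"):
    """Fetch every item not yet in the resume file; log failures and keep going."""
    done = load_done(resume_path)
    for item_id in items:
        if item_id in done:
            continue  # already downloaded in an earlier run
        try:
            fetch(item_id)
        except Exception as exc:
            # Assumed log format: identifier, tab, error message.
            with open(error_log, "a", encoding="utf-8") as fh:
                fh.write(f"{item_id}\t{exc}\n")
            continue
        mark_done(resume_path, item_id)
        done.add(item_id)
```

Because an item is only recorded after `fetch` succeeds, failed items are retried automatically on the next run.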
Validates the downloaded files against the collected links.
- Generates a validation report in validation_report.txt
- No command line arguments required
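A validation step of this kind amounts to a set comparison between the identifiers expected from the links and the files present on disk. The sketch below assumes that each downloaded file is named after its identifier with an .xml suffix; that naming scheme and the function name `validate_downloads` are assumptions made for illustration.

```python
from pathlib import Path


def validate_downloads(expected_ids, download_dir, suffix=".xml"):
    """Compare expected identifiers with the files found in download_dir."""
    files = list(Path(download_dir).glob(f"*{suffix}"))
    found = {p.stem for p in files}
    expected = set(expected_ids)
    return {
        # Listed in the links but not on disk.
        "missing": sorted(expected - found),
        # On disk but not listed in the links.
        "unexpected": sorted(found - expected),
        # Zero-byte files, which usually indicate a failed download.
        "empty": sorted(p.stem for p in files if p.stat().st_size == 0),
    }
```

The returned dictionary can then be written out line by line to a report file.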
Collects metadata for each proceeding from their HTML pages.
- Creates CSV files in data/meta/ containing metadata
- Maintains a resume file to continue interrupted scraping
- Logs errors to meta_error_log.txt
- No command line arguments required
Validates the collected metadata against the downloaded files.
- Generates a validation report in meta_validation.log
- No command line arguments required
Retries metadata collection for failed files.
- Updates existing metadata files
- Logs retries to meta_retry.log
- No command line arguments required
Processes the XML files and extracts structured data about speeches.
- Creates CSV files in data/parsed/ containing processed speeches
- Logs processing to parser.log
- No command line arguments required
Previews the processed speeches data.
- Optional arguments:
  - year: Parliamentary year to preview (default: "2022-2023")
  - max_rows: Number of rows to show (default: 10)
  - text_length: Maximum length for speech text preview (default: 100)
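The three optional arguments and their defaults could be declared with argparse along these lines. The argument names and defaults come from the list above; exposing them as `--flag` style options, as well as the `shorten` helper, are assumptions, since the text does not say how the script actually reads them.

```python
import argparse


def build_parser():
    """Build a parser for the preview options (flag style is an assumption)."""
    parser = argparse.ArgumentParser(description="Preview processed speeches")
    parser.add_argument("--year", default="2022-2023",
                        help="Parliamentary year to preview")
    parser.add_argument("--max_rows", type=int, default=10,
                        help="Number of rows to show")
    parser.add_argument("--text_length", type=int, default=100,
                        help="Maximum length for speech text preview")
    return parser


def shorten(text, text_length):
    """Cut text to text_length characters, appending "..." when truncated."""
    if len(text) <= text_length:
        return text
    return text[:text_length] + "..."
```

With this layout, calling `build_parser().parse_args(["--year", "2021-2022", "--max_rows", "3"])` would override two defaults and keep text_length at 100.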
Development tool for analyzing XML structure and content.
- Processes first 5 files from 2022-2023
- Useful for debugging and understanding the XML structure
- No command line arguments required
.
├── data/
│ ├── handelingen/ # Downloaded XML files
│ ├── links/ # CSV files with proceeding links
│ ├── meta/ # Metadata CSV files
│ └── parsed/ # Processed speech data
├── *.log # Various log files
└── error_log.txt # Error logs
- All scripts include error logging and can resume from interruptions
- Failed downloads and failed metadata collection are logged to their respective error log files (error_log.txt and meta_error_log.txt)
- Validation scripts help identify any missing or problematic files
- The scraper includes rate limiting and random delays to avoid overloading the server
- Downloads are resumable in case of interruption
- The scripts process parliamentary years 2021-2022, 2022-2023, and 2023-2024 by default
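The rate limiting with random delays mentioned in the notes above can be pictured as a fixed base pause plus random jitter between consecutive requests. The sketch below is illustrative only: the function names and the default timings (1.0 s base, up to 0.5 s jitter) are assumptions, not values taken from the scripts, and `fetch` stands in for the real request function.

```python
import random
import time


def polite_delay(base=1.0, jitter=0.5):
    """Return a delay in seconds: base plus a random extra in [0, jitter]."""
    return base + random.uniform(0, jitter)


def fetch_politely(urls, fetch, base=1.0, jitter=0.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a randomized interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause is needed before the very first request.
            sleep(polite_delay(base, jitter))
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.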
This work is licensed under GPLv3.
Contact: Pieter van Boheemen, p.vanboheemen@democratiemonitor.nl