This project downloads, analyzes, and visualizes data from the Philippines Department of Public Works and Highways (DPWH) archive hosted on the Internet Archive.
The DPWH archive contains approximately 60 GB of data across 31 zip files, a comprehensive dataset of public works and infrastructure information from the Philippine government.
- Python 3.7+
- Internet connection for downloading
- Sufficient disk space (~60GB for full download)
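
The disk-space requirement can be checked before starting a download. The following is a minimal sketch using only the standard library; the function name `has_enough_space` and the constant `REQUIRED_BYTES` are illustrative and not part of this repository.

```python
import shutil

# ~60 GB, the approximate size of the full download quoted above.
REQUIRED_BYTES = 60 * 1024**3


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space."""
    return shutil.disk_usage(path).free >= required_bytes


# Example: has_enough_space("/data") checks the drive that holds /data.
```

Extracting the zip files needs additional space on top of the download itself, so treat the 60 GB figure as a lower bound.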
- Clone or download this repository
- Install dependencies:

    pip install -r requirements.txt

Execute the complete data pipeline:

    python dpwh_pipeline_runner.py

This will:
- Download all 31 zip files (~60GB)
- Extract and analyze the data structure
- Generate interactive visualizations
- Create an HTML report
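
The sequence above could be orchestrated roughly as follows. This is a sketch under assumptions, not the code in `dpwh_pipeline_runner.py`: it assumes each stage is a separate script run with the current Python interpreter, and that a non-zero exit code should stop the pipeline. The `run` parameter exists only so the function can be exercised without launching real processes.

```python
import subprocess
import sys

# Script names as listed in this README; running them in this order
# through subprocess is an assumption about how the runner works.
STEPS = [
    "dpwh_archive_downloader.py",
    "dpwh_archive_analyzer.py",
    "dpwh_data_visualizer.py",
]


def run_pipeline(steps=STEPS, run=subprocess.run):
    """Run each script in order and stop at the first failure.

    Returns the list of scripts that finished with exit code 0.
    """
    completed = []
    for script in steps:
        result = run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}")
            break
        completed.append(script)
    return completed
```

Stopping at the first failure avoids analyzing a partial download or rendering a report from a missing summary file.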
open-data-visualization/
├── dpwh_archive_downloader.py # Downloads all archive files
├── dpwh_archive_analyzer.py # Extracts and analyzes data
├── dpwh_data_visualizer.py # Creates visualizations
├── dpwh_pipeline_runner.py # Main orchestration script
├── requirements.txt # Python dependencies
└── README.md # This file

    python dpwh_archive_downloader.py

- Downloads all 31 zip files from the Internet Archive
- Creates the dpwh_archive/ directory
- Shows progress for each download
- Handles resume for interrupted downloads
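
One simple way to get this behavior is to skip any file that already exists and to write in-progress downloads under a temporary name. The sketch below illustrates that idea and is not the downloader's actual code. `ITEM_ID` is a hypothetical placeholder because the real Internet Archive item identifier is not given here; the `https://archive.org/download/<identifier>/<filename>` pattern is the Internet Archive's standard download URL form. It restarts interrupted files from the beginning rather than resuming at a byte offset.

```python
import os
import urllib.request
from urllib.parse import quote

# Hypothetical placeholder: replace with the real item identifier.
ITEM_ID = "EXAMPLE_ITEM_ID"
BASE_URL = f"https://archive.org/download/{ITEM_ID}"


def pending_files(names, dest_dir="dpwh_archive"):
    """Return the names that do not yet exist as non-empty files in dest_dir."""
    pending = []
    for name in names:
        path = os.path.join(dest_dir, name)
        if not (os.path.isfile(path) and os.path.getsize(path) > 0):
            pending.append(name)
    return pending


def download_all(names, dest_dir="dpwh_archive"):
    """Download every pending file, reporting progress per file."""
    os.makedirs(dest_dir, exist_ok=True)
    todo = pending_files(names, dest_dir)
    for i, name in enumerate(todo, start=1):
        print(f"[{i}/{len(todo)}] {name}")
        final_path = os.path.join(dest_dir, name)
        part_path = final_path + ".part"
        urllib.request.urlretrieve(f"{BASE_URL}/{quote(name)}", part_path)
        # Rename only after the transfer finishes, so an interrupted
        # download never looks like a completed file on the next run.
        os.replace(part_path, final_path)
```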

    python dpwh_archive_analyzer.py

- Extracts all zip files to dpwh_archive/extracted/
- Analyzes file structure and types
- Examines metadata files (XML, SQLite)
- Samples data files for content analysis
- Generates data_summary.json
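
The extraction and file-type analysis steps can be sketched with the standard library alone. This is an illustration under assumptions: extracting each zip into its own subdirectory, and the `file_counts` / `total_bytes` keys in the summary, are choices made for this example and may not match what `dpwh_archive_analyzer.py` actually writes.

```python
import json
import zipfile
from collections import Counter
from pathlib import Path


def extract_all(archive_dir, out_dir):
    """Extract every .zip in archive_dir; return (extracted, failed) name lists."""
    extracted, failed = [], []
    for zip_path in sorted(Path(archive_dir).glob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as zf:
                # Assumption: one subdirectory per archive, named after the zip.
                zf.extractall(Path(out_dir) / zip_path.stem)
            extracted.append(zip_path.name)
        except (zipfile.BadZipFile, OSError) as exc:
            # A corrupt archive is reported and skipped, not fatal.
            print(f"Skipping {zip_path.name}: {exc}")
            failed.append(zip_path.name)
    return extracted, failed


def summarize(out_dir):
    """Count files and total bytes per (lower-cased) extension under out_dir."""
    counts, sizes = Counter(), Counter()
    for path in Path(out_dir).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    # Hypothetical schema, chosen for this example only.
    return {"file_counts": dict(counts), "total_bytes": dict(sizes)}


def write_summary(summary, path):
    """Write the summary dictionary to a JSON file."""
    Path(path).write_text(json.dumps(summary, indent=2), encoding="utf-8")
```

A typical call sequence would be `extract_all("dpwh_archive", "dpwh_archive/extracted")`, then `summarize("dpwh_archive/extracted")`, then `write_summary(...)`.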

    python dpwh_data_visualizer.py

- Creates interactive charts and graphs
- Generates comprehensive HTML report
- Shows file type distributions
- Displays directory structure analysis
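
To show the shape of a self-contained report, here is a dependency-free stand-in that renders a file type distribution as an HTML table with inline bars. It deliberately does not use Plotly, which the project's charts are built with, so it should be read as a sketch of the report idea rather than the visualizer's output. The function name `render_report` and the input format (a mapping from extension to file count) are assumptions.

```python
from html import escape


def render_report(file_counts, title="DPWH Archive Analysis"):
    """Return a single HTML page with one row and bar per file extension."""
    total = sum(file_counts.values()) or 1  # avoid division by zero
    rows = []
    # Largest categories first.
    for ext, count in sorted(file_counts.items(), key=lambda kv: kv[1], reverse=True):
        pct = 100 * count / total
        rows.append(
            f"<tr><td>{escape(ext)}</td><td>{count}</td>"
            f'<td><div style="background:#4a90d9;height:1em;width:{pct:.1f}%"></div></td></tr>'
        )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title></head><body>"
        f"<h1>{escape(title)}</h1>"
        "<table style='width:100%'>"
        "<tr><th>Extension</th><th>Files</th><th>Share</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )
```

Because all styling is inline and no external assets are loaded, the resulting string can be written to a single `.html` file and opened offline.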
After running the pipeline, you'll find:
- dpwh_archive/ - Downloaded zip files
- dpwh_archive/extracted/ - Extracted data files
- dpwh_archive/data_summary.json - Analysis results
- dpwh_analysis_report.html - Interactive visualization report
- Comprehensive Download: All 31 archive files with progress tracking
- Smart Extraction: Automatic zip file extraction with error handling
- Data Analysis: File type analysis, size distribution, metadata examination
- Interactive Visualizations: Plotly-based charts and graphs
- HTML Report: Self-contained report with embedded visualizations
- Resume Capability: Skip already downloaded files
- Error Handling: Robust error handling throughout the pipeline
The generated report includes:
- File type distribution charts
- File size analysis by extension
- Directory structure overview
- Archive metadata summary
- Interactive data exploration tools
The DPWH archive contains various file types including:
- CSV files with infrastructure data
- PDF documents and reports
- Excel spreadsheets
- XML metadata files
- Database files (SQLite)
- Text documents
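
Two of these types, CSV and SQLite, can be inspected with the standard library. The sketch below shows one way to sample a CSV file and list the tables in a SQLite database; the function names are illustrative, and the analyzer script may do this differently.

```python
import csv
import sqlite3
from itertools import islice


def sample_csv(path, n=5):
    """Return the header row and up to n data rows from a CSV file."""
    # errors="replace" keeps sampling going if the file is not valid UTF-8.
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        return header, list(islice(reader, n))


def list_sqlite_tables(path):
    """Return the names of the tables in a SQLite database file.

    Note: sqlite3.connect creates an empty database if `path` does not
    exist, so check that the file exists before calling this.
    """
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Reading only the first few rows keeps sampling fast even when individual CSV files are large.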
You can modify the scripts to:
- Change download directory
- Adjust visualization parameters
- Add custom data analysis
- Modify report styling
- Add new file type handlers
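
As an example of the last item, a new file type handler could be added through a small registry keyed by extension. This is a hypothetical design for illustration; `HANDLERS`, `handler`, and `describe` are not names from the existing scripts, which may organize their handlers differently.

```python
from pathlib import Path

# Maps a lower-cased extension such as ".txt" to a handler function.
HANDLERS = {}


def handler(*extensions):
    """Decorator that registers a function for one or more extensions."""
    def register(func):
        for ext in extensions:
            HANDLERS[ext.lower()] = func
        return func
    return register


@handler(".txt", ".md")
def describe_text(path):
    """Example handler: report the number of lines in a text file."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return {"kind": "text", "lines": len(text.splitlines())}


def describe(path):
    """Dispatch on extension; unknown types get a size-only record."""
    func = HANDLERS.get(Path(path).suffix.lower())
    if func is None:
        return {"kind": "unknown", "bytes": Path(path).stat().st_size}
    return func(path)
```

With this layout, supporting another format means writing one decorated function and leaving the dispatch code untouched.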
- The download process may take several hours depending on your internet connection
- Ensure you have sufficient disk space (60GB+ recommended)
- The analysis process may take time for large datasets
- All scripts include progress indicators and error handling
Feel free to submit issues, feature requests, or pull requests to improve this data visualization pipeline.
This project is for educational and research purposes. Please respect the Internet Archive's terms of service and the DPWH's data usage policies.
Last Updated: October 2025