A Python-based web scraper to extract and process metadata from Statistics Canada (StatCan) sources.
The StatCan Web Scraper automates the retrieval of statistical metadata from the official Statistics Canada website. It extracts, parses, and stores the structured data for further analysis.
- Scrape metadata from StatCan tables and datasets
- Parse and clean the extracted data
- Export results to CSV or JSON formats
- Modular design for easy extension
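The clean-and-export steps listed above could be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual implementation: the function names `clean_record` and `export_records`, the record fields, and the sample values are all hypothetical.

```python
import csv
import io
import json


def clean_record(record: dict) -> dict:
    """Strip whitespace from keys and string values; drop empty values.

    Hypothetical helper: the real cleaning rules are not described in this README.
    """
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value in ("", None):
            continue
        cleaned[key.strip()] = value
    return cleaned


def export_records(records: list[dict], fmt: str = "csv") -> str:
    """Serialize a list of metadata records to CSV or JSON text."""
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt == "csv":
        # Use the union of keys, in first-seen order, so sparse records still fit.
        fieldnames = list(dict.fromkeys(k for r in records for k in r))
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"Unsupported format: {fmt!r}")


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    raw = [{" title ": " Example table ", "frequency": "Monthly", "note": ""}]
    records = [clean_record(r) for r in raw]
    print(export_records(records, "csv"))
    print(export_records(records, "json"))
```

Returning text rather than writing files directly keeps the serializers easy to test; the caller decides where the output goes.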
- Clone the repository:
git clone https://github.com/avtomatik/statcan_web_scraper.git
cd statcan_web_scraper
- Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
python3 -m pip install --upgrade pip
pip install --no-cache-dir -r requirements.txt
Run the scraper via the main script:
python3 src/main.py
# python3 -m src.main
The scraper will retrieve metadata and store results in the output/ directory.
To scrape a specific table, pass it with --dataset:
python3 src/main.py --dataset "Table 11-10-0010-01"
statcan_web_scraper/
├── src/
│   ├── main.py          # Entry point
│   ├── core/
│   ├── spiders/         # Spider definition
│   └── utils/           # Utility functions
├── requirements.txt # Python dependencies
├── README.md
└── LICENSE.md
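As the entry point, src/main.py has to accept the --dataset option used in the command above. One way this might look is sketched below with `argparse`. The helper names `normalize_table_number` and `build_parser` are hypothetical, and the regular expression is an assumption based only on the "11-10-0010-01" format that appears in this README.

```python
import argparse
import re

# Assumed shape of a StatCan table number, inferred from "11-10-0010-01".
TABLE_NUMBER = re.compile(r"(\d{2})-(\d{2})-(\d{4})-(\d{2})")


def normalize_table_number(text: str) -> str:
    """Extract a table number such as '11-10-0010-01' from free-form input."""
    match = TABLE_NUMBER.search(text)
    if match is None:
        # argparse turns this into a usage error and exits.
        raise argparse.ArgumentTypeError(f"No table number found in {text!r}")
    return "-".join(match.groups())


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser with a single optional --dataset argument."""
    parser = argparse.ArgumentParser(description="StatCan metadata scraper (sketch)")
    parser.add_argument(
        "--dataset",
        type=normalize_table_number,
        default=None,
        help='Table to scrape, e.g. "Table 11-10-0010-01"',
    )
    return parser


if __name__ == "__main__":
    # Explicit argument list so the demo does not depend on sys.argv.
    args = build_parser().parse_args(["--dataset", "Table 11-10-0010-01"])
    print(args.dataset)
```

Validating the value inside the `type` callable means malformed input is rejected with a usage message before any network request is made.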
Configuration options (like dataset IDs, output formats, and logging levels) can be set in src/config.py.
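A configuration module covering the options named above might be organized as in the sketch below. The contents of src/config.py are not shown in this README, so the class name `ScraperConfig`, its field names, and its default values are all assumptions made for illustration.

```python
import logging
from dataclasses import dataclass
from pathlib import Path

ALLOWED_FORMATS = {"csv", "json"}
ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical settings container; names and defaults are illustrative."""

    dataset_ids: tuple[str, ...] = ("11-10-0010-01",)
    output_format: str = "csv"
    output_dir: Path = Path("output")
    log_level: str = "INFO"

    def __post_init__(self) -> None:
        # Fail early on values the rest of the program could not handle.
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"Unsupported output format: {self.output_format!r}")
        if self.log_level not in ALLOWED_LOG_LEVELS:
            raise ValueError(f"Unknown logging level: {self.log_level!r}")


if __name__ == "__main__":
    config = ScraperConfig(output_format="json", log_level="DEBUG")
    # logging accepts standard level names as strings.
    logging.basicConfig(level=config.log_level)
    logging.getLogger(__name__).debug("Loaded config: %s", config)
```

A frozen dataclass keeps the settings immutable after start-up, so a typo in a value surfaces as a `ValueError` when the configuration is created rather than midway through a run.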
Contributions are welcome! To contribute:
- Fork the repository
- Create a new branch (git checkout -b feature/my-feature)
- Make your changes
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/my-feature)
- Open a Pull Request
This project is licensed under the MIT License. See LICENSE.md for details.