This template provides a starting point for git scraping: the technique, named by Simon Willison, of scraping data from websites and automatically committing snapshots of it to a Git repository using scheduled workflows.
Git scraping creates an audit trail of data snapshots over time. It leverages Git's version control and a continuous integration service's scheduling capabilities to regularly scrape sites and save data without needing to manage servers.
The key benefit is automating web scrapers to run on a schedule with little overhead. The scraped data is stored incrementally, so you can review historical changes. This enables use cases like price monitoring, tracking content updates, building research datasets, and more. Because these resources are available at virtually no cost, the technique is practical for a wide range of projects.
Tools like GitHub Actions, GitLab CI, and others make git scraping adaptable to diverse sites and data needs. The scraping logic just needs to output data in serialized formats like CSV or JSON, which then get committed back to Git. This makes the data easily consumable downstream for analysis and visualization.
This template includes a sample workflow to demonstrate the core git scraping capabilities. Read on to learn how to customize it!
The workflow defined in `.github/workflows/scrape.yaml` runs on a defined schedule to:

- Check out the code
- Set up the Python environment
- Install dependencies via Pipenv
- Run the Python script `script.py` to scrape data
- Commit any updated data files to the Git repository
The workflow schedule is configured with cron syntax to run:

- Every day at 8 PM UTC (cron expression `0 20 * * *`)

Scraping once a day is a good rule of thumb: it is generally respectful of the target website and adds no measurable burden to the site's resources. You can use [crontab.guru](https://crontab.guru) to generate your own cron schedule.
The main libraries used are:

- `bs4` (BeautifulSoup) for parsing HTML
- `requests` for making HTTP requests to scrape web pages
- `loguru` for logging errors and run info
- `pytz` for handling datetimes and timezones
- `waybackpy` for scraping web archives (optional)
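As a rough illustration of how these libraries fit together, here is a minimal sketch of a scrape function. It is not the actual `script.py`; the URL, the CSS selector, and the field names are placeholders you would replace for your target site, and `waybackpy` is omitted since it is optional.

```python
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from loguru import logger
import pytz


def scrape(url: str) -> list[dict]:
    """Fetch a page and return a list of scraped records."""
    logger.info("Scraping {}", url)
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    scraped_at = datetime.now(pytz.utc).isoformat()

    # Placeholder selector -- adapt this to the structure of your target page.
    return [
        {"title": element.get_text(strip=True), "scraped_at": scraped_at}
        for element in soup.select("h2.title")
    ]
```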
To adapt this for your own scraping project:

- Use this template to create your own repository
- Modify `script.py` to scrape different sites and data points:
  - Modifying the request URL
  - Parsing the HTML with BeautifulSoup to extract relevant data
  - Processing and outputting the scraped data as CSV, JSON, etc.
- Update the workflow schedule as needed
- Output and commit the scraped data to CSV, JSON, or other formats (see the sketch after this list)
- Add any additional libraries you need to the `Pipfile`
- Update this `README.md` with project specifics
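For the output step, a hypothetical helper like the one below writes scraped records to a CSV file that the workflow can then commit. The `data/output.csv` path and the field handling are assumptions for illustration, not part of the template.

```python
import csv
from pathlib import Path


def write_csv(records: list[dict], path: str = "data/output.csv") -> None:
    """Write scraped records to a CSV file tracked by Git."""
    if not records:
        return
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```

Keeping a stable column order (and sorting rows where it makes sense) keeps the Git diffs small, which makes the commit history easier to review.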
Feel free to use this as a starter kit for your Python web scraping projects!
For local development of Python projects, we recommend using a version manager and a virtual environment manager.
asdf is a version manager that allows you to easily install and manage multiple versions of languages and runtimes like Python. This is useful so you can upgrade/downgrade Python versions without interfering with your system Python.
Pipenv creates a virtual environment for your project to isolate its dependencies from other projects. This lets you install packages safely without impacting globally installed packages that other tools or apps may rely on. The virtual environment also helps make builds reproducible across different systems.
Below we detail how to set up these tools to develop this template scrape project locally.
Once you have installed `asdf`, you can install the Python plugin with:

```sh
asdf plugin add python
```

Then you can install the latest version of Python with:

```sh
asdf install python latest
```

After that, install `pipenv` with:

```sh
pip install pipenv
```

Then install the dependencies with:

```sh
pipenv install --dev
```

This will create a virtual environment and install the dependencies from the `Pipfile`. The `--dev` flag also installs the development dependencies, which include `ipykernel` for Jupyter Notebook support.

You can then run the script to try it out with:

```sh
pipenv run python script.py
```
Web scraping is a powerful tool for gathering data, and courts have upheld its legality in some prominent cases. But it is important to use it responsibly and ethically. Here are some guidelines to consider:
- Review the website's Terms of Service and `robots.txt` file to understand allowances and restrictions for automated scraping before starting (see the `robots.txt` check sketch after this list).
- Avoid scraping copyrighted content verbatim without permission. Summarizing is safer. Use data judiciously under "fair use" principles.
- Do not enable illegal or fraudulent uses of scraped data, and be mindful of security and privacy.
- Check that your scraping activity does not overload or harm the website's servers. Scale activity gradually.
- Reflect on whether scraping could unintentionally reveal private user or organizational information from the site.
- Consider whether scraped data could negatively impact the website's value or business model.
- Assess whether decisions made using the data could contribute to bias, discrimination, or unfair profiling.
- Validate the quality of scraped data, and recognize the limitations in relevance and accuracy inherent to web data.
- Document your scraping process thoroughly for replicability, transparency, and accountability.
- Continuously re-evaluate your scraping program against applicable laws and ethical principles.
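As a small example of the first guideline, the standard library's `urllib.robotparser` can check whether a path is allowed before you scrape it. The site URL and user agent string below are placeholders, not values from this template.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user agent -- substitute your target and identifier.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

if robots.can_fetch("my-git-scraper", "https://example.com/some/page"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt -- skip this page")
```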