Session materials for the Web Scraping with Python course at #NICAR20

hancush/web-scraping-with-python

Web scraping with Python

If you need data that's trapped on a website, writing some code to scrape the page could be your solution. This entry-level class will show you how to use the Python programming language to harvest information from websites into a spreadsheet. We'll introduce you to the command line and show you how to write enough code to fetch and parse web content.

Workshop prerequisites: This class is programming for beginners. Some basic familiarity with Python and HTML is helpful but not required.

Class outline

  1. 🐍 Python basics (45 minutes to 1 hour)
  2. 💧 Water break! (10 minutes)
  3. 🔣 HTML basics (15 minutes)
  4. 🛠 Scraping the web (Remaining time)

You will learn...

  • Some Python basics
    • Data types: String, numeric, and Boolean types
    • Data structures: Lists and dictionaries
    • Control flow: if... else statements
    • Iteration: for... in statements
    • Functions: Reusable bits of code
  • How to write and run Python code using Jupyter Notebooks
    • Retrieve web content with requests
    • Parse meaningful information from raw HTML with beautifulsoup4
    • Output tabular data with csv
  • How to inspect source code in your browser
  • How to go about getting unstuck
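
To give a flavor of how those pieces fit together, here is a minimal sketch of the fetch, parse, and write steps. It is not the class notebook: the sample HTML, the `meetings` table id, and the function names `fetch_html`, `parse_rows`, and `write_csv` are all made up for illustration. The page is inlined so the sketch runs without a network connection.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup

# Hypothetical sample page, inlined so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
<table id="meetings">
  <tr><th>Body</th><th>Date</th></tr>
  <tr><td>City Council</td><td>2020-03-05</td></tr>
  <tr><td>Zoning Board</td><td>2020-03-06</td></tr>
</table>
</body></html>
"""


def fetch_html(url):
    """Retrieve a page's raw HTML with requests (not called in this demo)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse_rows(html):
    """Return one dictionary per table row, keyed by the header cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="meetings")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so skip it
            rows.append(dict(zip(headers, cells)))
    return rows


def write_csv(rows, outfile):
    """Write a list of dictionaries to an open file as CSV."""
    writer = csv.DictWriter(outfile, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


rows = parse_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
write_csv(rows, buffer)
print(buffer.getvalue())
```

To scrape a live page, you would pass the result of `fetch_html(url)` to `parse_rows` and adjust the table lookup to match that page's markup.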

Next steps

Looking to expand on what you've done in this workshop? Here are some new adventures:

  • Install Python on your own machine and learn how to manage Python dependencies
  • Learn how to run your scripts from the command line
    • 💡 Check out this tutorial to review the scraping concepts covered in this class and learn the basics of the command line
  • Keep writing simple scrapers!
    • 💡 For inspiration, check out City Scrapers, a collection of scrapers that gather information on public meetings, written by 60+ contributors of all skill levels
  • Learn more precise HTML parsing approaches, e.g., lxml and xpath
  • Graduate to more complicated scraping tasks, e.g., scrapes that rely on state
    • 💡 For inspiration, check out python-legistar-scraper, a Python library for scraping legislative data from the Legistar web interface and API
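
For a taste of what XPath-style selection looks like, here is a small sketch. It deliberately uses the standard library's `xml.etree.ElementTree` as a stand-in: it supports only a limited subset of XPath and needs well-formed markup, whereas lxml offers full XPath and copes with messier real-world HTML. The sample markup and class names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree cannot parse sloppy HTML).
SAMPLE_MARKUP = """
<table>
  <tr class="header"><th>Body</th><th>Date</th></tr>
  <tr class="meeting"><td>City Council</td><td>2020-03-05</td></tr>
  <tr class="meeting"><td>Zoning Board</td><td>2020-03-06</td></tr>
</table>
"""

root = ET.fromstring(SAMPLE_MARKUP)

# Select the first <td> of every <tr> whose class attribute is exactly "meeting".
first_cells = root.findall(".//tr[@class='meeting']/td[1]")
bodies = [cell.text for cell in first_cells]
print(bodies)
```

A single path expression does the work of the nested loop and `if` check you would otherwise write by hand, which is what makes this approach more precise.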

Credits

The content for this course was cribbed heavily from IRE's one-hour course on web scraping with Python.

Some copy in the HTML basics section was lifted from the canonical (to me) First web scraper tutorial, also developed for IRE. When you're ready to move from Jupyter notebooks into the command line, I'd strongly recommend starting with that tutorial!

Who am I?

👋 I'm Hannah! I apply my journalism background to civic technology projects as a Lead Developer at DataMade. These include:

  • Writing a web driver to fill out a branching, stateful web form, making it easier to complete a prerequisite for doing business with or receiving funds from the City of Chicago
  • Maintaining an inter-system scrape, transform, load (ETL) pipeline for legislative data
  • Managing millions of payroll and pension records to power the Illinois Public Salaries Database and the Illinois Public Pensions Database
