Refactor Web Scraping for Printers on Campus #360
Overview
In this commit, I:

- refactor the printer scraper to read from the API request that populates the page, rather than parsing HTML with BeautifulSoup;
- standardize scraped printer locations against a canonical list of building names;
- introduce labels as a new field of information for each printer;
- introduce migrations to the application's database; and
- add scripts to run migrations and populate the database.
Changes Made
Change 1: Printer information is retrieved from the API request that populates the page
Previously, the application used BeautifulSoup to parse the HTML of the webpage that displays details about each printer on campus. With this change, I transition the application from HTML scraping to API scraping: details about each printer are now read from the JSON object returned by the API call that populates the page's table, instead of from the rendered webpage. This makes our information retrieval cleaner, less brittle (the page's markup can change without breaking the scraper), and more efficient.
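A minimal sketch of the new approach, assuming a hypothetical endpoint URL and JSON field names (the real ones come from whatever request the printers page issues to fill its table):

```python
import requests

# Hypothetical endpoint; the real URL is whatever request the printers
# page makes to populate its HTML table.
PRINTERS_API_URL = "https://example.cornell.edu/api/printers"

def scrape_printers() -> list[dict]:
    """Fetch printer details from the JSON response instead of parsing HTML."""
    response = requests.get(PRINTERS_API_URL, timeout=10)
    response.raise_for_status()
    # Field names here are assumptions about the JSON shape.
    return [
        {
            "name": row["name"],
            "location": row["location"],
            "capabilities": row["capabilities"],
        }
        for row in response.json()
    ]
```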
Change 2: Printer location information is standardized
To account for minor "mutations" in the scraped data, I also used the `difflib` Python library to map scraped building names to a canonical list of building names, ensuring that all scraped locations are standardized. This ensures that if "Baker Lab CLOSED FOR CONSTRUCTION," for example, is scraped from the website, we still only see "Baker Lab" in the actual application.
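This is roughly the shape of that mapping; the canonical list below is a stand-in, and the cutoff is a tunable assumption chosen so that trailing annotations like "CLOSED FOR CONSTRUCTION" still match:

```python
import difflib

# Stand-in canonical list; the real one names every campus building.
CANONICAL_BUILDINGS = ["Baker Lab", "Mann Library", "Olin Library"]

def standardize_location(scraped_name: str) -> str | None:
    """Map a scraped building name onto the canonical list, or None."""
    matches = difflib.get_close_matches(
        scraped_name, CANONICAL_BUILDINGS, n=1, cutoff=0.4
    )
    return matches[0] if matches else None

# e.g. standardize_location("Baker Lab CLOSED FOR CONSTRUCTION") == "Baker Lab"
```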
Change 3: Introduce labels as another field of information for each printer

To also implement labels for each printer, I include "Labels" as another field in each object in the list returned from the `scrape_printers` function. To populate this field, I created a canonical list of labels (which, notably, only accounts for "Residents Only," "AA&P Students Only," and "Landscape Architecture Students Only," and is unlikely to be exhaustive) and then used the `difflib` Python library to recognize any canonical labels in the parsed data. I also include printer capabilities as labels, meaning that "Color," "Black & White," and "Color, Scan, & Copy" are labels as well.

For this information to be stored in our SQLite database, I also introduced two new tables: a "labels" table, which stores the unique labels a given printer may have, and a "printer_labels" table, a junction table mapping each printer to its corresponding labels.
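A sketch of the two tables, with illustrative column names (the real definitions live in the migration files introduced in Change 4):

```python
import sqlite3

# Illustrative schema; column names are assumptions. "labels" holds each
# unique label once, and "printer_labels" is the junction table tying a
# printer to any number of labels.
SCHEMA = """
CREATE TABLE IF NOT EXISTS labels (
    label_id INTEGER PRIMARY KEY,
    label    TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS printer_labels (
    printer_id INTEGER NOT NULL,
    label_id   INTEGER NOT NULL REFERENCES labels(label_id),
    PRIMARY KEY (printer_id, label_id)
);
"""

if __name__ == "__main__":
    with sqlite3.connect("printers.db") as connection:  # path illustrative
        connection.executescript(SCHEMA)
```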
Finally, for this information to be retrieved via our API, I updated the `fetchAllPrinters` function found in `EcosystemUtils.js` to also return a list of labels for each returned printer.
Change 4: Introduce migrations to the application's database

Previously, the application's backend lacked a structured method for modifying the schema of the database after its initial creation, which prevented me from introducing the `labels` and `printer_labels` tables without reinitializing the database. To account for this, I used the `better-sqlite3` library to implement migrations, creating a directory that stores the changes made to the database as `.sql` files. Note: the naming of each migration file follows the convention `YYYYMMDD_HHMM_<change>`.
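The real runner is `run-migrations.js` (see Change 5) and uses `better-sqlite3`; sketched here in Python with assumed paths and table names, the logic is to record applied migrations in a bookkeeping table and apply any unapplied `.sql` files in filename order, which the timestamp prefix makes chronological:

```python
import os
import sqlite3

MIGRATIONS_DIR = "src/data/db/migrations"  # assumed location

def run_migrations(db_path: str) -> None:
    connection = sqlite3.connect(db_path)
    # Bookkeeping table recording which migrations have been applied.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in connection.execute("SELECT name FROM migrations")}
    # The YYYYMMDD_HHMM_<change>.sql convention makes lexicographic
    # order equal chronological order.
    for name in sorted(os.listdir(MIGRATIONS_DIR)):
        if not name.endswith(".sql") or name in applied:
            continue
        with open(os.path.join(MIGRATIONS_DIR, name)) as f:
            connection.executescript(f.read())
        connection.execute("INSERT INTO migrations (name) VALUES (?)", (name,))
        connection.commit()
```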
Change 5: Add scripts to run migrations and populate database

I implement two new scripts to set up the application's database: `npm run migrate` and `npm run populate:db`. `npm run migrate` calls the `run-migrations.js` file, which executes a function to apply all migrations (that haven't already been applied) to the database. `npm run populate:db` calls the `populate_db.py` file to execute the scrapers for the libraries and printers, and fills the database with the newly scraped data.
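On a fresh checkout, database setup is therefore:

```sh
npm run migrate      # apply any unapplied .sql migrations
npm run populate:db  # run the library and printer scrapers, fill the database
```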
Test Coverage

To test the refactored web scraping, the label parsing and mapping, and the location parsing and mapping, I ran `src/data/scrapers/printers.py` as a module to ensure that each location was mapped to a name in the canon, and that the correct labels were assigned to each printer. To do this, I pasted the following code at the bottom of my file.
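In sketch form (the key names are assumptions about each scraped object's shape), the harness prints every printer's standardized location and labels for manual comparison:

```python
if __name__ == "__main__":
    # Print each scraped printer's mapped location and labels so they
    # can be checked against the Cornell site by hand.
    for printer in scrape_printers():
        print(printer["location"], "->", printer["labels"])
```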
Then, I moved my working directory to the `scrapers` folder, and ran the file using the following command.
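Run as a module from the `scrapers` folder, the invocation is presumably along these lines:

```sh
python -m printers
```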
To test the migrations, I first added the following code to the bottom of the `populate_db` function in the `src/data/scripts/populate_db.py` file.
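A sketch of that addition, assuming `populate_db` holds an open sqlite3 connection named `connection` and the table and column names from the schema sketch above; it reads each printer's labels back out through the junction table:

```python
    # Read the freshly inserted data back out: one row per printer with
    # its labels concatenated, for comparison against the website.
    cursor = connection.execute(
        """
        SELECT p.printer_id, GROUP_CONCAT(l.label, ', ')
        FROM printers p
        LEFT JOIN printer_labels pl ON pl.printer_id = p.printer_id
        LEFT JOIN labels l ON l.label_id = pl.label_id
        GROUP BY p.printer_id
        """
    )
    for printer_id, labels in cursor.fetchall():
        print(printer_id, "->", labels)
```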
Second, I added the following code beneath (and outside the scope of) the `populate_db` function in the same file.
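That second addition is presumably just an entry point, so the file can be executed directly:

```python
if __name__ == "__main__":
    # Run the populate step directly, triggering the read-back above.
    populate_db()
```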
Finally, in the `src/data/db/models.py` and `src/data/db/database.py` files, I replaced the definition of the DB path with `DB_PATH = os.getenv("DB_PATH")`. Together, this code reads the newly-added printer information from the database, which we compare to the table on the Cornell website for accuracy. Note: the scraper skips over the last 8 rows of the table, as those printers do not have an associated location.
To execute this code, I ran the following command in the terminal, which creates, populates, and reads from a test database.
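With `DB_PATH` now read from the environment, that command was presumably something like the following (the test database path is illustrative):

```sh
DB_PATH=./test.db npm run migrate && DB_PATH=./test.db npm run populate:db
```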
If everything works, you should see in the terminal a mapping of each printer's `printer_id` to the corresponding printer's labels, all read from the test database.