Refactor Web Scraping for Printers on Campus #360
Overview
In this commit, I:

- refactor the printer scraper to read from the API request that populates the page, rather than parsing HTML with BeautifulSoup;
- standardize scraped printer locations against a canonical list of building names;
- introduce labels as a new field of information for each printer;
- introduce migrations to the application's database; and
- add scripts to run migrations and populate the database.
Changes Made
Change 1: Printer information is retrieved from the API request that populates the page
Previously, the application used BeautifulSoup to parse the HTML of the webpage that displays details about each printer on campus. With this change, I transition the application from HTML scraping to API scraping: details about each printer are now read from the JSON object returned by the API call that populates the page's table, instead of from the rendered webpage. This makes our information retrieval cleaner, less brittle (the page's markup can change without breaking the scraper), and more efficient.
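A minimal sketch of the new approach, assuming a hypothetical endpoint URL and JSON field names (the real ones come from whatever request the printers page issues to fill its table):

```python
import requests

# Hypothetical endpoint; the real URL is whatever request the printers
# page makes to populate its HTML table.
PRINTERS_API_URL = "https://example.cornell.edu/api/printers"

def scrape_printers() -> list[dict]:
    """Fetch printer details from the JSON response instead of parsing HTML."""
    response = requests.get(PRINTERS_API_URL, timeout=10)
    response.raise_for_status()
    # Field names here are assumptions about the JSON shape.
    return [
        {
            "name": row["name"],
            "location": row["location"],
            "capabilities": row["capabilities"],
        }
        for row in response.json()
    ]
```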
Change 2: Printer location information is standardized
To account for minor "mutations" in the scraped data, I also used the `difflib` Python library to map scraped building names to a canonical list of building names, ensuring that all scraped locations are standardized. This ensures that if "Baker Lab CLOSED FOR CONSTRUCTION," for example, is scraped from the website, we still only see "Baker Lab" in the actual application.
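This is roughly the shape of that mapping; the canonical list below is a stand-in, and the cutoff is a tunable assumption chosen so that trailing annotations like "CLOSED FOR CONSTRUCTION" still match:

```python
import difflib

# Stand-in canonical list; the real one names every campus building.
CANONICAL_BUILDINGS = ["Baker Lab", "Mann Library", "Olin Library"]

def standardize_location(scraped_name: str) -> str | None:
    """Map a scraped building name onto the canonical list, or None."""
    matches = difflib.get_close_matches(
        scraped_name, CANONICAL_BUILDINGS, n=1, cutoff=0.4
    )
    return matches[0] if matches else None

# e.g. standardize_location("Baker Lab CLOSED FOR CONSTRUCTION") == "Baker Lab"
```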
Change 3: Introduce labels as another field of information for each printer

To also implement labels for each printer, I include "Labels" as another field in each object in the list returned from the `scrape_printers` function. To populate this field, I created a canonical list of labels (which, notably, only accounts for "Residents Only," "AA&P Students Only," and "Landscape Architecture Students Only," and is unlikely to be exhaustive) and then used the `difflib` Python library to recognize any canonical labels in the parsed data. I also include printer capabilities as labels, meaning that "Color," "Black & White," and "Color, Scan, & Copy" are labels as well.

For this information to be stored in our SQLite database, I also introduced two new tables: a "labels" table, which stores the unique labels a given printer may have, and a "printer_labels" table, a junction table mapping each printer to its corresponding labels.
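A sketch of the two tables, with illustrative column names (the real definitions live in the migration files introduced in Change 4):

```python
import sqlite3

# Illustrative schema; column names are assumptions. "labels" holds each
# unique label once, and "printer_labels" is the junction table tying a
# printer to any number of labels.
SCHEMA = """
CREATE TABLE IF NOT EXISTS labels (
    label_id INTEGER PRIMARY KEY,
    label    TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS printer_labels (
    printer_id INTEGER NOT NULL,
    label_id   INTEGER NOT NULL REFERENCES labels(label_id),
    PRIMARY KEY (printer_id, label_id)
);
"""

if __name__ == "__main__":
    with sqlite3.connect("printers.db") as connection:  # path illustrative
        connection.executescript(SCHEMA)
```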
Finally, for this information to be retrieved via our API, I updated the `fetchAllPrinters` function found in `EcosystemUtils.js` to also return a list of labels for each returned printer.
Change 4: Introduce migrations to the application's database

Previously, the application's backend lacked a structured method for modifying the schema of the database after its initial creation, which prevented me from introducing the `labels` and `printer_labels` tables without reinitializing the database. To account for this, I used the `better-sqlite3` library to implement migrations, creating a directory that stores the changes made to the database as `.sql` files. Note: the naming of each migration file follows the convention `YYYYMMDD_HHMM_<change>`.
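The real runner is `run-migrations.js` (see Change 5) and uses `better-sqlite3`; sketched here in Python with assumed paths and table names, the logic is to record applied migrations in a bookkeeping table and apply any unapplied `.sql` files in filename order, which the timestamp prefix makes chronological:

```python
import os
import sqlite3

MIGRATIONS_DIR = "src/data/db/migrations"  # assumed location

def run_migrations(db_path: str) -> None:
    connection = sqlite3.connect(db_path)
    # Bookkeeping table recording which migrations have been applied.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in connection.execute("SELECT name FROM migrations")}
    # The YYYYMMDD_HHMM_<change>.sql convention makes lexicographic
    # order equal chronological order.
    for name in sorted(os.listdir(MIGRATIONS_DIR)):
        if not name.endswith(".sql") or name in applied:
            continue
        with open(os.path.join(MIGRATIONS_DIR, name)) as f:
            connection.executescript(f.read())
        connection.execute("INSERT INTO migrations (name) VALUES (?)", (name,))
        connection.commit()
```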
Change 5: Add scripts to run migrations and populate database

I implement two new scripts to set up the application's database: `npm run migrate` and `npm run populate:db`. `npm run migrate` calls the `run-migrations.js` file, which executes a function to apply all migrations (that haven't already been applied) to the database. `npm run populate:db` calls the `populate_db.py` file to execute the scrapers for the libraries and printers, and fills the database with the newly scraped data.
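On a fresh checkout, database setup is therefore:

```sh
npm run migrate      # apply any unapplied .sql migrations
npm run populate:db  # run the library and printer scrapers, fill the database
```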
Test Coverage

To test the refactored web scraping, the label parsing and mapping, and the location parsing and mapping, I ran `src/data/scrapers/printers.py` as a module to ensure that each location was mapped to a name in the canon, and that the correct labels were assigned to each printer. To do this, I pasted the following code at the bottom of my file.
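In sketch form (the key names are assumptions about each scraped object's shape), the harness prints every printer's standardized location and labels for manual comparison:

```python
if __name__ == "__main__":
    # Print each scraped printer's mapped location and labels so they
    # can be checked against the Cornell site by hand.
    for printer in scrape_printers():
        print(printer["location"], "->", printer["labels"])
```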
Then, I moved my working directory to the `scrapers` folder, and ran the file using the following command.
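Run as a module from the `scrapers` folder, the invocation is presumably along these lines:

```sh
python -m printers
```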
To test the migrations, I first added the following code to the bottom of the `populate_db` function in the `src/data/scripts/populate_db.py` file.
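A sketch of that addition, assuming `populate_db` holds an open sqlite3 connection named `connection` and the table and column names from the schema sketch above; it reads each printer's labels back out through the junction table:

```python
    # Read the freshly inserted data back out: one row per printer with
    # its labels concatenated, for comparison against the website.
    cursor = connection.execute(
        """
        SELECT p.printer_id, GROUP_CONCAT(l.label, ', ')
        FROM printers p
        LEFT JOIN printer_labels pl ON pl.printer_id = p.printer_id
        LEFT JOIN labels l ON l.label_id = pl.label_id
        GROUP BY p.printer_id
        """
    )
    for printer_id, labels in cursor.fetchall():
        print(printer_id, "->", labels)
```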
Second, I added the following code beneath (and outside the scope of) the `populate_db` function in the same file.
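That second addition is presumably just an entry point, so the file can be executed directly:

```python
if __name__ == "__main__":
    # Run the populate step directly, triggering the read-back above.
    populate_db()
```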
Finally, in the `src/data/db/models.py` and `src/data/db/database.py` files, I replaced the definition of the DB path with `DB_PATH = os.getenv("DB_PATH")`. Together, this code reads the newly-added printer information from the database, which we compare to the table on the Cornell website for accuracy. Note: the scraper skips over the last 8 rows of the table, as those printers do not have an associated location.
To execute this code, I ran the following command in the terminal, which creates, populates, and reads from a test database.
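With `DB_PATH` now read from the environment, that command was presumably something like the following (the test database path is illustrative):

```sh
DB_PATH=./test.db npm run migrate && DB_PATH=./test.db npm run populate:db
```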
If everything works, you should see in the terminal a mapping of each printer's `printer_id` to the corresponding printer's labels, all read from the test database.