Web scraper for consolidating emails from various UT Austin departments. The emails are collected using the Beautiful Soup and Selenium Python libraries.
- Implementing the Selenium Edge Driver
- Extracting emails from all the pages (where it can be done without selenium)
- Figuring out the In-N-Out Selenium navigation method
- Extracting emails from most of the Liberal Arts Directories
- Adding all the emails to the drive
- Figuring out how to automate the email sending process (?)
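For the pages that can be scraped without Selenium, the email-extraction step can be sketched with a plain regular expression over the fetched HTML. This is an illustrative sketch, not the repo's exact code; the pattern and helper name are assumptions:

```python
import re

# Basic email pattern; real directory pages may need extra special-case
# keywords, as the messier faculty pages described below show.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(html: str) -> list[str]:
    """Return unique emails found in a page's HTML, in first-seen order."""
    seen: list[str] = []
    for match in EMAIL_RE.findall(html):
        if match not in seen:
            seen.append(match)
    return seen

page = '<li><a href="mailto:jdoe@austin.utexas.edu">Jane Doe</a></li>'
print(extract_emails(page))  # ['jdoe@austin.utexas.edu']
```

The same function works on text pulled out of a Beautiful Soup tree or out of Selenium's `page_source`, which is why the regex approach covers both paths.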
- Cockrell School of Engineering
- Jackson School of Geosciences
- College of Fine Arts
- Texas School of Law
- School of Nursing
- Steve Hicks School of Social Work
- College of Natural Sciences
- School of Architecture
- School of Information
- School of Education
- College of Pharmacy
- McCombs School of Business
- Dell Medical School
- LBJ School of Public Affairs
- Moody College of Communication
- African Studies
- Air & Space Force Science
- American Studies
- Anthropology
- Asian Studies
- Classics
- Economics
- English
- French and Italian
- Geography & Environment
- Germanic Studies
- Government
- History
- Linguistics
- Mexican American and Latina/o Studies
- Middle Eastern Studies
- Military Science
- Naval Science
- Philosophy
- Psychology
- Religious Studies
- Rhetoric and Writing
- Slavic & Eurasian Studies
- Sociology
- Spanish & Portuguese
- College of Education
- McCombs School of Business
- College of Pharmacy
- Dell Medical School
- LBJ School of Public Affairs
- Moody College of Communication
- School of Architecture
- School of Information
- Cockrell School of Engineering
- College of Fine Arts
- College of Natural Sciences
- Jackson School of Geosciences
- Graduate School Staff
- School of Law
- School of Nursing
- Steve Hicks School of Social Work
- The College of Liberal Arts
The College of Liberal Arts encapsulates multiple departments (Anthropology, History, Linguistics, etc.). Each individual department can be parsed without Selenium; however, visiting each faculty page in a timely manner is best done with Selenium.
The actual code for the faculty pages that REQUIRE Selenium is fairly messy, since there are few uniform naming conventions across directories, so a lot of special keywords (particularly in the form of regular expressions) had to be used in those cases. As for the files: Driver.py is essentially the 'main' file here, handling most of our web driver operations. HTMLParser.py contains a helper object that makes the code easier to follow, while CleanFile.py cleans up the files we've parsed, i.e., removing duplicate emails and clearing empty space.
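As an illustration of the kind of cleanup CleanFile.py handles, here is a minimal sketch; the function name and exact rules are assumptions, not the repo's actual API:

```python
def clean_lines(lines: list[str]) -> list[str]:
    """Drop blank lines and duplicate emails (case-insensitive), keeping
    the first occurrence of each address in its original order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for line in lines:
        entry = line.strip()
        key = entry.lower()
        if entry and key not in seen:
            seen.add(key)
            cleaned.append(entry)
    return cleaned

raw = ["jdoe@utexas.edu", "", "JDOE@utexas.edu", "asmith@utexas.edu"]
print(clean_lines(raw))  # ['jdoe@utexas.edu', 'asmith@utexas.edu']
```

Deduplicating case-insensitively matters here because directory pages are inconsistent about capitalization, while the mail system treats the addresses as the same.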
- Using Google Takeout to extract emails with the proper tag
- Learning the mbox format's structure
- Writing a parser that streams through the mbox file
- Extract names and emails
- Writing extracted results to a `.csv` file
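The steps above can be sketched with Python's standard library: `mailbox` iterates through the mbox file message by message, `email.utils.parseaddr` splits a From header into name and address, and `csv` writes the results. Paths, the function name, and the column names below are illustrative, not the project's actual interface:

```python
import csv
import mailbox
from email.utils import parseaddr

def mbox_to_csv(mbox_path: str, csv_path: str) -> int:
    """Walk an mbox file and write one (name, email) row per message.

    Returns the number of rows written (header excluded).
    """
    count = 0
    box = mailbox.mbox(mbox_path)
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "email"])
        for message in box:
            # parseaddr("Jane Doe <jdoe@utexas.edu>") -> ("Jane Doe", "jdoe@utexas.edu")
            name, addr = parseaddr(message.get("From", ""))
            if addr:
                writer.writerow([name, addr])
                count += 1
    return count
```

Note that `mailbox.mbox` reads messages on demand rather than loading the whole archive, which keeps memory use manageable for a large Takeout export.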