Web scraper, data analysis, and website for the best classes at Harvard.
Some findings from the analysis:

- Course ratings correlate well with recommendation scores.
- Course ratings also correlate well with lecturer scores, but with more scatter.
- Sentiment analysis on the course comments also agrees well with the average course rating.
- Most high-scoring courses have low workloads.
- Harvard classes tend to have high ratings; it is rare to get a low score.
- Most Harvard classes have a workload demand of around 5 hours per week outside of classes, though the distribution is skewed, so some classes have much higher workloads.
- There is little correlation between the number of students in a class and its score.
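If you want to sanity-check these findings yourself, a minimal sketch along the following lines should work once you have `verbose_course_ratings.csv`; the column names used here (`course_score`, `recommendation_score`, `workload`, `num_students`) are assumptions and may differ from the actual CSV headers.

```python
# Minimal sketch for re-checking the correlations described above.
# NOTE: the column names are assumptions; inspect the real headers in
# verbose_course_ratings.csv and adjust accordingly.
import pandas as pd

df = pd.read_csv("verbose_course_ratings.csv")

# Course rating vs. recommendation score (expected: strong correlation)
print(df["course_score"].corr(df["recommendation_score"]))

# Weekly workload distribution (expected: right-skewed, median near 5 hours)
print(df["workload"].describe())

# Class size vs. course rating (expected: little correlation)
print(df["num_students"].corr(df["course_score"]))
```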
More analysis and the code for the graphs can be found through this Colab Notebook. A copy of the notebook is also available in the repo above as `course_ratings_analysis.ipynb`. Remember to upload `verbose_course_ratings.csv` if you hope to tinker around.
The code for the website can be found at this repo; the repo you are in now is for the scraping and analytics.
If you use a virtual environment, please specify Python 3.11 for NumPy compatibility:

```bash
conda create -n harvard-gems python=3.11
conda activate harvard-gems
```

Then install the requirements:

```bash
pip install -r requirements.txt
```
You probably don't need to follow the steps below since the results can be found at `verbose_course_ratings.csv`, but this is a step-by-step guide on how to create that CSV from scratch.
- Download the webpage from the link in `scrapper.py` as an HTML-only file named `QReports.html`. Run `scrapper.py` to scrape the links to the QGuides for each course. The generated links will be stored at `courses.csv` (see the parsing sketch after this list).
- Visit any QGuide link scraped into `courses.csv` to get the cookie (see the code of `downloader.py` for the search term `secret_cookie`) and paste it into a new file named `secret_cookie.txt`. Note that the current VS Code CSV reader will open a defunct link with other columns suffixed; manually copy the link and paste it into your browser instead. Run `downloader.py` to download all the QGuides with the links scraped in the previous step (a sketch of the cookie-based download is shown after this list). The QGuides will be stored in the folder `QGuides`.
- Run `analyzer.py` to generate `course_ratings.csv`.
- Now we have to add details like the divisional requirement or whether a class fulfils quantitative reasoning with data (QRD), but most importantly we need to know whether the class is offered in Fall 2024 (the QGuides are for Fall 2023). First clear `not-offered.txt`. You might need to install Selenium for this step; first try running `myharvarddriver.py`, which uses Selenium to get these details from my.harvard.edu. Depending on your machine, you might need more setup to use Selenium, so check out the official guide or see my notes below. Make sure to edit `driver_path` (a minimal driver-setup sketch is shown after this list). The webpage for each class will be stored as an HTML file in the folder `myharvard`. This is the step that takes the longest (around 1.5 hours); I usually leave it running overnight.
  - If you need more setup, download it here. Remember to download the chromedriver and not Chrome.
  - Run this in the folder containing `chromedriver`: `xattr -d com.apple.quarantine chromedriver`
  - You should be able to run it now.
- New in Fall 2024, some classes have sections to be chosen during registration, like CHNSE 130 and EXPOS 40. Run `rescrape.py` to handle these cases, which require an additional click.
  - Simply rerun the file if there are errors; it will pick up the courses that are not yet done.
- Process these webpages to get the data by running `append_details.py`. This will generate `verbose_course_ratings.csv` as required.
- Start a Jupyter notebook session (`jupyter notebook`) and choose `course_ratings_analysis.ipynb` to run. This will generate the graphs above and the data at `output_data`. Follow through the notebook and play around!
- If you are maintaining this repo, then please make a folder under `archive` for the upcoming semester and put the following files there: `course_ratings.csv`, `courses.csv`, `not-offered.txt`, `QReports.html`, `verbose_course_ratings.csv`.
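As referenced in the first step above, the link-scraping stage essentially pulls anchor tags out of the saved `QReports.html` and writes them to a CSV. The snippet below is only a rough sketch of that idea, not the actual code in `scrapper.py`; in particular, the lack of link filtering and the CSV column names are assumptions.

```python
# Rough sketch of the link-extraction idea behind scrapper.py.
# NOTE: the (absent) link filtering and the CSV columns are assumptions;
# see scrapper.py for the real logic.
import csv
from bs4 import BeautifulSoup

with open("QReports.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

rows = []
for a in soup.find_all("a", href=True):
    # Assume the link text is the course name and the href is the QGuide URL.
    rows.append((a.get_text(strip=True), a["href"]))

with open("courses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["course", "link"])
    writer.writerows(rows)
```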
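The cookie step amounts to replaying your authenticated browser session when fetching each QGuide. Below is a hedged sketch of that pattern; sending the raw `Cookie` header and the output file naming are assumptions, so defer to `downloader.py` for the actual details.

```python
# Sketch of downloading one QGuide page with the saved session cookie.
# NOTE: the Cookie header usage and file naming are assumptions;
# downloader.py is the source of truth.
import os
import requests

with open("secret_cookie.txt") as f:
    secret_cookie = f.read().strip()

os.makedirs("QGuides", exist_ok=True)

def download_qguide(url: str, name: str) -> None:
    # Replay the browser session by sending the captured Cookie header.
    resp = requests.get(url, headers={"Cookie": secret_cookie}, timeout=30)
    resp.raise_for_status()
    with open(os.path.join("QGuides", f"{name}.html"), "w", encoding="utf-8") as out:
        out.write(resp.text)
```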
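If Selenium is new to you, the driver wiring behind `myharvarddriver.py` looks roughly like the sketch below; `driver_path` and the URL are placeholders you will need to edit for your machine, and the real script loops over every course and saves each page into `myharvard`.

```python
# Minimal Selenium setup sketch for the my.harvard step.
# NOTE: driver_path and the URL are placeholders; edit them as needed,
# and see myharvarddriver.py for the full scraping loop.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver_path = "/path/to/chromedriver"  # edit this, as noted above

driver = webdriver.Chrome(service=Service(executable_path=driver_path))
try:
    driver.get("https://my.harvard.edu/")  # placeholder page
    html = driver.page_source  # in the real script, saved under myharvard/
finally:
    driver.quit()
```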
I have been told that it is possible to scrape the classes from my.harvard using an API (e.g. by looking at the requests at the Network tab) without authentication.
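I have not verified this myself, but the usual workflow would be to copy such a request out of the Network tab and replay it with `requests`. The sketch below is purely hypothetical: the endpoint and parameters are placeholders, not a documented my.harvard API.

```python
# Hypothetical sketch of replaying a request captured from the browser's
# Network tab. The URL and parameters are placeholders, NOT a real endpoint.
import requests

resp = requests.get(
    "https://my.harvard.edu/REPLACE_WITH_ENDPOINT_FROM_NETWORK_TAB",
    params={"term": "Fall 2024"},  # placeholder query parameters
    timeout=30,
)
resp.raise_for_status()
content_type = resp.headers.get("Content-Type", "")
print(resp.json() if "json" in content_type else resp.text[:500])
```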