Webscraper #4
Conversation
Good stuff, my man. It would be nice to change some style bits so we can collectively iterate on it as a more robust data source to test with.
Maybe also remove commented-out code if it's not used anymore, but otherwise it looks almost ready to go.
server/scripts/webscraper.py
Outdated

    # To go through each letter's links for courses
    def run_course():
        alphabet = ['A','B','C','D','E','F','G','H','I','L','M','N','O','P','R','S','T','V','Y','Z']
Hard-coded constants should preferably go at the top of the file.
You can import this: https://docs.python.org/3/library/string.html#string.ascii_uppercase
    from string import ascii_uppercase
I think I brought this up with him in person some time ago; he's not using the whole alphabet for some reason that eludes me now.
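For reference, a minimal sketch of the suggested import as a top-of-file constant; the excluded letters here are inferred from the hard-coded list in the diff, not confirmed against the site:

```python
from string import ascii_uppercase

# Letters with no course listings, inferred from the hard-coded list (assumption)
EXCLUDED = set('JKQUWX')

# Module-level constant instead of a list literal inside run_course()
COURSE_LETTERS = [c for c in ascii_uppercase if c not in EXCLUDED]
```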
server/scripts/webscraper.py
Outdated

    def run_course():
        alphabet = ['A','B','C','D','E','F','G','H','I','L','M','N','O','P','R','S','T','V','Y','Z']

        for letter in alphabet[0:2]:
Does it die if you let it run over the whole alphabet? This should be fine for keeping some sample data, but it would be nice to have it grab as much as possible.
server/scripts/webscraper.py
Outdated

    print(spec_link)
    print("")

    run_course()
Let's put this in an if __name__ == '__main__': block.
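A minimal sketch of the suggested guard; the body of run_course is elided here and the return value is just a placeholder:

```python
def run_course():
    # scraping logic lives here; placeholder return for illustration
    return "scraped"

# Only run the scraper when executed directly, not when imported
if __name__ == '__main__':
    run_course()
```

This way another script (e.g. a test or a data-import job) can import the module without kicking off a full scrape.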
server/scripts/webscraper.py
Outdated

    ##### SPECIALISATIONS (WIP) #####

    def run_spec():
        alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S','T','V','W']
Hard-coded constants at the top of the file, as above.
server/scripts/webscraper.py
Outdated

    def run_spec():
        alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S','T','V','W']

        for letter in alphabet[0:2]:
Same range comment as for run_course.
    course_soup = BeautifulSoup(response.text, "html.parser")

    # Do webscraping
    tr = course_soup.find_all('tr')
This searches for the table rows because all the courses share a pattern. Tbh it should use regex to make it more efficient.
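A sketch of the regex idea on a made-up HTML fragment; the four-letters-four-digits pattern for course codes is an assumption, and the real handbook markup will differ:

```python
import re

# Hypothetical handbook-style table fragment
html = """
<tr><th>Course</th></tr>
<tr><td><a href="/COMP1511">COMP1511</a></td></tr>
<tr><td><a href="/AVIA4002">AVIA4002</a></td></tr>
"""

# Course codes follow a four-letters-four-digits pattern (assumed)
COURSE_CODE = re.compile(r'\b([A-Z]{4}\d{4})\b')

# Match codes directly instead of walking every <tr> with BeautifulSoup;
# each code appears in both the href and the link text, so dedupe
codes = sorted(set(COURSE_CODE.findall(html)))
```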
    # Do webscraping
    tr = course_soup.find_all('tr')
    for i in range(1,3):
for i in range(1, len(tr)), but again it will probably change with a regex implementation.
    print_course(code, link, name, cred)

    # Go to course link and scrape the data
This enters each course's link and scrapes all the relevant data. Some things, like term offerings, are still not done because they require some extra tests.
    spec_link = spec_td[0].find_all('a')[0]['href']
    print(spec_name)
    print(spec_link)
    print("")
It needs to collect the courses inside the specialisation links, but they use different structures, which is why I got a bit stuck.
Initially, I used loops and BeautifulSoup to work through the scraping based on the knowledge I had. I have a bit more proficiency now and could pick up regex more easily, so the implementation should be updated to reflect that.
At the moment, the script will stop at the end of each letter (e.g. AVIA4002) because it hasn't found a way to differentiate the table rows. An easy fix is to add a condition checking class names, but with regex it shouldn't be a problem, and it can move on to the next letter.
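A sketch of the class-name check mentioned above, on a made-up fragment; the class names here are invented for illustration, not the handbook's real markup:

```python
import re

# Hypothetical rows: course rows vs. a section-heading row (class names assumed)
html = ('<tr class="data-row"><td>AVIA4002</td></tr>'
        '<tr class="section-heading"><td>B</td></tr>'
        '<tr class="data-row"><td>BABS1201</td></tr>')

# Keep only rows whose class marks a course entry, skipping heading rows
ROW = re.compile(r'<tr class="data-row"><td>(\w+)</td></tr>')
courses = ROW.findall(html)
```

With a filter like this the scraper can skip the separator row at the end of a letter and continue into the next one.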
No description provided.