This repository was archived by the owner on Jan 23, 2022. It is now read-only.

Conversation

@jjccharles (Contributor)

No description provided.

@jjccharles (Contributor, Author) left a comment:

Good stuff my man. It would be nice to change a few style points so we can collectively iterate on this as a more robust data source to test with.

Maybe also remove the commented-out code if it's not used anymore, but otherwise this looks almost ready to go.


# To go through each letter's links for courses
def run_course():
    alphabet = ['A','B','C','D','E','F','G','H','I','L','M','N','O','P','R','S','T','V','Y','Z']
@jjccharles (Contributor, Author):

Hard-coded constants should preferably go at the top of the file.


You can import this: https://docs.python.org/3/library/string.html#string.ascii_uppercase

from string import ascii_uppercase
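A minimal sketch of the suggestion, building the course letter list from the stdlib instead of hard-coding it. The excluded letters below are inferred from the original list (it skips J, K, Q, U, W, and X); whether that is the right set for the handbook is an assumption.

```python
from string import ascii_uppercase

# Letters missing from the hard-coded list in the diff above;
# presumably no course codes start with these.
EXCLUDED = set('JKQUWX')
ALPHABET = [c for c in ascii_uppercase if c not in EXCLUDED]
```

This reproduces the original twenty-letter list while making the exclusions explicit and keeping the constant at module level.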

@jjccharles (Contributor, Author):

I think I brought this up with him in person some time ago; he isn't using the whole alphabet, for a reason that eludes me now.

def run_course():
    alphabet = ['A','B','C','D','E','F','G','H','I','L','M','N','O','P','R','S','T','V','Y','Z']

    for letter in alphabet[0:2]:
@jjccharles (Contributor, Author):

Does it die if you let it run over the whole alphabet? This is fine for keeping some sample data, but it would be nice to have it fetch as much as possible.

print(spec_link)
print("")

run_course()
@jjccharles (Contributor, Author):

Let's put this in an if __name__ == '__main__': block.
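A minimal sketch of the suggested guard: the scrape only runs when the file is executed directly, not when the module is imported. The body of run_course here is a placeholder, not the real scraper.

```python
def run_course():
    # Placeholder for the real scraping logic
    return ["sample course data"]

if __name__ == '__main__':
    run_course()
```

This also makes it possible to import run_course from a test or another script without kicking off a scrape.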

##### SPECIALISATIONS (WIP) #####

def run_spec():
    alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S','T','V','W']
@jjccharles (Contributor, Author):

Same as above: hard-coded constants should go at the top of the file.

def run_spec():
    alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S','T','V','W']

    for letter in alphabet[0:2]:
@jjccharles (Contributor, Author):

Same comment about the [0:2] range as for run_course.

@jjccharles jjccharles requested a review from LiberoHS September 20, 2019 00:30
course_soup = BeautifulSoup(response.text, "html.parser")

# Do webscraping
tr = course_soup.find_all('tr')
Contributor:

This searches for the table rows because all the courses share a pattern. TBH it should use regex to make it more efficient.
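A hedged sketch of the regex idea: pull course codes straight out of the raw HTML instead of walking every table row. The row markup and the four-letters-plus-four-digits code format are assumptions about what the handbook pages look like.

```python
import re

# Stand-in for the fetched page; the real markup may differ.
html = (
    '<tr><td><a href="/COMP1511.html">Programming Fundamentals</a></td></tr>'
    '<tr><td><a href="/MATH1131.html">Mathematics 1A</a></td></tr>'
)

# One pass over the raw text, no per-row traversal needed.
codes = re.findall(r'/([A-Z]{4}\d{4})\.html', html)
```

Note that BeautifulSoup's find_all also accepts a compiled pattern (e.g. href=re.compile(...)), so the two approaches can be combined rather than replaced.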


# Do webscraping
tr = course_soup.find_all('tr')
for i in range(1,3):
Contributor:

for i in range(1, len(tr)) instead, but again this will probably change with a regex implementation.
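A quick sketch of that fix: iterate from the second row to the end of the table instead of the hard-coded range(1, 3), so no data rows are dropped. The list below stands in for the result of find_all('tr').

```python
# Stand-in for course_soup.find_all('tr'); the first entry is the header row.
tr = ['header row', 'row 1', 'row 2', 'row 3']

# Skip the header, take every remaining row.
data_rows = [tr[i] for i in range(1, len(tr))]
```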


print_course(code, link, name, cred)

# Go to course link and scrape the data
Contributor:

This enters each course's link and scrapes all the relevant data. Some things, like term offerings, are still not done because they require some extra tests.
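A hypothetical sketch of the per-course step: given a fetched course page, extract the fields that get passed to print_course in the diff. The sample markup and the "UOC" credit label are assumptions, not the real handbook structure.

```python
import re

def parse_course(html):
    # Course code: four uppercase letters followed by four digits (assumed format)
    code = re.search(r'([A-Z]{4}\d{4})', html).group(1)
    # Credit points: a number followed by "UOC" (assumed label)
    cred = re.search(r'(\d+)\s*UOC', html).group(1)
    return code, cred

code, cred = parse_course('<h1>COMP1511</h1><p>6 UOC</p>')
```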

spec_link = spec_td[0].find_all('a')[0]['href']
print(spec_name)
print(spec_link)
print("")
Contributor:

It needs to collect the courses inside the specialisation links, but they use different structures, which is why I got a bit stuck.

@LiberoHS (Contributor) left a comment:

Initially, I used loops and BeautifulSoup to work through the scraping based on the knowledge I had. I'm a bit more proficient now and could pick up regex more easily, so the implementation should be updated to reflect that.

At the moment, the script will stop at the end of a letter (e.g. AVIA4002) because I haven't found a way to differentiate the table rows. An easy fix is a condition that checks class names, but with regex it shouldn't be a problem and the scraper can move on to the next letter.
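A sketch of the class-name check mentioned above: keep only the rows whose class marks them as course data, so the loop skips navigation rows instead of stopping at them. The "data" class name and the row markup are guesses at what the real pages use.

```python
import re

# Stand-in for a letter's page: two course rows around a navigation row.
html = (
    '<tr class="data"><td>AVIA4002</td></tr>'
    '<tr class="nav"><td>next page</td></tr>'
    '<tr class="data"><td>BABS1201</td></tr>'
)

# Only rows with the assumed "data" class contribute course codes.
courses = re.findall(r'<tr class="data"><td>([A-Z]{4}\d{4})</td></tr>', html)
```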

@jjccharles jjccharles merged commit a4cbb04 into master Sep 20, 2019
@jjccharles jjccharles deleted the webscraper branch September 20, 2019 01:41