Webscraper #4
Conversation
Good stuff, my man. It would be nice to change some style bits so we can collectively iterate on it as a more robust data source to test with.
Maybe also remove commented-out code if it's not used anymore, but otherwise it looks almost ready to go.
server/scripts/webscraper.py
Outdated

    # To go through each letter's links for courses
    def run_course():
        alphabet = ['A','B','C','D','E','F','G','H','I','L','M','N','O','P','R','S','T','V','Y','Z']
Hard-coded constants should preferably go at the top of the file.
You can import this: https://docs.python.org/3/library/string.html#string.ascii_uppercase
    from string import ascii_uppercase
I think I brought this up with him in person some time ago; he's not using the whole alphabet for some reason that eludes me now.
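For reference, a minimal sketch of the suggested import as a top-of-file constant; the excluded letters here are inferred from the hard-coded list in the diff, not confirmed against the site:

```python
from string import ascii_uppercase

# Letters with no course listings, inferred from the hard-coded list (assumption)
EXCLUDED = set('JKQUWX')

# Module-level constant instead of a list literal inside run_course()
COURSE_LETTERS = [c for c in ascii_uppercase if c not in EXCLUDED]
```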
server/scripts/webscraper.py
Outdated

    def run_course():
        alphabet = ['A','B','C','D','E','F','G','H','I','L','M','N','O','P','R','S','T','V','Y','Z']

        for letter in alphabet[0:2]:
Does it die if you let it run over the whole alphabet? This should be fine for keeping some sample data, but it would be nice to have it grab as much as possible.
server/scripts/webscraper.py
Outdated

    print(spec_link)
    print("")

    run_course()
Let's put this in an if __name__ == '__main__': block.
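A minimal sketch of the suggested guard; the body of run_course is elided here and the return value is just a placeholder:

```python
def run_course():
    # scraping logic lives here; placeholder return for illustration
    return "scraped"

# Only run the scraper when executed directly, not when imported
if __name__ == '__main__':
    run_course()
```

This way another script (e.g. a test or a data-import job) can import the module without kicking off a full scrape.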
server/scripts/webscraper.py
Outdated

    ##### SPECIALISATIONS (WIP) #####

    def run_spec():
        alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S','T','V','W']
Hard-coded constants at the top of the file, as above.
server/scripts/webscraper.py
Outdated

    def run_spec():
        alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S','T','V','W']

        for letter in alphabet[0:2]:
Same range comment as for run_course.
    course_soup = BeautifulSoup(response.text, "html.parser")

    # Do webscraping
    tr = course_soup.find_all('tr')
This searches for the table rows because all the courses share a pattern. Tbh it should use regex to make it more efficient.
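A sketch of the regex idea on a made-up HTML fragment; the four-letters-four-digits pattern for course codes is an assumption, and the real handbook markup will differ:

```python
import re

# Hypothetical handbook-style table fragment
html = """
<tr><th>Course</th></tr>
<tr><td><a href="/COMP1511">COMP1511</a></td></tr>
<tr><td><a href="/AVIA4002">AVIA4002</a></td></tr>
"""

# Course codes follow a four-letters-four-digits pattern (assumed)
COURSE_CODE = re.compile(r'\b([A-Z]{4}\d{4})\b')

# Match codes directly instead of walking every <tr> with BeautifulSoup;
# each code appears in both the href and the link text, so dedupe
codes = sorted(set(COURSE_CODE.findall(html)))
```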
    # Do webscraping
    tr = course_soup.find_all('tr')
    for i in range(1,3):
for i in range(1, len(tr)), but again it will probably change with a regex implementation.
    print_course(code, link, name, cred)

    # Go to course link and scrape the data
This enters each course's link and scrapes all the relevant data. Some things, like term offerings, are still not done because they require some extra tests.
    spec_link = spec_td[0].find_all('a')[0]['href']
    print(spec_name)
    print(spec_link)
    print("")
It needs to collect the courses inside the specialisation links, but they use different structures, which is why I got a bit stuck.
Initially, I used loops and BeautifulSoup to work through the scraping based on the knowledge I had. I have a bit more proficiency now and could pick up regex more easily, so the implementation should be updated to reflect that.
At the moment, the script will stop at the end of each letter (e.g. AVIA4002) because it hasn't found a way to differentiate the table rows. An easy fix is to add a condition checking class names, but with regex it shouldn't be a problem, and it can move on to the next letter.
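A sketch of the class-name check mentioned above, on a made-up fragment; the class names here are invented for illustration, not the handbook's real markup:

```python
import re

# Hypothetical rows: course rows vs. a section-heading row (class names assumed)
html = ('<tr class="data-row"><td>AVIA4002</td></tr>'
        '<tr class="section-heading"><td>B</td></tr>'
        '<tr class="data-row"><td>BABS1201</td></tr>')

# Keep only rows whose class marks a course entry, skipping heading rows
ROW = re.compile(r'<tr class="data-row"><td>(\w+)</td></tr>')
courses = ROW.findall(html)
```

With a filter like this the scraper can skip the separator row at the end of a letter and continue into the next one.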
No description provided.