merging in updates from ucsc-data (end of sprint) #44

Merged
merged 62 commits on Jul 25, 2018
Commits (62)
da06bdb
refactored crawler
SeijiEmery Jul 20, 2018
498585b
updated output to include all data
SeijiEmery Jul 20, 2018
fc56a79
wrote function to grab registrar index urls
SeijiEmery Jul 23, 2018
75225da
wrote course page fetcher
SeijiEmery Jul 23, 2018
7aa7003
started ucsc course parser
SeijiEmery Jul 23, 2018
82fd99e
started parsing course descrips
SeijiEmery Jul 23, 2018
040f929
rewrote fetch_course_pages to handle whitespace (and unicode) correctly
SeijiEmery Jul 23, 2018
4ba8118
cleanup; fully switched to python3 (was supporting python2 for idk wh…
SeijiEmery Jul 23, 2018
9cd4a93
cleaned up output
SeijiEmery Jul 23, 2018
556f836
started parsing ucsc courses
SeijiEmery Jul 23, 2018
56aec63
parsing course listings...
SeijiEmery Jul 23, 2018
0df6931
fixed annoying edge cases
SeijiEmery Jul 23, 2018
4cc1aab
more stupid edge cases
SeijiEmery Jul 23, 2018
aaaf573
blame the f***ing history dept (and film). acronyms !#$*^#*&!
SeijiEmery Jul 23, 2018
5836057
filtering out <p align="..."> tags failed for some reason for biochem…
SeijiEmery Jul 23, 2018
86251d6
finished edge cases...?!
SeijiEmery Jul 23, 2018
0d02cf4
parsed instructors
SeijiEmery Jul 23, 2018
531989d
added info log of courses by dept / div
SeijiEmery Jul 23, 2018
b0a8c79
working on proper prereq parser
SeijiEmery Jul 23, 2018
9db12d6
fixed index parsing (only was fetching half of all course listings...)
SeijiEmery Jul 23, 2018
70926b6
fixed history edge cases
SeijiEmery Jul 23, 2018
3fc4677
fixed weird edge case where the lit page has nested divs...
SeijiEmery Jul 23, 2018
8e14826
fixed edgecases
SeijiEmery Jul 23, 2018
d2c782a
used a terrible hack to fix output from the sociology page (first cou…
SeijiEmery Jul 23, 2018
d21121c
crappy solution to SPECIFICALLY fix a line break inside of a strong t…
SeijiEmery Jul 24, 2018
098a868
fixed it again b/c someone on the anthropology page fucked up and put…
SeijiEmery Jul 24, 2018
ee9110d
updated to skip comments (was not aware that this WASN'T skipping com…
SeijiEmery Jul 24, 2018
31c6b1f
added a ton of parse cases to filter out unneeded tokens
SeijiEmery Jul 24, 2018
3ebf0da
Merge branch 'development' into ucsc-data
SeijiEmery Jul 25, 2018
e3eb1a2
refactored crawler
SeijiEmery Jul 20, 2018
162aa84
updated output to include all data
SeijiEmery Jul 20, 2018
dd35abd
Merge branch 'firebase' of https://github.com/coursegraph/CourseGraph…
SeijiEmery Jul 25, 2018
d057980
added gitignore for data outputs
SeijiEmery Jul 25, 2018
61580b6
wrote function to grab registrar index urls
SeijiEmery Jul 23, 2018
25ae2b5
wrote course page fetcher
SeijiEmery Jul 23, 2018
7128461
started ucsc course parser
SeijiEmery Jul 23, 2018
457769f
started parsing course descrips
SeijiEmery Jul 23, 2018
85480cf
rewrote fetch_course_pages to handle whitespace (and unicode) correctly
SeijiEmery Jul 23, 2018
df8a0ed
cleanup; fully switched to python3 (was supporting python2 for idk wh…
SeijiEmery Jul 23, 2018
12a8729
cleaned up output
SeijiEmery Jul 23, 2018
9c8761b
started parsing ucsc courses
SeijiEmery Jul 23, 2018
05d4e6e
parsing course listings...
SeijiEmery Jul 23, 2018
e5eb758
fixed annoying edge cases
SeijiEmery Jul 23, 2018
18305e9
more stupid edge cases
SeijiEmery Jul 23, 2018
3fddd5f
blame the f***ing history dept (and film). acronyms !#$*^#*&!
SeijiEmery Jul 23, 2018
8feefa5
filtering out <p align="..."> tags failed for some reason for biochem…
SeijiEmery Jul 23, 2018
084f610
finished edge cases...?!
SeijiEmery Jul 23, 2018
f72256a
parsed instructors
SeijiEmery Jul 23, 2018
0d4933a
added info log of courses by dept / div
SeijiEmery Jul 23, 2018
dd58b5d
working on proper prereq parser
SeijiEmery Jul 23, 2018
24f97ef
fixed index parsing (only was fetching half of all course listings...)
SeijiEmery Jul 23, 2018
97efea0
fixed history edge cases
SeijiEmery Jul 23, 2018
42aca19
fixed weird edge case where the lit page has nested divs...
SeijiEmery Jul 23, 2018
6467bd2
fixed edgecases
SeijiEmery Jul 23, 2018
a20f044
used a terrible hack to fix output from the sociology page (first cou…
SeijiEmery Jul 23, 2018
24c76b4
crappy solution to SPECIFICALLY fix a line break inside of a strong t…
SeijiEmery Jul 24, 2018
82435c4
fixed it again b/c someone on the anthropology page fucked up and put…
SeijiEmery Jul 24, 2018
d633682
updated to skip comments (was not aware that this WASN'T skipping com…
SeijiEmery Jul 24, 2018
0a4659e
added a ton of parse cases to filter out unneeded tokens
SeijiEmery Jul 24, 2018
0964fca
Merge branch 'ucsc-data' of https://github.com/coursegraph/CourseGrap…
SeijiEmery Jul 25, 2018
4fdae30
merged minor changes / refactoring done in firebase branch w/ ucsc-da…
SeijiEmery Jul 25, 2018
5af2901
removed data file
SeijiEmery Jul 25, 2018
7 changes: 7 additions & 0 deletions .gitignore
@@ -23,3 +23,10 @@ package-lock.json

*.pyc
/.coveralls.yml

/crawlers/ucsc/data/*
/crawlers/ucsc/prereqs
/crawlers/ucsc/unparsed
crawlers/ucsc/temp
crawlers/ucsd/ucsd_courses.json
crawlers/ucsd/ucsd_graph_data.json
164 changes: 164 additions & 0 deletions crawlers/ucsc/fetch_course_pages.py
@@ -0,0 +1,164 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup, Comment
from urllib.error import HTTPError
from fetch_index import fetch_soup, enforce, fetch_department_urls
import os

def extract_text (element):
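    # Recursively flattens a bs4 element into plain text: <p>/<div> become
    # newline-wrapped blocks, <br> becomes '\n', and html comments are dropped;
    # <strong> gets special line-break handling (explained below).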
# this is REALLY f***ing ugly...
if isinstance(element, Comment):
return ''
    elif element.name in ('p', 'div'):
        return '\n%s\n' % (u''.join(map(extract_text, element)))
elif element.name == 'br':
return '\n'
elif element.name == 'strong':
        # This probably deserves some explanation. Ok, the issues are as follows:
        # - some idiot put a line break to separate stuff-that-should-be-separated in lgst.
        #   That line break / paragraph element doesn't show up elsewhere, so we have to
        #   catch + address it here.
        # - some other idiot put a line break in anthropology, separating a title that
        #   SHOULDN'T be separated
        #
        # So, we do the following:
        # - we manually concatenate all of the inner text tags (b/c there's no way to do
        #   this otherwise)
        # - if non-empty text is followed by a line break, we emit a '\n' afterwards
        # - if not, we don't, b/c there shouldn't be any good reason to put a <br /> inside
        #   of a strong tag given what the registrar page is supposed to look like...
text = ''
has_non_internal_line_break = False
for child in element:
if child.name == 'br':
has_non_internal_line_break = True
            elif child.name is None:
text += child
has_non_internal_line_break = False
return text + '\n' if has_non_internal_line_break else text
elif element.name is None:
return '%s'%element
elif element.name == 'comment':
raise Exception("Skipping comment %s"%element.text)
else:
return element.text

def extract_sections (content, dept):
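    # Splits a department page's content into per-division text blobs keyed by
    # its '<Division> Courses' headers (h1-h4), then returns them joined with
    # 'DIVISION <name>' markers for the downstream course parser.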
divisions = {}
text = ''
division = None
for child in content:
        if child.name in ('h1', 'h2', 'h3', 'h4'):
match = re.match(r'^\s*([A-Z][a-z]+(?:\-[A-Z][a-z]+)*)\s+Courses', child.text)
enforce(match, "Expected header to be course heading, got '%s'", child.text)
if division:
divisions[division] = text
text = ''
division = match.group(1)
# print("Setting division: '%s'"%division)
elif division:
            # skip <p align="..."> layout-only tags
            if child.name == 'p' and child.has_attr('align'):
                continue
            text += extract_text(child)
if division:
divisions[division] = text

print("Listed Divisions: %s"%divisions.keys())

text = ''

# THIS IS A TERRIBLE HACK.
# Problem: the sociology page's intro course is missing a course number.
# Solution: this.
    # This will break (hopefully) whenever the sociology department fixes that page.
# Until then, uh...
if dept == 'socy':
divisions['Lower-Division'] = '1. '+divisions['Lower-Division']

for k, v in divisions.items():
text += '\nDIVISION %s\n%s'%(k, v)
return text

def fetch_dept_page_content (url):
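    # Fetches one department page, extracts its course-listing text, and
    # normalizes whitespace; returns None if the page can't be fetched.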
try:
soup = fetch_soup(url)
content = soup.find("div", {"class": "content"})
text = extract_sections(content, url.split('/')[-1].split('.')[0])
enforce(text, "Empty page content: '%s'\nRaw content:\n%s", url, content.text)
text = text.replace('\\n', '')
text = '\n'.join([ line.strip() for line in text.split('\n') ])
return text
except HTTPError:
print("Failed to open department page '%s'"%url)
return None

class DepartmentPageEntry:
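    # Simple record: a department's short id (the page's basename), its page
    # title, source url, and extracted plain-text content.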
def __init__ (self, dept, title, url, content):
self.dept = dept.strip()
self.title = title.strip()
self.url = url.strip()
self.content = content

def __repr__ (self):
        return '''[Department %s title '%s' url '%s' content (%d byte(s))]''' % (
            self.dept, self.title, self.url, len(self.content))

def fetch_department_course_pages (base_url = 'https://registrar.ucsc.edu/catalog/programs-courses', dept_urls = None):
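    # Generator: resolves department urls from the registrar index (unless
    # given explicitly) and yields a DepartmentPageEntry per fetchable page.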
if not dept_urls:
dept_urls = fetch_department_urls(base_url)
enforce(dept_urls, "Could not fetch department urls from index at base url '%s'", base_url)

for title, url in dept_urls.items():
page = url.split(u'/')[-1]
dept = page.split(u'.')[0]
url = u'%s/course-descriptions/%s'%(base_url, page)
print("Fetching '%s' => '%s'"%(title, url))
result = fetch_dept_page_content(url)
if result:
yield DepartmentPageEntry(dept, title, url, result)

def dump_department_pages_to_disk (path='data', base_url = 'https://registrar.ucsc.edu/catalog/programs-courses', dept_urls = None):
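    # Writes each fetched page to <path>/courses/<dept> as a small text record
    # (dept, title, url, then content); the courses/ directory must already exist.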
for dept in fetch_department_course_pages(base_url, dept_urls):
with open('%s/courses/%s'%(path, dept.dept), 'w') as f:
f.write(u'\n'.join([
dept.dept,
dept.title,
dept.url,
dept.content
]))

def fetch_courses_from_disk (path='data'):
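    # Reloads cached pages from <path>/courses/; the first three lines of each
    # file are dept, title, and url, and the remainder is the page content.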
for filename in os.listdir(u'%s/courses/'%path):
with open(u'%s/courses/%s'%(path, filename), 'r') as f:
lines = f.read().split('\n')
result = DepartmentPageEntry(
lines[0],
lines[1],
lines[2],
'\n'.join(lines[3:]))
print("Loaded %s: '%s', %s byte(s)"%(
result.dept, result.title, len(result.content)))
yield result

def fetch_course_pages (*args, **kwargs):
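    # Prefers the on-disk cache; falls back to a live crawl of all department
    # pages when no cached files exist (args are forwarded to either fetcher).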
courses = list(fetch_courses_from_disk(*args, **kwargs))
if not courses:
print("No disk cache; refetching")
return fetch_department_course_pages(*args, **kwargs)
return courses


if __name__ == '__main__':
dump_department_pages_to_disk('data')
# dept_urls = fetch_department_urls()
# print("Got %s"%dept_urls)
# for dept in fetch_department_course_pages():
# print(dept)
# print(dept.content)
# print()
63 changes: 63 additions & 0 deletions crawlers/ucsc/fetch_index.py
@@ -0,0 +1,63 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
import unicodedata

def read_url (url):
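    # Fetches a url and returns the raw response body as bytes.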
response = urlopen(url)
return response.read()

def fetch_soup (url):
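    # Fetches a page, normalizes whitespace and unicode (NFKD), dumps the text
    # to a 'temp' file for debugging, and returns a BeautifulSoup parse tree.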
    text = read_url(url).decode('utf8')
# text = text.replace(u'\u2014', u'–') # unicode bullshit
text = text.replace('\xa0', ' ')
text = unicodedata.normalize('NFKD', text)
with open('temp', 'w') as f:
f.write(text)
return BeautifulSoup(text, 'html.parser')

def enforce (condition, msg, *args):
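    # Formatted assert: raises Exception(msg % args) when the condition is falsy.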
if not condition:
raise Exception(msg % args)

def parse_department_link (a):
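    # Parses one department <a> tag into (text, href); returns None for anchors
    # with empty link text. Enforces the 'program.statements/<dept>.html' url shape.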
href = a['href'] #if 'href' in a else ''
#title = a['title'] if 'title' in a else ''
match = re.match(r'program.statements/([a-z]+\.html)', href)
enforce(match, "Unexpected link url: '%s'", href)
text = a.text.strip()
if text:
return text, href

def parse_department_links (links):
for link in links:
result = parse_department_link(link)
if result:
yield result

def fetch_department_urls (base_url = 'https://registrar.ucsc.edu/catalog/programs-courses'):
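    # Scrapes the registrar index: locates the '#departments' anchor, checks it
    # sits in an h2 header, then collects every department link from the table
    # row that follows, returning {department name: absolute url}.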
index_url = '%s/index.html'%base_url
soup = fetch_soup(index_url)
dept_anchor = soup.find('a', id='departments')
enforce(dept_anchor, "Could not find '%s/#departments'", index_url)
header = dept_anchor.parent
    enforce(header.name == "h2", "Unexpected: '#departments' anchor parent is not an h2 tag (got '%s')", header.name)
    table = header.findNext('tr')
    enforce(table and table.name == "tr", "Expected element after heading to be a table row, not '%s'", table)
return {k: '%s/%s'%(base_url, v) for k, v in parse_department_links(table.find_all('a'))}

if __name__ == '__main__':
result = fetch_department_urls()
print("Found %s department(s):"%len(result))
for k, v in result.items():
print("%s: %s"%(k, v))