
Merged changes (mostly from ucsc-data and firebase, and cleanup) b/c end of sprint 3 #45


Merged: 67 commits, Jul 25, 2018
Commits
da06bdb
refactored crawler
SeijiEmery Jul 20, 2018
498585b
updated output to include all data
SeijiEmery Jul 20, 2018
fc56a79
wrote function to grab registrar index urls
SeijiEmery Jul 23, 2018
75225da
wrote course page fetcher
SeijiEmery Jul 23, 2018
7aa7003
started ucsc course parser
SeijiEmery Jul 23, 2018
82fd99e
started parsing course descrips
SeijiEmery Jul 23, 2018
040f929
rewrote fetch_course_pages to handle whitespace (and unicode) correctly
SeijiEmery Jul 23, 2018
4ba8118
cleanup; fully switched to python3 (was supporting python2 for idk wh…
SeijiEmery Jul 23, 2018
9cd4a93
cleaned up output
SeijiEmery Jul 23, 2018
556f836
started parsing ucsc courses
SeijiEmery Jul 23, 2018
56aec63
parsing course listings...
SeijiEmery Jul 23, 2018
0df6931
fixed annoying edge cases
SeijiEmery Jul 23, 2018
4cc1aab
more stupid edge cases
SeijiEmery Jul 23, 2018
aaaf573
blame the f***ing history dept (and film). acronyms !#$*^#*&!
SeijiEmery Jul 23, 2018
5836057
filtering out <p align="..."> tags failed for some reason for biochem…
SeijiEmery Jul 23, 2018
86251d6
finished edge cases...?!
SeijiEmery Jul 23, 2018
0d02cf4
parsed instructors
SeijiEmery Jul 23, 2018
531989d
added info log of courses by dept / div
SeijiEmery Jul 23, 2018
b0a8c79
working on proper prereq parser
SeijiEmery Jul 23, 2018
9db12d6
fixed index parsing (only was fetching half of all course listings...)
SeijiEmery Jul 23, 2018
70926b6
fixed history edge cases
SeijiEmery Jul 23, 2018
3fc4677
fixed weird edge case where the lit page has nested divs...
SeijiEmery Jul 23, 2018
8e14826
fixed edgecases
SeijiEmery Jul 23, 2018
d2c782a
used a terrible hack to fix output from the sociology page (first cou…
SeijiEmery Jul 23, 2018
d21121c
crappy solution to SPECIFICALLY fix a line break inside of a strong t…
SeijiEmery Jul 24, 2018
098a868
fixed it again b/c someone on the anthropology page fucked up and put…
SeijiEmery Jul 24, 2018
ee9110d
updated to skip comments (was not aware that this WASN'T skipping com…
SeijiEmery Jul 24, 2018
31c6b1f
added a ton of parse cases to filter out unneeded tokens
SeijiEmery Jul 24, 2018
c952b2c
update README.md
Kuahoo Jul 25, 2018
3ebf0da
Merge branch 'development' into ucsc-data
SeijiEmery Jul 25, 2018
3c1a163
Merge branch 'development' into sprint3
SeijiEmery Jul 25, 2018
8af0a1a
Merge remote-tracking branch 'origin/sprint3' into sprint3
SeijiEmery Jul 25, 2018
98bc764
removed data files
SeijiEmery Jul 25, 2018
e3eb1a2
refactored crawler
SeijiEmery Jul 20, 2018
162aa84
updated output to include all data
SeijiEmery Jul 20, 2018
dd35abd
Merge branch 'firebase' of https://github.com/coursegraph/CourseGraph…
SeijiEmery Jul 25, 2018
d057980
added gitignore for data outputs
SeijiEmery Jul 25, 2018
61580b6
wrote function to grab registrar index urls
SeijiEmery Jul 23, 2018
25ae2b5
wrote course page fetcher
SeijiEmery Jul 23, 2018
7128461
started ucsc course parser
SeijiEmery Jul 23, 2018
457769f
started parsing course descrips
SeijiEmery Jul 23, 2018
85480cf
rewrote fetch_course_pages to handle whitespace (and unicode) correctly
SeijiEmery Jul 23, 2018
df8a0ed
cleanup; fully switched to python3 (was supporting python2 for idk wh…
SeijiEmery Jul 23, 2018
12a8729
cleaned up output
SeijiEmery Jul 23, 2018
9c8761b
started parsing ucsc courses
SeijiEmery Jul 23, 2018
05d4e6e
parsing course listings...
SeijiEmery Jul 23, 2018
e5eb758
fixed annoying edge cases
SeijiEmery Jul 23, 2018
18305e9
more stupid edge cases
SeijiEmery Jul 23, 2018
3fddd5f
blame the f***ing history dept (and film). acronyms !#$*^#*&!
SeijiEmery Jul 23, 2018
8feefa5
filtering out <p align="..."> tags failed for some reason for biochem…
SeijiEmery Jul 23, 2018
084f610
finished edge cases...?!
SeijiEmery Jul 23, 2018
f72256a
parsed instructors
SeijiEmery Jul 23, 2018
0d4933a
added info log of courses by dept / div
SeijiEmery Jul 23, 2018
dd58b5d
working on proper prereq parser
SeijiEmery Jul 23, 2018
24f97ef
fixed index parsing (only was fetching half of all course listings...)
SeijiEmery Jul 23, 2018
97efea0
fixed history edge cases
SeijiEmery Jul 23, 2018
42aca19
fixed weird edge case where the lit page has nested divs...
SeijiEmery Jul 23, 2018
6467bd2
fixed edgecases
SeijiEmery Jul 23, 2018
a20f044
used a terrible hack to fix output from the sociology page (first cou…
SeijiEmery Jul 23, 2018
24c76b4
crappy solution to SPECIFICALLY fix a line break inside of a strong t…
SeijiEmery Jul 24, 2018
82435c4
fixed it again b/c someone on the anthropology page fucked up and put…
SeijiEmery Jul 24, 2018
d633682
updated to skip comments (was not aware that this WASN'T skipping com…
SeijiEmery Jul 24, 2018
0a4659e
added a ton of parse cases to filter out unneeded tokens
SeijiEmery Jul 24, 2018
0964fca
Merge branch 'ucsc-data' of https://github.com/coursegraph/CourseGrap…
SeijiEmery Jul 25, 2018
4fdae30
merged minor changes / refactoring done in firebase branch w/ ucsc-da…
SeijiEmery Jul 25, 2018
5af2901
removed data file
SeijiEmery Jul 25, 2018
f89788e
Merge pull request #44 from coursegraph/ucsc-data
SeijiEmery Jul 25, 2018
7 changes: 7 additions & 0 deletions .gitignore
@@ -23,3 +23,10 @@ package-lock.json

*.pyc
/.coveralls.yml

/crawlers/ucsc/data/*
/crawlers/ucsc/prereqs
/crawlers/ucsc/unparsed
crawlers/ucsc/temp
crawlers/ucsd/ucsd_courses.json
crawlers/ucsd/ucsd_graph_data.json
64 changes: 45 additions & 19 deletions README.md
@@ -21,8 +21,8 @@ Solution? CourseGraph, a webapp that will:

Technology: we will need

- + a web frontend (probably React, Typescript, D3) and people interested in UX and software design (myself included)
- + a web backend (probably node) and people interested in backend development and data storage / retrieval
+ + a web frontend (probably React, vis.js, material-ui) and people interested in UX and software design (myself included)
+ + a web backend (probably node, mongoDB) and people interested in backend development and data storage / retrieval
+ several web crawlers to datamine UCSC sites and maybe others; anyone interested in this please apply!
+ possible integration of other web services (if we could embed eg. ratemyprofessors that would be awesome)

@@ -32,11 +32,11 @@ Is this feasible in <5 weeks?
+ Plus side is we all get to wear lots of hats and use a lot of cool tech to build a real tool that students and counselors can use to explore class options and make planning schedules a lot easier
+ This project can be subdivided with 2-3 teams working in parallel on different components (eg. frontend and data mining), so we should be able to work without too many bottlenecks

- You do NOT need to have experience with typescript, react, node, or d3 to join this project, just a good attitude and a willingness to learn and contribute.
+ You do NOT need to have experience with react, node, or Vis to join this project, just a good attitude and a willingness to learn and contribute.

- That said, you will need time to learn a bit of typescript and either frontend (react, d3), backend (node, databases – ask Ivan), or data mining (web crawlers, either node or python), since we'll probably be splitting into sub-teams that focus on one of those categories. And you'll need to do this fairly quickly (ie. over the next few weeks) since we'll need to hit the ground running as soon as possible. Oh, and if you'd like to do project management (as one of your many hats) that would be very useful too.
+ That said, you will need time to learn a bit of typescript and either frontend (react, vis.js), backend (node, databases – ask Ivan), or data mining (web crawlers, either node or python), since we'll probably be splitting into sub-teams that focus on one of those categories. And you'll need to do this fairly quickly (ie. over the next few weeks) since we'll need to hit the ground running as soon as possible. Oh, and if you'd like to do project management (as one of your many hats) that would be very useful too.

- I'll be learning react and d3 over the next week or so, so if you're interested in that (whether you're a part of this team or not) please hit me up! (ssemery@ucsc.edu)
+ I'll be learning react and vis.js over the next week or so, so if you're interested in that (whether you're a part of this team or not) please hit me up! (ssemery@ucsc.edu)

## Getting Started

@@ -45,6 +45,7 @@ These instructions will get you a copy of the project up and running on your loc
### Prerequisites

[Node.js](https://nodejs.org/en/) - JavaScript runtime built on Chrome's V8 JavaScript engine.
+ [MongoDB](https://docs.mongodb.com/manual/installation/) - MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

The minimum supported Node version is `v6.0.0` by default. (We are using `v10.0.0`).

@@ -89,29 +90,54 @@ npm run test

This command runs [`jest`](http://jestjs.io/) and [`enzyme`](http://airbnb.io/enzyme/), two incredibly useful testing utilities.

### And coding style tests

We use `TSLint`; run it with a single command:

```
npm run pretest
```

## Built With

* [Next.js](https://nextjs.org/) - A lightweight framework for static and server‑rendered applications.
* [React](https://reactjs.org) - A JavaScript library for building user interfaces
* [Node.js](https://nodejs.org/en/) - A JavaScript runtime built on Chrome's V8 JavaScript engine.
* [MongoDB](https://www.mongodb.com/) - A document-oriented NoSQL database, used here for data storage.

## Dependencies
* [material-ui/core]
* [material-ui/icons]
* [algoliasearch]
* [bcrypt-nodejs]
* [body-parser]
* [compression]
* [connect-mongo]
* [crypto]
* [express]
* [express-flash]
* [express-session]
* [express-validator]
* [isomorphic-unfetch]
* [jss]
* [lru-cache]
* [mongoose]
* [next]
* [nprogress]
* [passport]
* [passport-local]
* [prop-types]
* [qs]
* [react]
* [react-dom]
* [react-draggable]
* [react-graph-vis]
* [react-instantsearch]
* [react-jss]
* [reactjs-popup]
* [styled-jsx]


## Authors

- * **Seiji Emery** ([SeijiEmery](https://github.com/SeijiEmery)) -
+ * **Seiji Emery** ([SeijiEmery](https://github.com/SeijiEmery)) - Lead Tech Developer
* **Yanwen Xu** ([RaiderSoap](https://github.com/RaiderSoap)) - :floppy_disk: Back-End Developer
- * **Patrick Lauderdale** ([ThePatrickLauderdale](https://github.com/ThePatrickLauderdale)) -
- * **Sharad Shrestha** ([sharad97](https://github.com/sharad97)) -
- * **Wendy Liang** ([wendyrliang](https://github.com/wendyrliang)) -
- * **Ka Ho Tran** ([Kutaho](https://github.com/Kutaho)) -
+ * **Patrick Lauderdale** ([ThePatrickLauderdale](https://github.com/ThePatrickLauderdale)) - Front-End Developer
+ * **Wendy Liang** ([wendyrliang](https://github.com/wendyrliang)) - Front-End Developer
+ * **Ka Ho Tran** ([Kutaho](https://github.com/Kutaho)) - Front-End Developer
+ * **Nikki Miller** ([NikMills](https://github.com/nikmills)) - Front-End Developer

See also the list of [contributors](https://github.com/coursegraph/CourseGraph/settings/collaboration) who participated in this project.

@@ -123,4 +149,4 @@ This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md

Big thanks to Richard Jullig.

:kissing_heart:
164 changes: 164 additions & 0 deletions crawlers/ucsc/fetch_course_pages.py
@@ -0,0 +1,164 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup, Comment
from urllib.request import HTTPError
from fetch_index import fetch_soup, enforce, fetch_department_urls
import os

def extract_text (element):
    # this is REALLY f***ing ugly...
    if isinstance(element, Comment):
        return ''
    elif element.name in ('p', 'div'):
        # block elements: recurse into children, surrounded by line breaks
        return '\n%s\n'%(u''.join(map(extract_text, element)))
    elif element.name == 'br':
        return '\n'
    elif element.name == 'strong':
        # This probably deserves some explanation. Ok, issues are as follows:
        # – some idiot put a line break to separate stuff-that-should-be-separated in lgst.
        #   line break / paragraph element doesn't show up elsewhere, so we have to catch +
        #   address it here.
        # - some other idiot put a line break in anthropology, separating a title that
        #   SHOULDN'T be separated
        #
        # So, we do the following:
        # – we manually concatenate all of the inner text tags (b/c no way to do this otherwise)
        # - if non-empty text is followed by a line break, we emit a '\n' afterwards
        # - if not we don't, b/c there shouldn't be any good reason to put a <br /> inside of a
        #   strong tag given what the registrar page is supposed to look like...
        text = ''
        has_non_internal_line_break = False
        for child in element:
            if child.name == 'br':
                has_non_internal_line_break = True
            elif child.name == None:
                text += child
                has_non_internal_line_break = False
        return text + '\n' if has_non_internal_line_break else text
    elif element.name is None:
        return '%s'%element
    elif element.name == 'comment':
        # defensive; real comments are normally caught by the isinstance check above
        raise Exception("Skipping comment %s"%element.text)
    else:
        return element.text

def extract_sections (content, dept):
    divisions = {}
    text = ''
    division = None
    for child in content:
        if child.name in ('h1', 'h2', 'h3', 'h4'):
            match = re.match(r'^\s*([A-Z][a-z]+(?:\-[A-Z][a-z]+)*)\s+Courses', child.text)
            enforce(match, "Expected header to be course heading, got '%s'", child.text)
            if division:
                divisions[division] = text
                text = ''
            division = match.group(1)
            # print("Setting division: '%s'"%division)
        elif division:
            if child.name == 'p':
                # skip layout-only <p align="..."> paragraphs
                try:
                    test = child['align']
                    continue
                except KeyError:
                    pass
            text += extract_text(child)
    if division:
        divisions[division] = text

    print("Listed Divisions: %s"%divisions.keys())

    text = ''

    # THIS IS A TERRIBLE HACK.
    # Problem: the sociology page's intro course is missing a course number.
    # Solution: this.
    # This will break (hopefully) whenever the sociology fixes that page.
    # Until then, uh...
    if dept == 'socy':
        divisions['Lower-Division'] = '1. '+divisions['Lower-Division']

    for k, v in divisions.items():
        text += '\nDIVISION %s\n%s'%(k, v)
    return text

def fetch_dept_page_content (url):
    try:
        soup = fetch_soup(url)
        content = soup.find("div", {"class": "content"})
        text = extract_sections(content, url.split('/')[-1].split('.')[0])
        enforce(text, "Empty page content: '%s'\nRaw content:\n%s", url, content.text)
        text = text.replace('\\n', '')
        text = '\n'.join([ line.strip() for line in text.split('\n') ])
        return text
    except HTTPError:
        print("Failed to open department page '%s'"%url)
        return None

class DepartmentPageEntry:
    def __init__ (self, dept, title, url, content):
        self.dept = dept.strip()
        self.title = title.strip()
        self.url = url.strip()
        self.content = content

    def __repr__ (self):
        return '''[Department %s title '%s' url '%s' content (%d byte(s))'''%(
            self.dept, self.title, self.url, len(self.content))

def fetch_department_course_pages (base_url = 'https://registrar.ucsc.edu/catalog/programs-courses', dept_urls = None):
    if not dept_urls:
        dept_urls = fetch_department_urls(base_url)
        enforce(dept_urls, "Could not fetch department urls from index at base url '%s'", base_url)

    for title, url in dept_urls.items():
        page = url.split(u'/')[-1]
        dept = page.split(u'.')[0]
        url = u'%s/course-descriptions/%s'%(base_url, page)
        print("Fetching '%s' => '%s'"%(title, url))
        result = fetch_dept_page_content(url)
        if result:
            yield DepartmentPageEntry(dept, title, url, result)

def dump_department_pages_to_disk (path='data', base_url = 'https://registrar.ucsc.edu/catalog/programs-courses', dept_urls = None):
    # note: assumes the '<path>/courses/' directory already exists
    for dept in fetch_department_course_pages(base_url, dept_urls):
        with open('%s/courses/%s'%(path, dept.dept), 'w') as f:
            f.write(u'\n'.join([
                dept.dept,
                dept.title,
                dept.url,
                dept.content
            ]))

def fetch_courses_from_disk (path='data'):
    for filename in os.listdir(u'%s/courses/'%path):
        with open(u'%s/courses/%s'%(path, filename), 'r') as f:
            lines = f.read().split('\n')
            result = DepartmentPageEntry(
                lines[0],
                lines[1],
                lines[2],
                '\n'.join(lines[3:]))
            print("Loaded %s: '%s', %s byte(s)"%(
                result.dept, result.title, len(result.content)))
            yield result

def fetch_course_pages (*args, **kwargs):
    # use the on-disk cache if present; otherwise re-crawl the registrar
    courses = list(fetch_courses_from_disk(*args, **kwargs))
    if not courses:
        print("No disk cache; refetching")
        return fetch_department_course_pages(*args, **kwargs)
    return courses


if __name__ == '__main__':
    dump_department_pages_to_disk('data')
    # dept_urls = fetch_department_urls()
    # print("Got %s"%dept_urls)
    # for dept in fetch_department_course_pages():
    #     print(dept)
    #     print(dept.content)
    #     print()
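A note on the `Comment` check at the top of `extract_text`, since commit ee9110d ("updated to skip comments") suggests this bit in practice: in BeautifulSoup, `Comment` is a subclass of `NavigableString`, so a comment node also has `.name == None` and would otherwise fall through to the plain-text branch and leak HTML comments into the output. A minimal sketch of the behavior (the sample HTML is made up):

```python
from bs4 import BeautifulSoup, Comment

# bs4 yields comments while iterating a tag's children, and Comment
# subclasses NavigableString, so its .name is also None -- the
# isinstance test must come before the `.name is None` branch.
soup = BeautifulSoup("<p>real text<!-- hidden note --></p>", "html.parser")
for child in soup.p:
    if isinstance(child, Comment):
        print("comment (skipped): %s" % child)  # matched first, dropped
    elif child.name is None:
        print("text node: %s" % child)          # plain strings land here
```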
63 changes: 63 additions & 0 deletions crawlers/ucsc/fetch_index.py
@@ -0,0 +1,63 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
import unicodedata

def read_url (url):
    response = urlopen(url)
    return response.read()

def fetch_soup (url):
    # decode, then normalize: registrar pages mix non-breaking spaces and
    # unicode compatibility characters into otherwise-plain text
    text = read_url(url).decode('utf8', errors='replace')
    # text = text.replace(u'\u2014', u'–') # unicode bullshit
    text = text.replace('\xa0', ' ')
    text = unicodedata.normalize('NFKD', text)
    with open('temp', 'w') as f:
        # debug dump of the normalized page
        f.write(text)
    return BeautifulSoup(text, 'html.parser')

def enforce (condition, msg, *args):
    if not condition:
        raise Exception(msg % args)

def parse_department_link (a):
    href = a['href'] #if 'href' in a else ''
    #title = a['title'] if 'title' in a else ''
    match = re.match(r'program.statements/([a-z]+\.html)', href)
    enforce(match, "Unexpected link url: '%s'", href)
    text = a.text.strip()
    if text:
        return text, href

def parse_department_links (links):
    for link in links:
        result = parse_department_link(link)
        if result:
            yield result

def fetch_department_urls (base_url = 'https://registrar.ucsc.edu/catalog/programs-courses'):
    index_url = '%s/index.html'%base_url
    soup = fetch_soup(index_url)
    dept_anchor = soup.find('a', id='departments')
    enforce(dept_anchor, "Could not find '%s/#departments'", index_url)
    header = dept_anchor.parent
    enforce(header.name == "h2", "Unexpected: is not a h2 tag (got '%s')", header.name)
    table = header.findNext('tr')
    enforce(table.name == "tr", "Expected element after heading to be table, not '%s'", table.name)
    return {k: '%s/%s'%(base_url, v) for k, v in parse_department_links(table.find_all('a'))}

if __name__ == '__main__':
    result = fetch_department_urls()
    print("Found %s department(s):"%len(result))
    for k, v in result.items():
        print("%s: %s"%(k, v))