
Merged changes (mostly from ucsc-data and firebase, and cleanup) b/c end of sprint 3 #45


Merged: 67 commits, Jul 25, 2018
Commits
da06bdb
refactored crawler
SeijiEmery Jul 20, 2018
498585b
updated output to include all data
SeijiEmery Jul 20, 2018
fc56a79
wrote function to grab registrar index urls
SeijiEmery Jul 23, 2018
75225da
wrote course page fetcher
SeijiEmery Jul 23, 2018
7aa7003
started ucsc course parser
SeijiEmery Jul 23, 2018
82fd99e
started parsing course descrips
SeijiEmery Jul 23, 2018
040f929
rewrote fetch_course_pages to handle whitespace (and unicode) correctly
SeijiEmery Jul 23, 2018
4ba8118
cleanup; fully switched to python3 (was supporting python2 for idk wh…
SeijiEmery Jul 23, 2018
9cd4a93
cleaned up output
SeijiEmery Jul 23, 2018
556f836
started parsing ucsc courses
SeijiEmery Jul 23, 2018
56aec63
parsing course listings...
SeijiEmery Jul 23, 2018
0df6931
fixed annoying edge cases
SeijiEmery Jul 23, 2018
4cc1aab
more stupid edge cases
SeijiEmery Jul 23, 2018
aaaf573
blame the f***ing history dept (and film). acronyms !#$*^#*&!
SeijiEmery Jul 23, 2018
5836057
filtering out <p align="..."> tags failed for some reason for biochem…
SeijiEmery Jul 23, 2018
86251d6
finished edge cases...?!
SeijiEmery Jul 23, 2018
0d02cf4
parsed instructors
SeijiEmery Jul 23, 2018
531989d
added info log of courses by dept / div
SeijiEmery Jul 23, 2018
b0a8c79
working on proper prereq parser
SeijiEmery Jul 23, 2018
9db12d6
fixed index parsing (only was fetching half of all course listings...)
SeijiEmery Jul 23, 2018
70926b6
fixed history edge cases
SeijiEmery Jul 23, 2018
3fc4677
fixed weird edge case where the lit page has nested divs...
SeijiEmery Jul 23, 2018
8e14826
fixed edgecases
SeijiEmery Jul 23, 2018
d2c782a
used a terrible hack to fix output from the sociology page (first cou…
SeijiEmery Jul 23, 2018
d21121c
crappy solution to SPECIFICALLY fix a line break inside of a strong t…
SeijiEmery Jul 24, 2018
098a868
fixed it again b/c someone on the anthropology page fucked up and put…
SeijiEmery Jul 24, 2018
ee9110d
updated to skip comments (was not aware that this WASN'T skipping com…
SeijiEmery Jul 24, 2018
31c6b1f
added a ton of parse cases to filter out unneeded tokens
SeijiEmery Jul 24, 2018
c952b2c
update README.md
Kuahoo Jul 25, 2018
3ebf0da
Merge branch 'development' into ucsc-data
SeijiEmery Jul 25, 2018
3c1a163
Merge branch 'development' into sprint3
SeijiEmery Jul 25, 2018
8af0a1a
Merge remote-tracking branch 'origin/sprint3' into sprint3
SeijiEmery Jul 25, 2018
98bc764
removed data files
SeijiEmery Jul 25, 2018
e3eb1a2
refactored crawler
SeijiEmery Jul 20, 2018
162aa84
updated output to include all data
SeijiEmery Jul 20, 2018
dd35abd
Merge branch 'firebase' of https://github.com/coursegraph/CourseGraph…
SeijiEmery Jul 25, 2018
d057980
added gitignore for data outputs
SeijiEmery Jul 25, 2018
61580b6
wrote function to grab registrar index urls
SeijiEmery Jul 23, 2018
25ae2b5
wrote course page fetcher
SeijiEmery Jul 23, 2018
7128461
started ucsc course parser
SeijiEmery Jul 23, 2018
457769f
started parsing course descrips
SeijiEmery Jul 23, 2018
85480cf
rewrote fetch_course_pages to handle whitespace (and unicode) correctly
SeijiEmery Jul 23, 2018
df8a0ed
cleanup; fully switched to python3 (was supporting python2 for idk wh…
SeijiEmery Jul 23, 2018
12a8729
cleaned up output
SeijiEmery Jul 23, 2018
9c8761b
started parsing ucsc courses
SeijiEmery Jul 23, 2018
05d4e6e
parsing course listings...
SeijiEmery Jul 23, 2018
e5eb758
fixed annoying edge cases
SeijiEmery Jul 23, 2018
18305e9
more stupid edge cases
SeijiEmery Jul 23, 2018
3fddd5f
blame the f***ing history dept (and film). acronyms !#$*^#*&!
SeijiEmery Jul 23, 2018
8feefa5
filtering out <p align="..."> tags failed for some reason for biochem…
SeijiEmery Jul 23, 2018
084f610
finished edge cases...?!
SeijiEmery Jul 23, 2018
f72256a
parsed instructors
SeijiEmery Jul 23, 2018
0d4933a
added info log of courses by dept / div
SeijiEmery Jul 23, 2018
dd58b5d
working on proper prereq parser
SeijiEmery Jul 23, 2018
24f97ef
fixed index parsing (only was fetching half of all course listings...)
SeijiEmery Jul 23, 2018
97efea0
fixed history edge cases
SeijiEmery Jul 23, 2018
42aca19
fixed weird edge case where the lit page has nested divs...
SeijiEmery Jul 23, 2018
6467bd2
fixed edgecases
SeijiEmery Jul 23, 2018
a20f044
used a terrible hack to fix output from the sociology page (first cou…
SeijiEmery Jul 23, 2018
24c76b4
crappy solution to SPECIFICALLY fix a line break inside of a strong t…
SeijiEmery Jul 24, 2018
82435c4
fixed it again b/c someone on the anthropology page fucked up and put…
SeijiEmery Jul 24, 2018
d633682
updated to skip comments (was not aware that this WASN'T skipping com…
SeijiEmery Jul 24, 2018
0a4659e
added a ton of parse cases to filter out unneeded tokens
SeijiEmery Jul 24, 2018
0964fca
Merge branch 'ucsc-data' of https://github.com/coursegraph/CourseGrap…
SeijiEmery Jul 25, 2018
4fdae30
merged minor changes / refactoring done in firebase branch w/ ucsc-da…
SeijiEmery Jul 25, 2018
5af2901
removed data file
SeijiEmery Jul 25, 2018
f89788e
Merge pull request #44 from coursegraph/ucsc-data
SeijiEmery Jul 25, 2018
7 changes: 7 additions & 0 deletions .gitignore
@@ -23,3 +23,10 @@ package-lock.json

*.pyc
/.coveralls.yml

/crawlers/ucsc/data/*
/crawlers/ucsc/prereqs
/crawlers/ucsc/unparsed
crawlers/ucsc/temp
crawlers/ucsd/ucsd_courses.json
crawlers/ucsd/ucsd_graph_data.json
64 changes: 45 additions & 19 deletions README.md
@@ -21,8 +21,8 @@ Solution? CourseGraph, a webapp that will:

Technology: we will need

- + a web frontend (probably React, Typescript, D3) and people interested in UX and software design (myself included)
- + a web backend (probably node) and people interested in backend development and data storage / retrieval
+ + a web frontend (probably React, vis.js, material-ui) and people interested in UX and software design (myself included)
+ + a web backend (probably node, mongoDB) and people interested in backend development and data storage / retrieval
+ several web crawlers to datamine UCSC sites and maybe others; anyone interested in this please apply!
+ possible integration of other web services (if we could embed eg. ratemyprofessors that would be awesome)

@@ -32,11 +32,11 @@ Is this feasible in <5 weeks?
+ Plus side is we all get to wear lots of hats and use a lot of cool tech to build a real tool that students and counselors can use to explore class options and make planning schedules a lot easier
+ This project can be subdivided with 2-3 teams working in parallel on different components (eg. frontend and data mining), so we should be able to work without too many bottlenecks

- You do NOT need to have experience with typescript, react, node, or d3 to join this project, just a good attitude and a willingness to learn and contribute.
+ You do NOT need to have experience with react, node, or Vis to join this project, just a good attitude and a willingness to learn and contribute.

- That said, you will need time to learn a bit of typescript and either frontend (react, d3), backend (node, databases – ask Ivan), or data mining (web crawlers, either node or python), since we'll probably be splitting into sub-teams that focus on one of those categories. And you'll need to do this fairly quickly (ie. over the next few weeks) since we'll need to hit the ground running as soon as possible. Oh, and if you'd like to do project management (as one of your many hats) that would be very useful too.
+ That said, you will need time to learn a bit of typescript and either frontend (react, vis.js), backend (node, databases – ask Ivan), or data mining (web crawlers, either node or python), since we'll probably be splitting into sub-teams that focus on one of those categories. And you'll need to do this fairly quickly (ie. over the next few weeks) since we'll need to hit the ground running as soon as possible. Oh, and if you'd like to do project management (as one of your many hats) that would be very useful too.

- I'll be learning react and d3 over the next week or so, so if you're interested in that (whether you're a part of this team or not) please hit me up! (ssemery@ucsc.edu)
+ I'll be learning react and vis.js over the next week or so, so if you're interested in that (whether you're a part of this team or not) please hit me up! (ssemery@ucsc.edu)

## Getting Started

@@ -45,6 +45,7 @@ These instructions will get you a copy of the project up and running on your loc
### Prerequisites

[Node.js](https://nodejs.org/en/) - JavaScript runtime built on Chrome's V8 JavaScript engine.
+ [MongoDB](https://docs.mongodb.com/manual/installation/) - MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

The minimum supported Node version is `v6.0.0` by default. (We are using `v10.0.0`).

@@ -89,29 +90,54 @@ npm run test

This command runs [`jest`](http://jestjs.io/) and [`enzyme`](http://airbnb.io/enzyme/), two incredibly useful testing utilities.

### And coding style tests

We use `TSLint`; run it with a single command:

```
npm run pretest
```

## Built With

* [Next.js](https://nextjs.org/) - A lightweight framework for static and server‑rendered applications.
* [React](https://reactjs.org) - A JavaScript library for building user interfaces
* [Node.js](https://nodejs.org/en/) - A JavaScript runtime built on Chrome's V8 JavaScript engine.
* [MongoDB](https://www.mongodb.com/) - A document-oriented NoSQL database, used here for data storage.

## Dependencies
* [material-ui/core]
* [material-ui/icons]
* [algoliasearch]
* [bcrypt-nodejs]
* [body-parser]
* [compression]
* [connect-mongo]
* [crypto]
* [express]
* [express-flash]
* [express-session]
* [express-validator]
* [isomorphic-unfetch]
* [jss]
* [lru-cache]
* [mongoose]
* [next]
* [nprogress]
* [passport]
* [passport-local]
* [prop-types]
* [qs]
* [react]
* [react-dom]
* [react-draggable]
* [react-graph-vis]
* [react-instantsearch]
* [react-jss]
* [reactjs-popup]
* [styled-jsx]


## Authors

- * **Seiji Emery** ([SeijiEmery](https://github.com/SeijiEmery)) -
+ * **Seiji Emery** ([SeijiEmery](https://github.com/SeijiEmery)) - Lead Tech Developer
* **Yanwen Xu** ([RaiderSoap](https://github.com/RaiderSoap)) - :floppy_disk: Back-End Developer
- * **Patrick Lauderdale** ([ThePatrickLauderdale](https://github.com/ThePatrickLauderdale)) -
- * **Sharad Shrestha** ([sharad97](https://github.com/sharad97)) -
- * **Wendy Liang** ([wendyrliang](https://github.com/wendyrliang)) -
- * **Ka Ho Tran** ([Kutaho](https://github.com/Kutaho)) -
+ * **Patrick Lauderdale** ([ThePatrickLauderdale](https://github.com/ThePatrickLauderdale)) - Front-End Developer
+ * **Wendy Liang** ([wendyrliang](https://github.com/wendyrliang)) - Front-End Developer
+ * **Ka Ho Tran** ([Kutaho](https://github.com/Kutaho)) - Front-End Developer
+ * **Nikki Miller** ([NikMills](https://github.com/nikmills)) - Front-End Developer

See also the list of [contributors](https://github.com/coursegraph/CourseGraph/settings/collaboration) who participated in this project.

@@ -123,4 +149,4 @@ This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md

Big thanks to Richard Jullig.

:kissing_heart:
164 changes: 164 additions & 0 deletions crawlers/ucsc/fetch_course_pages.py
@@ -0,0 +1,164 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup, Comment
from urllib.request import HTTPError
from fetch_index import fetch_soup, enforce, fetch_department_urls
import os

def extract_text (element):
    # this is REALLY f***ing ugly...
    if isinstance(element, Comment):
        return ''
    elif element.name in ('p', 'div'):
        # block elements: recurse into children, surrounded by line breaks
        return '\n%s\n'%(u''.join(map(extract_text, element)))
    elif element.name == 'br':
        return '\n'
    elif element.name == 'strong':
        # This probably deserves some explanation. Ok, issues are as follows:
        # – some idiot put a line break to separate stuff-that-should-be-separated in lgst.
        #   line break / paragraph element doesn't show up elsewhere, so we have to catch +
        #   address it here.
        # - some other idiot put a line break in anthropology, separating a title that
        #   SHOULDN'T be separated
        #
        # So, we do the following:
        # – we manually concatenate all of the inner text tags (b/c no way to do this otherwise)
        # - if non-empty text is followed by a line break, we emit a '\n' afterwards
        # - if not we don't, b/c there shouldn't be any good reason to put a <br /> inside of a
        #   strong tag given what the registrar page is supposed to look like...
        text = ''
        has_non_internal_line_break = False
        for child in element:
            if child.name == 'br':
                has_non_internal_line_break = True
            elif child.name == None:
                text += child
                has_non_internal_line_break = False
        return text + '\n' if has_non_internal_line_break else text
    elif element.name is None:
        return '%s'%element
    elif element.name == 'comment':
        # defensive; real comments are normally caught by the isinstance check above
        raise Exception("Skipping comment %s"%element.text)
    else:
        return element.text

def extract_sections (content, dept):
    divisions = {}
    text = ''
    division = None
    for child in content:
        if child.name in ('h1', 'h2', 'h3', 'h4'):
            match = re.match(r'^\s*([A-Z][a-z]+(?:\-[A-Z][a-z]+)*)\s+Courses', child.text)
            enforce(match, "Expected header to be course heading, got '%s'", child.text)
            if division:
                divisions[division] = text
                text = ''
            division = match.group(1)
            # print("Setting division: '%s'"%division)
        elif division:
            if child.name == 'p':
                # skip layout-only <p align="..."> paragraphs
                try:
                    test = child['align']
                    continue
                except KeyError:
                    pass
            text += extract_text(child)
    if division:
        divisions[division] = text

    print("Listed Divisions: %s"%divisions.keys())

    text = ''

    # THIS IS A TERRIBLE HACK.
    # Problem: the sociology page's intro course is missing a course number.
    # Solution: this.
    # This will break (hopefully) whenever the sociology fixes that page.
    # Until then, uh...
    if dept == 'socy':
        divisions['Lower-Division'] = '1. '+divisions['Lower-Division']

    for k, v in divisions.items():
        text += '\nDIVISION %s\n%s'%(k, v)
    return text

def fetch_dept_page_content (url):
    try:
        soup = fetch_soup(url)
        content = soup.find("div", {"class": "content"})
        text = extract_sections(content, url.split('/')[-1].split('.')[0])
        enforce(text, "Empty page content: '%s'\nRaw content:\n%s", url, content.text)
        text = text.replace('\\n', '')
        text = '\n'.join([ line.strip() for line in text.split('\n') ])
        return text
    except HTTPError:
        print("Failed to open department page '%s'"%url)
        return None

class DepartmentPageEntry:
    def __init__ (self, dept, title, url, content):
        self.dept = dept.strip()
        self.title = title.strip()
        self.url = url.strip()
        self.content = content

    def __repr__ (self):
        return '''[Department %s title '%s' url '%s' content (%d byte(s))'''%(
            self.dept, self.title, self.url, len(self.content))

def fetch_department_course_pages (base_url = 'https://registrar.ucsc.edu/catalog/programs-courses', dept_urls = None):
    if not dept_urls:
        dept_urls = fetch_department_urls(base_url)
        enforce(dept_urls, "Could not fetch department urls from index at base url '%s'", base_url)

    for title, url in dept_urls.items():
        page = url.split(u'/')[-1]
        dept = page.split(u'.')[0]
        url = u'%s/course-descriptions/%s'%(base_url, page)
        print("Fetching '%s' => '%s'"%(title, url))
        result = fetch_dept_page_content(url)
        if result:
            yield DepartmentPageEntry(dept, title, url, result)

def dump_department_pages_to_disk (path='data', base_url = 'https://registrar.ucsc.edu/catalog/programs-courses', dept_urls = None):
    # note: assumes the '<path>/courses/' directory already exists
    for dept in fetch_department_course_pages(base_url, dept_urls):
        with open('%s/courses/%s'%(path, dept.dept), 'w') as f:
            f.write(u'\n'.join([
                dept.dept,
                dept.title,
                dept.url,
                dept.content
            ]))

def fetch_courses_from_disk (path='data'):
    for filename in os.listdir(u'%s/courses/'%path):
        with open(u'%s/courses/%s'%(path, filename), 'r') as f:
            lines = f.read().split('\n')
            result = DepartmentPageEntry(
                lines[0],
                lines[1],
                lines[2],
                '\n'.join(lines[3:]))
            print("Loaded %s: '%s', %s byte(s)"%(
                result.dept, result.title, len(result.content)))
            yield result

def fetch_course_pages (*args, **kwargs):
    # use the on-disk cache if present; otherwise re-crawl the registrar
    courses = list(fetch_courses_from_disk(*args, **kwargs))
    if not courses:
        print("No disk cache; refetching")
        return fetch_department_course_pages(*args, **kwargs)
    return courses


if __name__ == '__main__':
    dump_department_pages_to_disk('data')
    # dept_urls = fetch_department_urls()
    # print("Got %s"%dept_urls)
    # for dept in fetch_department_course_pages():
    #     print(dept)
    #     print(dept.content)
    #     print()
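A note on the `Comment` check at the top of `extract_text`, since commit ee9110d ("updated to skip comments") suggests this bit in practice: in BeautifulSoup, `Comment` is a subclass of `NavigableString`, so a comment node also has `.name == None` and would otherwise fall through to the plain-text branch and leak HTML comments into the output. A minimal sketch of the behavior (the sample HTML is made up):

```python
from bs4 import BeautifulSoup, Comment

# bs4 yields comments while iterating a tag's children, and Comment
# subclasses NavigableString, so its .name is also None -- the
# isinstance test must come before the `.name is None` branch.
soup = BeautifulSoup("<p>real text<!-- hidden note --></p>", "html.parser")
for child in soup.p:
    if isinstance(child, Comment):
        print("comment (skipped): %s" % child)  # matched first, dropped
    elif child.name is None:
        print("text node: %s" % child)          # plain strings land here
```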
63 changes: 63 additions & 0 deletions crawlers/ucsc/fetch_index.py
@@ -0,0 +1,63 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
import unicodedata

def read_url (url):
    response = urlopen(url)
    return response.read()

def fetch_soup (url):
    # decode, then normalize: registrar pages mix non-breaking spaces and
    # unicode compatibility characters into otherwise-plain text
    text = read_url(url).decode('utf8', errors='replace')
    # text = text.replace(u'\u2014', u'–') # unicode bullshit
    text = text.replace('\xa0', ' ')
    text = unicodedata.normalize('NFKD', text)
    with open('temp', 'w') as f:
        # debug dump of the normalized page
        f.write(text)
    return BeautifulSoup(text, 'html.parser')

def enforce (condition, msg, *args):
    if not condition:
        raise Exception(msg % args)

def parse_department_link (a):
    href = a['href'] #if 'href' in a else ''
    #title = a['title'] if 'title' in a else ''
    match = re.match(r'program.statements/([a-z]+\.html)', href)
    enforce(match, "Unexpected link url: '%s'", href)
    text = a.text.strip()
    if text:
        return text, href

def parse_department_links (links):
    for link in links:
        result = parse_department_link(link)
        if result:
            yield result

def fetch_department_urls (base_url = 'https://registrar.ucsc.edu/catalog/programs-courses'):
    index_url = '%s/index.html'%base_url
    soup = fetch_soup(index_url)
    dept_anchor = soup.find('a', id='departments')
    enforce(dept_anchor, "Could not find '%s/#departments'", index_url)
    header = dept_anchor.parent
    enforce(header.name == "h2", "Unexpected: is not a h2 tag (got '%s')", header.name)
    table = header.findNext('tr')
    enforce(table.name == "tr", "Expected element after heading to be table, not '%s'", table.name)
    return {k: '%s/%s'%(base_url, v) for k, v in parse_department_links(table.find_all('a'))}

if __name__ == '__main__':
    result = fetch_department_urls()
    print("Found %s department(s):"%len(result))
    for k, v in result.items():
        print("%s: %s"%(k, v))