Syllabus Spring 2016
- Classes: Wednesdays 2:40-5:25 PM, Room 903 SSW
- Instructor: Tian Zheng (Office hours: Mondays 1:30-2:30 PM, or by appointment; Room 1007, SSW). Email: tian.zheng@columbia.edu
- TA: Yuting Ma. ym2396@columbia.edu
- Course website: http://courseworks.columbia.edu
Prerequisites for this course include working knowledge of statistics and probability, data mining, and some statistical modeling. Programming in R or Python is required.
This course will incorporate knowledge and skills covered in a statistical curriculum with topics and projects in data science. Programming will be covered using existing tools in R, though students may use tools from other languages. Computing best practices will be taught using test-driven development, version control, and collaboration. Students finish the class with a portfolio on GitHub and a deeper understanding of several core statistical/machine-learning algorithms.
This course will be a project-based, hands-on course in data science. No formal instruction on statistics, data science, or machine learning will be given. Every 2-3 weeks, we will have a mini data project. Groups will be formed randomly and projects will be peer-reviewed.
This course will follow a sequence of four types of activities.
- A. Dataset release, introduction to data science problem, team forming
- B. Lecture/tutorial
- C. Brainstorming, live hacking, code sharing
- D. Team presentation, peer voting, winner announcement
Students will work in randomly formed teams of 4-5. For a meaningful experience in data science, students are expected to collaborate and work together on all stages of a project. Code sharing and brainstorming are great opportunities to learn from each other.
We will have a total of five project cycles for this course:
- Collaborative Kaggle script project.
- NYC open data visualization project.
- Predictive analytics of images.
- Relational (network) data analysis.
- Free topic (multiple data sources will be provided).
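Below is the tentative schedule we will follow. In each entry, the number is the project cycle and the letter is the activity type described above.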
- Week 1 (1/20): 1a+1b
- Week 2 (1/27): 1c
- Week 3 (2/3): 1d+2a
- Week 4 (2/10): 2b+2c
- Week 5 (2/17): 2c
- Week 6 (2/24): 2d+3a
- Week 7 (3/2): 3b+3c
- Week 8 (3/9): 3c
- Week 9 (3/23): 3d+4a
- Week 10 (3/30): 4b+4c
- Week 11 (4/6): 4c
- Week 12 (4/13): 4d+5a+5b
- Week 13 (4/20): 5c
- Week 14 (4/27): 5c+5d
There is no single required text. As part of this course, we will learn from what we can find online and in academic papers. Here are a few recommended reference books.
- Zumel and Mount (2014) Practical data science with R.
- Segaran (2007) Programming collective intelligence: building smart web 2.0 applications.
- Tufte (2001) The visual display of quantitative information.
- Fung (2013) Numbersense: how to use big data to your advantage.