Syllabus Spring 2016

W4249 Spring 2016 Applied Data Science

Department of Statistics, Columbia University

Course Information

Classes: Wednesdays 2:40-5:25 PM, Room 903 SSW
Instructor: Tian Zheng (Office hours: Mondays 1:30-2:30 PM, or by appointment; Room 1007, SSW). Email: tian.zheng@columbia.edu
TA: Yuting Ma. ym2396@columbia.edu
Course website: http://courseworks.columbia.edu

Prerequisites

Prerequisites for this course include a working knowledge of statistics and probability, data mining, and some statistical modeling. Programming in R or Python is required.

Description

This course combines the knowledge and skills covered in a statistics curriculum with topics and projects in data science. Programming will be covered using existing tools in R, though students may use tools from other languages. Computing best practices will be taught through test-driven development, version control, and collaboration. Students finish the class with a portfolio on GitHub and a deeper understanding of several core statistical and machine-learning algorithms.

This is a project-based, hands-on course in data science. No formal instruction in statistics, data science, or machine learning will be given. Every 2-3 weeks, we will have a mini data project. Groups will be formed randomly, and projects will be peer-reviewed.

Course organization

This course will follow a sequence of four types of activities.

A. Dataset release, introduction to data science problem, team forming
B. Lecture/tutorial
C. Brainstorming, live hacking, code sharing
D. Team presentation, peer voting, winner announcement

Students will work in randomly formed teams of 4-5. For a meaningful experience in data science, students are expected to collaborate and work together on all stages of a project. Code sharing and brainstorming are great opportunities to learn from each other.

We will have a total of five project cycles for this course:

1. Collaborative Kaggle script project.
2. NYC open data visualization project.
3. Predictive analytics of images.
4. Relational (network) data analysis.
5. Free topic (multiple data sources will be provided).

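Below is the tentative schedule we will follow. Each entry gives a project number (1-5, in the order listed above) followed by an activity letter (a-d, corresponding to activities A-D above); for example, 1c is the brainstorming, live hacking, and code sharing session for the first project.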

Week 1 (1/20): 1a+1b
Week 2 (1/27): 1c
Week 3 (2/3): 1d+2a
Week 4 (2/10): 2b+2c
Week 5 (2/17): 2c
Week 6 (2/24): 2d+3a
Week 7 (3/2): 3b+3c
Week 8 (3/9): 3c
Week 9 (3/23): 3d+4a
Week 10 (3/30): 4b+4c
Week 11 (4/6): 4c
Week 12 (4/13): 4d+5a+5b
Week 13 (4/20): 5c
Week 14 (4/27): 5c+5d

Textbook

There is no single required text. As part of this course, we will learn from what we can find online and in academic papers. Here are a few recommended reference books.

Zumel and Mount (2014) Practical data science with R.
Segaran (2007) Programming collective intelligence: building smart web 2.0 applications.
Tufte (2001) The visual display of quantitative information.
Fung (2013) Numbersense: how to use big data to your advantage.
