-
Notifications
You must be signed in to change notification settings - Fork 51
/
001_intro.Rmd
executable file
·100 lines (54 loc) · 9.44 KB
/
001_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")
# Introduction
This is an introductory-level course on data analysis. We think that you will like this course if you enjoy facts, graphics and being able to make things by yourself. You will also need to enjoy computing and statistics to some extent. Do not be afraid, neither bite.
## How we got there
We had four different ideas before we launched the course in its current form.
__Our first idea__ was to offer an optional course that would pick you up where your first-semester mathematics and economics courses left you. Since both courses brushed upon linear models as part of their programme on empirical methods, we thought: "Let's write an introductory stats course that builds around linear models, with some background and extensions."
(Linear models are a class of statistical techniques usually referred to as "linear regression". We come back to [linear models](80_lin.html) during the course.)
__Our second idea__ was that this course should function as a workshop where you would learn to produce some of the analysis that you saw in slides and handbooks during your first term. Since the course is new, we thought: "Let's write an introductory stats course using a free software that students can keep using their whole lives."
At that stage, we decided to write the course with R, a software that we will present at more length in [a later section](10_r.html). R is taught at several universities around the world.
__Our third idea__ derived from the complexity of R, which is less a "stats software" than a statistical *computing* software. You can reach pretty deep into programming with R, and this shows up in the way that you need to operate it to make things happen. Since this could also appeal to some, we thought: "Let's write an introductory stat computing course."
(Statistical computing means using your computer to analyze data. [Here's what it looks like](http://www.stat.cmu.edu/~cshalizi/statcomp/) for people your age who study statistics at Carnegie Mellon University.)
__The fourth idea__ came from the quick realisation that we were riding several horses at once: statistics, computing, visualization and social science. So we thought: "Let's write a (whatever) course with R, and give it a name later on." That name ended up being "data analysis".
You'll see that the course still tastes a bit like all of the ideas above: there's a bit of math, lots of stats, tons of computing, and, we hope, even more graphs. But fundamentally, we want to make it a course about data analysis, and have you learn a set of skills in this area.
So, …
## What is data analysis?
Here's [a video](https://www.youtube.com/watch?v=4gIzG-tB22o) by Jeff Leek, an assistant professor in the Biostatistics Department of the Johns Hopkins Bloomberg School of Public Health. The video explains how data analysis relates to other study topics.
<iframe width="800" height="600" src="https://www.youtube.com/embed/4gIzG-tB22o?rel=0" frameborder="0" allowfullscreen></iframe>
Jeff Leek blogs at [Simply Statistics](http://simplystatistics.org/), where his fellow biostatistician Rafael Irizarry has recently explained [why statistics is not mathematics](http://simplystatistics.org/2013/01/11/nsf-should-understand-that-statistics-in-not-mathematics/), On that topic, see also the [links posted by Jerzy Wieczorek](http://civilstat.com/?p=1165), a statistician working at the U.S. Census Bureau.
Here's another 20-minute video by Hans Rosling that will teach you some amazing vital statistics, and also explain how "liberating data" matters for development and organizations like the United Nations, the World Bank, NGOs and national or local governments:
<iframe width="800" height="600" src="https://www.youtube.com/embed/hVimVzgtD6w?rel=0" frameborder="0" allowfullscreen></iframe>
Hans Rosling and his son developed [Gapminder](http://www.gapminder.org/), a fantastic tool to animate data. Rosling is the main protagonist in the BBC documentary *[The Joy of Stats](https://www.youtube.com/watch?v=jbkSRLYSojo)*. He says: "I kid you not, statistics is now the sexiest subject on the planet". [The *New York Times* agrees](http://www.nytimes.com/2009/08/06/technology/06stats.html).
And last, here's an hour-long video of [Andrew Gelman](http://andrewgelman.com/), a professor of statistics and political science at Columbia University, talking at Google about his book *[Rich State Blue State: Why Americans Vote the Way They Do](http://press.princeton.edu/titles/9030.html)* from 2009:
<iframe width="800" height="450" src="https://www.youtube.com/embed/5JYiJwDob1w?rel=0" frameborder="0" allowfullscreen></iframe>
These videos give you a taste of how data, statistics and visualization are affecting the teaching and use of these tools in academia, but also in many other professions. These developments are relevant to all social sciences, whether it be public health, political science, economics…
Finally, this course is heavily influenced by the development of data visualization. To get a taste of what "dataviz" is, see [this selection of methods](http://selection.datavisualization.ch/), read some articles from [Robert Kosara's blog](http://eagereyes.org/) on computational graphics, and see [Simon Jackman's neat graphs](http://jackman.stanford.edu/blog/?p=2620) of Barack Obama's electoral victory in 2012 for a brilliant illustration.
## Who is this for
This course aims at showing how data analysis and visualization works in practice, and how it feeds into current trends in data journalism, open government and reproducible science.
Within that general perspective, we pursue the following teaching goals:
1. __The primary aim of the course__ is to provide you with some skills to understand how statistical computing can contribute to the analysis and visualization of real-world data.
2. __The secondary aim of the course__ is to show you how a healthy dose of programming can help taking into account large amounts of evidence about the economy, politics and society.
3. __Third and last__, this course intends to provide you with a practical skill that will make you a strong applicant at universities that require quantitative methods training.
The skills outlined above will enhance your substantive expertise in the social sciences with computer and quantitative skills. This diagram by [Drew Conway][dc-vd] shows how all these skills collide around data analysis, or "data science":
[![Drew Conway's Data Science Venn Diagram (Conway-Data_Science_VD)](images/data-science-conway.png)][dc-vd]
[dc-vd]: http://www.drewconway.com/zia/?p=2378
__You will find the course challenging__ because you will learn a lot of things at the same time, including some computer programming, a topic that is probably new to you. We assume *zero* previous knowledge and will proceed slowly, in a step-by-step fashion.
## Course outline
The breakup of our twelve sessions together goes roughly as follows:
1. The course starts with the __R software basics__ (Sessions 1--4). We'll cover R syntax, objects and data operators, plus a bit of math.
2. The course continues with a bunch of __statistical methods__ (Sessions 5--8). We'll cover clusters, distributions, hypothesis tests and linear models.
3. The course ends on slightly more advanced __visualization techniques__ (Sessions 9--12). We'll cover time series, networks and maps.
The class will conclude on the professional democratization of data science, possibly with one or more guest speakers who use data in their daily work routines.
The list of topics is tentative at best, and all course sessions are still in the works.
## Course requirements
The main requirements for the course are as follows:
* __You have to be able to provide feedback.__ We are running the course as a kind of teaching experiment, and your preferences in terms of topics will have an influence on the contents of the sessions, so we need your input.
* __You have to be able to learn autonomously.__ Take this course to work on your skills at independent problem-solving, which includes [using Google responsibly](http://agbeat.com/real-estate-technology-new-media/google-cheat-sheet-every-trick-you-need-to-master-search/) to find help and learning trough trial-and-error.
* __You have to be able to deal with coursework.__ Just like any other class, this one assumes that you can handle some weekly readings and exercises, as well as work in groups over a few small-scale projects.
Last thing: start all your __email subjects__ with "`IDA:`" when emailing the instructors, and attach the code and data that correspond to your issue, so that we can recreate it on our end.
## Disclosure
__This course is a teaching experiment__: we really, *really* need your feedback to make it work. Specifically, we'd like you to tell us what you want to do with the skills that we will teach you, so that we can help you get there.
__We'll suggest exercises__ like drawing plots to illustrate your student newspaper, or building small-scale datasets out of student networks. But we are very open to all suggestions: tell us what you want to achieve, and we'll do our best to make it happen if it's reasonable.
__We hope to present the results of this course__ at the [R 2013](http://r2013-lyon.sciencesconf.org/) conference, in order to report on how well (or how badly!) the experiment ran. Your identity and grades will naturally *not* be disclosed, but we will take notes on your feedback during class and use it to write our report.
> __Next__: [Readings](002_readings.html).