Raphael Gottardo
December 26, 2014
What is EDA?
- Statistical practice concerned with (among other things): uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop models
- Named by John Tukey
- Extremely Important
R provides a powerful environment for EDA and visualization
- Mostly graphical
- Plotting the raw data (histograms, scatterplots, etc.)
- Plotting simple statistics such as means, standard deviations, medians, box plots, etc
- Positioning such plots so as to maximize our natural pattern-recognition abilities
- A clear picture is worth a thousand words!
- Avoid 3-D graphics
- Don’t show too much information on the same graph (color, patterns, etc)
- Stay away from Excel, Excel is not a statistics package!
- R provides a great environment for EDA with good graphics capabilities
Generic function for plotting of objects in R is plot
. The output will vary depending on the object type (e.g. scatter-plot, pair-plot, etc)
# Load the Iris data
data(iris)
plot(iris)
R provides many other graphics capabilities, such as histograms, boxplots, etc.
hist(iris$Sepal.Length)
boxplot(iris)
Anything wrong with the Species boxplot?
While R's graphics is very powerful and fully customizable, it can be difficult and painful to get to the desired plot.
For example, there are many different parameters, not necessarily consistent across function, etc.
Friends don't let friends use R base graphics!
ggplot2 provides a much richer and consistent environment for graphics in R.
Description from Hadley Wickham:
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
Provides lots of different geometrics for data visualization:
See http://docs.ggplot2.org/current/ for details.
ggplot2 provides two main APIs: qplot
and ggplot
qplot
is basically a replacement for plot
but provide a cleaner output with a consistent interface.
# Load library
library(ggplot2)
# Using the variables directly
qplot(iris$Sepal.Length, iris$Sepal.Width)
# Can also use a dataframe
qplot(Sepal.Length, Sepal.Width, data = iris)
Plot types can be specified with the geom
option
# Points
qplot(Sepal.Length, Sepal.Width, data = iris, geom = "point")
# line
qplot(Sepal.Length, Sepal.Width, data = iris, geom = "line")
Plot types can be specified with the geom
option, multiple types can be combined
# Points and line
qplot(Sepal.Length, Sepal.Width, data = iris, geom = c("line", "point"))
# Points and line
qplot(x = Species, y = Sepal.Length, data = iris, geom = c("boxplot", "point"))
Notice anything wrong on the plot above? Try changing the point
option with jitter
.
In ggplot2, additional variables can be mapped to plot aesthetics including color
, fill
, shape
, size
, alpha
, linetype
.
# Points and line
qplot(x = Species, y = Sepal.Length, data = iris, geom = c("boxplot", "jitter"),
color = Sepal.Width)
# Points and line
qplot(x = Sepal.Width, y = Sepal.Length, data = iris, geom = c("point", "smooth"),
color = Species, size = Petal.Width, method = "lm")
Sometimes it can be convenient to visualize some characteristic of a dataset conditioning on the levels of some other variable.
Such feature is readily available in ggplot2 using the facets
argument.
# Points and line
qplot(x = Sepal.Width, y = Sepal.Length, data = iris, geom = c("point", "smooth"),
color = Species, size = Petal.Width, method = "lm", facets = ~Species)
Or if you prefer to facet by rows
# Points and line
qplot(x = Sepal.Width, y = Sepal.Length, data = iris, geom = c("point", "smooth"),
color = Species, size = Petal.Width, method = "lm", facets = Species ~ .)
It is often necessary to reshape (e.g. pivot) your data before analysis. This can easily be done in R using the reshape2
package.
This package provides main functions melt
and *cast
. melt
basically "melts" a dataframe in wide format into a long format. *cast
goes in the other direction.
Let's revisite our iris
dataset.
# We first load the library
library(reshape2)
# Only display the first few lines
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We can see in the data above that we are measuring both width and length on two different flower characteristics: Sepal, and Petal. So we could store the same information with only one length (resp. width) column and an additional variable for type (Sepal/Petal).
The melt
function provides some good default options that will try to best guess how to "melt" the data.
# We first need to add a column to keep track of the flower
iris$flower_id <- rownames(iris)
# Default options
iris_melted <- melt(iris)
## Using Species, flower_id as id variables
head(iris_melted)
## Species flower_id variable value
## 1 setosa 1 Sepal.Length 5.1
## 2 setosa 2 Sepal.Length 4.9
## 3 setosa 3 Sepal.Length 4.7
## 4 setosa 4 Sepal.Length 4.6
## 5 setosa 5 Sepal.Length 5.0
## 6 setosa 6 Sepal.Length 5.4
# We first split that variable to get the columns we need
split_variable <- strsplit(as.character(iris_melted$variable), split = "\\.")
# Create two new variables
iris_melted$flower_part <- sapply(split_variable, "[", 1)
iris_melted$measurement_type <- sapply(split_variable, "[", 2)
# Remove the one we don't need anymore
iris_melted$variable <- NULL
head(iris_melted)
## Species flower_id value flower_part measurement_type
## 1 setosa 1 5.1 Sepal Length
## 2 setosa 2 4.9 Sepal Length
## 3 setosa 3 4.7 Sepal Length
## 4 setosa 4 4.6 Sepal Length
## 5 setosa 5 5.0 Sepal Length
## 6 setosa 6 5.4 Sepal Length
This is close but not quite what we want, let's see if cast can help us do what we need.
Use acast
or dcast
depending on whether you want vector/matrix/array output or data frame output. Data frames can have at most two dimensions.
iris_cast <- dcast(iris_melted, formula = flower_id + Species + flower_part ~ measurement_type)
head(iris_cast)
## flower_id Species flower_part Length Width
## 1 1 setosa Petal 1.4 0.2
## 2 1 setosa Sepal 5.1 3.5
## 3 10 setosa Petal 1.5 0.1
## 4 10 setosa Sepal 4.9 3.1
## 5 100 versicolor Petal 4.1 1.3
## 6 100 versicolor Sepal 5.7 2.8
Q: Why are the elements of flower_id
not properly ordered?
melt
and *cast
are very powerful. These can also be used on data.tables
. More on this latter.
Exercise: Try to reorder the variable names in the formula. What happens?
Using our long format dataframe, we can further explore the iris dataset.
# We can now facet by Species and Petal/Sepal
qplot(x = Width, y = Length, data = iris_cast, geom = c("point", "smooth"), color = Species,
method = "lm", facets = flower_part ~ Species)
It would be nice to see if we could have free scales for the panels, but before we explore this, let's talk about the ggplot
API as an alternative to qplot. Can we also customize the look and feel?
ggplot2
provides another API via the ggplot
command, that directly implements the idea of a "grammar of graphics".
The grammar defines the components of a plot as:
- a default dataset and set of mappings from variables to aesthetics,
- one or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings,
- one scale for each aesthetic mapping used,
- a coordinate system,
- the facet specification.
ggplot(data=iris_cast, aes(x=Width, y=Length))+ # Add points and use free scales in the facet
geom_point()+facet_grid(Species~flower_part, scale="free")+
# Add a regression line
geom_smooth(method="lm")+
# Use the black/white theme and increase the font size
theme_bw(base_size=24)
Let's try to map one of the variable to the shape
my_plot <- ggplot(data = iris_cast, aes(x = Width, y = Length, shape = flower_part,
color = flower_part)) + # Add points
geom_point() + facet_grid(~Species) + geom_smooth(method = "lm")
my_plot
Note: the ggplot
API requires the use of a dataframe. Many different layers can be added to obtain the desired results. Different data can be used in the different layers.
Exercise: Your turn to try! Try to facet by flower_part
and use Species as an aesthetic variable. Try to use facet_wrap
instead of facet_grid
.
Excel theme
library(ggthemes)
my_plot + theme_excel(base_size = 24)
Wall Street Journal theme
my_plot + theme_wsj(base_size = 18)
With ggplot2 you can create your own theme, and save it if you want to reuse it later.
my_plot + # Polar coordinate!
coord_polar() + theme(legend.background = element_rect(fill = "pink"), text = element_text(size = 24))
Please look at the documentation for more details. You have no more excuses to create boring graphs!
Here are some good references for mastering ggplot2
-
R Graphics Cookbook by Winston Chang
-
ggplot2 recipes by Winston Chang (Free online)
-
ggplot2: Elegant Graphics for Data Analysis (Use R!) by Hadley Wickham
Please pay attention to the fonts, fontsize and colors you use. You may want to look at:
-
Colorbrewer, palettes available through the
RColorBrewer
package.ggplot2
provides options to use Colorbrewer palettes.