Commit

Update plots and question phrasing, closes OpenIntroStat#124
mine-cetinkaya-rundel committed Sep 13, 2021
1 parent c3bb328 commit 92b65e7
Showing 1 changed file with 44 additions and 30 deletions.
74 changes: 44 additions & 30 deletions 02-explore/04-lesson/02-04-lesson.Rmd
@@ -16,14 +16,17 @@ library(gapminder)
library(emo)
library(patchwork)
knitr::opts_chunk$set(warning = FALSE,
message = FALSE,
echo = FALSE,
fig.height = 3,
fig.width = 5,
fig.align = "center")
knitr::opts_chunk$set(
echo = FALSE,
fig.width = 6,
fig.asp = 0.618,
fig.retina = 3,
dpi = 300,
out.width = "80%",
fig.align = "center"
)
options(max.print=4000)
options(max.print = 4000)
email <- email %>%
mutate(spam = if_else(spam == 0, "not spam", "spam"))
@@ -34,6 +37,8 @@ comics <- read_csv("data/comics.csv") %>%

## Emails

###

In this lesson, you'll get a chance to put to use what you know about exploratory data analysis in exploring a new dataset.

The `email` dataset contains information on all of the emails received by a single email account in a single month. Each email is a case, so we see that this email account received 3,921 emails. Twenty-one different variables were recorded on each email, some of which are numerical and others categorical. Of primary interest is the first column, a categorical variable indicating whether or not the email is `spam`. This was created by actually reading through each email individually and deciding if it looked like spam or not. The subsequent columns were easier to create. `to_multiple` is TRUE if the email was addressed to more than one recipient and FALSE otherwise. `image` is the number of images attached in the email. It's important that you have a full sense of what each of the variables means, so you'll want to start your exercises by reading about them in the [data set description](https://www.openintro.org/data/index.php?data=email).
@@ -219,7 +224,7 @@ email %>%
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram() +
geom_histogram(binwidth = 0.5) +
facet_wrap(~spam)
```

@@ -255,7 +260,7 @@ p1 <- email %>%
p2 <- email %>%
mutate(log_exclaim_mess = log(exclaim_mess+0.01)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram()+
geom_histogram(binwidth = 0.5)+
facet_wrap(~spam)
p1 / p2
@@ -265,13 +270,13 @@ p1 / p2

One approach says that there are two mechanisms going on: one generating the zeros and the other generating the non-zeros, so we will analyze these two groups separately. A simpler approach is to think of the variable as a two-level categorical variable (zero or not zero) and to treat it as such.
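
The simpler approach described above can be sketched in base R. This is only an illustration: the vector below is hypothetical toy data standing in for `exclaim_mess`, not values from the `email` dataset, and the lesson's own code uses `mutate()` rather than `ifelse()`.

```r
# Hypothetical toy counts standing in for `exclaim_mess` (not the real data).
exclaim_mess <- c(0, 0, 3, 1, 0, 12, 0, 2)

# Collapse the numeric variable into a two-level categorical variable.
zero <- ifelse(exclaim_mess == 0, "zero", "not zero")

# Tabulate the two levels, as one would for any categorical variable.
zero_counts <- table(zero)
print(zero_counts)
```

Once the variable is reduced to two levels, the usual tools for categorical data (tables and bar charts) apply directly.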

```{r 9}
```{r 9, fig.asp = 0.5}
email2 <- email %>%
mutate(zero = exclaim_mess == 1, log_exclaim_mess = log(exclaim_mess))
email2 %>%
ggplot(aes(x = log_exclaim_mess, fill = zero)) +
geom_histogram()+
geom_histogram(binwidth = 0.5)+
facet_wrap(~spam)+
theme(legend.position = "none")
```
@@ -280,7 +285,7 @@ email2 %>%

To do so, we create a new variable, which we'll call `zero`, and use it to color in our bar chart.

```{r facet}
```{r facet, fig.asp = 0.5}
email %>%
mutate(zero = exclaim_mess == 0) %>%
ggplot(aes(x = zero, fill = zero)) +
@@ -576,7 +581,7 @@ That led to this bar chart, which was informative, but you might caused you to d

Let's go through how we would flip the ordering of those bars so that they agree with the plot on the left.

```{r switch-bars}
```{r switch-bars, out.width = "100%"}
knitr::include_graphics("images/switch-bars.png")
```
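
One way the bar order can be flipped is by setting factor levels explicitly, since bars are drawn in level order. The sketch below uses base R's `factor()` on a hypothetical logical vector; it is an assumption about the general technique, not necessarily the function the lesson itself uses.

```r
# Hypothetical logical vector standing in for `exclaim_mess == 0`.
zero <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# By default, the levels are sorted, so "FALSE" comes before "TRUE".
default_order <- factor(zero)
print(levels(default_order))

# Supplying `levels` explicitly puts "TRUE" first instead.
flipped_order <- factor(zero, levels = c("TRUE", "FALSE"))
print(levels(flipped_order))
```

A bar chart built from `flipped_order` would then draw the `TRUE` bar before the `FALSE` bar.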

@@ -601,7 +606,7 @@ email <- email %>%

Now, when we go to make the plot with the same code, the order of the bars is as we want it.

```{r echo=TRUE}
```{r echo=TRUE, fig.asp = 0.5}
email %>%
ggplot(aes(x = zero)) +
geom_bar() +
@@ -627,6 +632,7 @@ To explore the association between this variable and `spam`, select and construc
Let's practice constructing a faceted bar plot.

- Construct a faceted bar plot of the association between `number` and `spam`.
- You should facet by `number`

```{r ex6, exercise = TRUE}
# Construct plot of
@@ -637,25 +643,32 @@ ___(email, ___

```{r ex6-solution}
# Construct plot of number_reordered
ggplot(email, aes(x = number)) +
ggplot(email, aes(x = spam)) +
geom_bar() +
facet_wrap(~ spam)
facet_wrap(~ number)
```

### What's in a number interpretation

```{r mc4-pre}
Below are two plots showing the relationship between `number` and `spam`.

```{r mc4-pre1, fig.asp = 0.5}
ggplot(email, aes(x = spam)) +
geom_bar() +
facet_wrap(~number)
```

```{r mc4-pre2, fig.asp = 0.5}
ggplot(email, aes(x = number)) +
geom_bar() +
facet_wrap(~spam)
```

```{r mc4}
question("Which of the following interpretations of the plot **is not** valid?",
answer("Given that an email contains a small number, it is more likely to be
not spam"),
question("Which of the following interpretations of the above plots is *not* valid?",
answer("Given that an email contains a small number, it is more likely to be not spam"),
answer("Given that an email contains no number, it is more likely to be spam.", correct = TRUE),
answer("Given that an email contains a big number, it is more likely to be not spam.", correct = TRUE),
answer("Given that an email contains a big number, it is more likely to be not spam."),
answer("Within both spam and not spam, the most common number is a small one.", message = "Wrong!"),
incorrect = "Given the kind of number an email contains, we can't say whether it'll be more or less likely to be spam.",
allow_retry = TRUE
@@ -679,32 +692,33 @@ comics %>%

On the pie chart, it's quite difficult to tell how the `NA` group compares to the `Public` group, but this comparison is much easier in the bar plot. Also, `Unknown` is barely visible in the bar plot and almost not there in the pie chart.

```{r}
```{r out.width = "100%"}
p1 <- comics %>%
count(id) %>%
ggplot(aes(x = "", y = n, fill = id))+
geom_bar(stat="identity", width=1)+
coord_polar("y", start=0) +
geom_bar(stat="identity", width = 1)+
coord_polar("y", start = 0) +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank())
p2 <- ggplot(comics, aes(x = id, fill = id)) +
geom_bar() +
guides(fill = FALSE)
p2 <- ggplot(comics, aes(y = id, fill = id)) +
geom_bar(show.legend = FALSE)
p1 / p2
p1 + p2
```


### Faceting vs. stacking

We saw how the story can change, depending on whether we're visualizing counts or proportions.
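
To make the counts-versus-proportions distinction concrete, here is a minimal base R sketch. The two vectors are hypothetical toy data that only borrow the names `align` and `id` from the `comics` dataset; none of the numbers come from the real data.

```r
# Hypothetical toy data; the names mirror `comics$align` and `comics$id`.
align <- c("Good", "Good", "Good", "Bad", "Bad", "Bad", "Bad", "Bad")
id <- c("Secret", "Public", "Public", "Secret", "Secret", "Secret", "Public", "Secret")

# Counts: how many characters fall in each align/id combination.
counts <- table(align, id)
print(counts)

# Proportions within each alignment (each row sums to 1).
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

In this toy example the `Bad` group has more characters overall, so it dominates the counts, while the row proportions show how the `id` split differs within each alignment regardless of group size.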

```{r fig.width=10}
```{r out.width = "100%"}
p1 <- ggplot(comics, aes(y = id))+
geom_bar() +
facet_wrap(~align, nrow = 1)
facet_wrap(~align, nrow = 1,
labeller = labeller(align = label_wrap_gen(10))) +
scale_x_continuous(guide = guide_axis(check.overlap = TRUE))
p2 <- ggplot(comics, aes(y = align, fill = id)) +
geom_bar()
