update ps-key-12 thru 15 and ps-12 with charlie edits
taliaferrojm committed Oct 10, 2024
1 parent 9723695 commit 1aa2d37
Showing 5 changed files with 74 additions and 146 deletions.
86 changes: 21 additions & 65 deletions problem-set-keys/ps-key-12.qmd
@@ -14,41 +14,16 @@ library(janitor)
library(here)
```

```{r}
biochem <- read_tsv("http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt", show_col_types = FALSE) |>
janitor::clean_names()
# simplify names a bit more
colnames(biochem) <- gsub(pattern = "biochem_", replacement = "", colnames(biochem))
# we are going to simplify this a bit and only keep some columns
keep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]
biochem <- biochem[, keep]
# get weights for each individual mouse
# careful: did not come with column names
weight <- read_tsv("http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight", col_names = FALSE, show_col_types = FALSE)
# add column names
colnames(weight) <- c("subject_name", "weight")
# add weight to biochem table and get rid of NAs
# rename gender to sex
b <- inner_join(biochem, weight, by = "subject_name") |>
na.omit() |>
rename(sex = gender)
```

## Problem \# 1

Does mouse sex explain mouse total cholesterol levels? Make sure to run chunks above.

### 1. Examine and specify the variable(s)
### 1. Examine and specify the variable(s) (1 pt)

> The response variable y is $tot\_cholesterol$\
> The explanatory variable x is $sex$
### Make a plot:
### Make a violin plot: (2 pt)

response variable on the y-axis

@@ -64,7 +39,7 @@ ggviolin(
)
```

### Get n, mean, median, sd
### Get n, mean, median, sd (1 pt)

```{r}
b |>
@@ -75,7 +50,7 @@ b |>
)
```

### Is it normally distributed?
### Is it normally distributed? (1 pt)

```{r}
ggqqplot(
@@ -93,7 +68,7 @@ b |>

> Yes, based on the Q-Q plot and the high $n$, but I understand if you want to play it safe because of the `shapiro_test` p-value.
### Is the variance similar between groups?
### Is the variance similar between groups? (1 pt)

```{r}
b |>
@@ -103,15 +78,15 @@ b |>

> Yes
### What kind of test are you picking and why?
### What kind of test are you picking and why? (1 pt)

> `t_test`, since I think it is normally distributed, with equal variance based on the Levene test
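
To illustrate what the equal-variance assumption changes, here is a minimal sketch in base R. It uses simulated data rather than the mouse table; the `sim` data frame and its values are hypothetical, and base `t.test()` stands in for the `t_test` wrapper used in this key.

```r
# Hypothetical simulated data, NOT the mouse dataset
set.seed(7)
sim <- data.frame(
  sex = rep(c("F", "M"), each = 50),
  tot_cholesterol = c(
    rnorm(50, mean = 3.0, sd = 0.5),
    rnorm(50, mean = 3.4, sd = 0.5)
  )
)

# Student's t-test: assumes equal variance, pools the two group variances
student <- t.test(tot_cholesterol ~ sex, data = sim, var.equal = TRUE)

# Welch's t-test (base R default): no equal-variance assumption
welch <- t.test(tot_cholesterol ~ sex, data = sim)

# Compare degrees of freedom and p-values of the two variants
data.frame(
  test = c("student", "welch"),
  df = c(unname(student$parameter), unname(welch$parameter)),
  p_value = c(student$p.value, welch$p.value)
)
```

When the group variances are close, as a non-significant Levene test suggests, the two variants should give very similar answers; Welch's version just spends some degrees of freedom to stay safe.
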
### 2. Declare null hypothesis $\mathcal{H}_0$
### 2. Declare null hypothesis $\mathcal{H}_0$ (1 pt)

$\mathcal{H}_0$ is that $sex$ does not explain $tot\_cholesterol$

### 3. Calculate test-statistic, exact p-value and plot
### 3. Calculate test-statistic, exact p-value and plot (2 pt)

```{r}
b |>
@@ -143,39 +118,18 @@ ggviolin(
ylim(1, 6)
```

```{r}
# i have pre-selected some families to compare
myfams <- c(
"B1.5:E1.4(4) B1.5:A1.4(5)",
"F1.3:A1.2(3) F1.3:E2.2(3)",
"A1.3:D1.2(3) A1.3:H1.2(3)",
"D5.4:G2.3(4) D5.4:C4.3(4)"
)
# only keep the families in myfams
bfam <- b |>
filter(family %in% myfams) |>
droplevels()
# simplify family names and make factor
bfam$family <- gsub(pattern = "\\..*", replacement = "", x = bfam$family) |>
as.factor()
# make B1 the reference (most similar to overall mean)
bfam$family <- relevel(x = bfam$family, ref = "B1")
```
> can reject null hypothesis
## Problem \# 2

Does mouse family explain mouse total cholesterol levels? Make sure to run chunk above.

### 1. Examine and specify the variable(s)
### 1. Examine and specify the variable(s) (1 pt)

> The response variable y is $tot\_cholesterol$\
> The explanatory variable x is $family$
### Make a plot:
### Make a plot: (2 pt)

response variable on the y-axis

@@ -191,7 +145,7 @@ ggviolin(
)
```

### Get n, mean, median, sd
### Get n, mean, median, sd (1 pt)

```{r}
bfam |>
@@ -202,7 +156,7 @@ bfam |>
)
```

### Is it normally distributed?
### Is it normally distributed? (1 pt)

```{r}
ggqqplot(
@@ -218,27 +172,27 @@ bfam |>
gt()
```

> Yes
> yes
### Is the variance similar between groups?
### Is the variance similar between groups? (1 pt)

```{r}
bfam |>
levene_test(tot_cholesterol ~ family) |>
gt()
```

> Yes
> yes
### What kind of test are you picking and why?
### What kind of test are you picking and why? (1 pt)

> `anova_test`, since I think it is normally distributed and has equal variance
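
As a rough illustration of a one-way ANOVA across several groups, the sketch below uses base R `aov()` on simulated data. The `sim_fam` data frame, its family labels, and its group means are hypothetical; the key itself uses the `anova_test` wrapper on `bfam`.

```r
# Hypothetical simulated data with four groups, NOT the mouse dataset
set.seed(11)
sim_fam <- data.frame(
  family = factor(rep(c("A1", "B1", "D5", "F1"), each = 20)),
  tot_cholesterol = c(
    rnorm(20, mean = 3.0, sd = 0.4),
    rnorm(20, mean = 3.2, sd = 0.4),
    rnorm(20, mean = 3.5, sd = 0.4),
    rnorm(20, mean = 2.9, sd = 0.4)
  )
)

# One-way ANOVA: does group membership explain variation in the response?
fit_aov <- aov(tot_cholesterol ~ family, data = sim_fam)

# ANOVA table: one row for the factor, one for the residuals
aov_table <- summary(fit_aov)[[1]]
aov_table
```

With four groups the factor row has 3 degrees of freedom, and the F statistic compares between-group to within-group variation.
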
### 2. Declare null hypothesis $\mathcal{H}_0$

$\mathcal{H}_0$ is that $family$ does not explain $tot\_cholesterol$
$\mathcal{H}_0$ is that $family$ does not explain $tot\_cholesterol$ (1 pt)

### 3. Calculate test-statistic, exact p-value and plot
### 3. Calculate test-statistic, exact p-value and plot (2 pt)

```{r}
bfam |>
@@ -255,3 +209,5 @@ ggviolin(
) +
stat_anova_test()
```

> My interpretation of the result
50 changes: 19 additions & 31 deletions problem-set-keys/ps-key-13.qmd
@@ -1,6 +1,5 @@
@@ -1,204 +0,0 @@
---
title: "Problem Set Stats Bootcamp - class 12"
title: "Problem Set Stats Bootcamp - class 13"
subtitle: "Hypothesis Testing"
author: "Neelanjan Mukherjee"
editor: visual
@@ -45,7 +44,7 @@ b <- inner_join(biochem, weight, by = "subject_name") |>

Is there an association between mouse calcium and sodium levels?

### 1. Make a scatterplot to inspect variable
### 1. Make a scatterplot to inspect variable --- (2 pts)

```{r}
ggscatter(
@@ -55,7 +54,7 @@ ggscatter(
)
```

### 2. Are they normal (enough)?
### 2. Are they normal (enough)? --- (1 pts)

```{r}
ggqqplot(data = b, x = "calcium")
@@ -71,11 +70,11 @@ Which test will you use and why?

> I will use Pearson, since the Q-Q plots look OK and there are \~1700 observations. Spearman is fine too, especially since sodium looks more like an integer than a continuous variable.
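
The choice between the two coefficients can be sketched with base R `cor.test()`. The data below are simulated and the names `calcium_sim` and `sodium_sim` are hypothetical; the response is rounded to mimic an integer-like measurement with many ties.

```r
# Hypothetical simulated data, NOT the mouse dataset
set.seed(1)
calcium_sim <- rnorm(200, mean = 2.4, sd = 0.15)
# Rounding mimics an integer-like measurement with many tied values
sodium_sim <- round(125 + 8 * calcium_sim + rnorm(200, sd = 2))

# Pearson: measures linear association on the raw values
pearson <- cor.test(calcium_sim, sodium_sim, method = "pearson")

# Spearman: rank-based; exact = FALSE requests the asymptotic p-value,
# because an exact p-value cannot be computed when there are ties
spearman <- cor.test(calcium_sim, sodium_sim, method = "spearman", exact = FALSE)

# Compare the two estimates side by side
c(
  pearson = unname(pearson$estimate),
  spearman = unname(spearman$estimate)
)
```

When the relationship is roughly linear and free of strong outliers, the two estimates tend to be close, which is why either answer is defensible here.
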
### 3. Declare null hypothesis $\mathcal{H}_0$
### 3. Declare null hypothesis $\mathcal{H}_0$ --- (1 pts)

$\mathcal{H}_0$ is that there is no dependency/association between $calcium$ and $sodium$
$\mathcal{H}_0$ is that there is no dependency/association between $calcium$ and $sodium$

### 4. Calculate and plot the correlation on a scatterplot
### 4. Calculate and plot the correlation on a scatterplot --- (2 pts)

```{r}
ggscatter(
@@ -90,27 +89,21 @@ ggscatter(
)
```




## Problem \# 2

## Do mouse calcium levels explain mouse sodium levels? If so, to what extent?
## Do mouse calcium levels explain mouse sodium levels? If so, to what extent?

Use a linear model to do the following:
Use a linear model to do the following:

### 1. Specify the Response and Explanatory variables --- (2 pts)

> The response variable y is sodium
> The explanatory variable x is calcium
### 1. Specify the Response and Explanatory variables --- (2 pts)

> The response variable y is sodium. The explanatory variable x is calcium.
### 2. Declare the null hypothesis --- (1 pts)

> The null hypothesis is calcium levels do not explain/predict sodium levels.
### 3. Use the `lm` function to create a fit (linear model)

### 3. Use the `lm` function to create a fit (linear model) --- (1 pts)

also save the slope and intercept for later

@@ -122,7 +115,7 @@ int <- tidy(fit)[1, 2]
slope <- tidy(fit)[2, 2]
```

### 4. Add residuals to the data and create a plot visualizing the residuals
### 4. Add residuals to the data and create a plot visualizing the residuals --- (1 pts)

```{r augment and viz}
b_fit <- augment(fit, data = b)
@@ -154,19 +147,16 @@ ggplot(
theme_linedraw()
```

### 5. Calculate the $R^2$ and compare to $R^2$ from fit
### 5. Calculate the $R^2$ and compare to $R^2$ from fit --- (2 pts)

$R^2 = 1 - \displaystyle \frac {SS_{fit}}{SS_{null}}$

$SS_{fit} = \sum_{i=1}^{n} (data - line)^2 = \sum_{i=1}^{n} (y_{i} - (\beta_0 \cdot 1 + \beta_1 \cdot x_{i}))^2$

$SS_{null}$ sum of squared errors around the mean of $y$
$SS_{null}$ --- sum of squared errors around the mean of $y$

$SS_{null} = \sum_{i=1}^{n} (data - mean)^2 = \sum_{i=1}^{n} (y_{i} - \overline{y})^2$




```{r rsq}
ss_fit <- sum(b_fit$.resid^2)
ss_null <- sum(
@@ -176,12 +166,11 @@ rsq <- 1 - ss_fit / ss_null
glance(fit) |> select(r.squared)
```
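
The $R^2$ definition above can be checked end to end on a small simulated example. This is only a sketch in base R: the vectors `x` and `y` and the coefficients used to generate them are hypothetical, and `summary()` stands in for the `glance()` call used in the key.

```r
# Hypothetical simulated data, NOT the mouse dataset
set.seed(42)
x <- rnorm(100, mean = 2.4, sd = 0.2)
y <- 120 + 10 * x + rnorm(100, sd = 1.5)

fit_sim <- lm(y ~ x)

# SS_fit: squared residuals around the fitted line
ss_fit_sim <- sum(residuals(fit_sim)^2)

# SS_null: squared deviations around the mean of y
ss_null_sim <- sum((y - mean(y))^2)

# R^2 from the definition, and R^2 as reported by the model summary
rsq_manual <- 1 - ss_fit_sim / ss_null_sim
rsq_lm <- summary(fit_sim)$r.squared

# The two values agree up to floating-point error
all.equal(rsq_manual, rsq_lm)
```

For a model with an intercept, the manual calculation and the reported value are the same quantity, so any mismatch usually points to a mistake in how the residuals or the mean were computed.
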

### 6. Using $R^2$, describe the extent to which calcium explains sodium levels --- (2 pts)

### 6. Using $R^2$, describe the extent to which calcium explains sodium levels
> $calcium$ explains \~68% of variation in $sodium$ levels
> $calcium$ explains ~68% of variation in $sodium$ levels
### 7. Report (do not calculate) the $p-value$ and your decision on the null
### 7. Report (do not calculate) the $p-value$ and your decision on the null --- (1 pts)

```{r }
fit |>
@@ -197,15 +186,14 @@ Calcium levels to predict sodium levels.

What is the association between mouse weight and age levels for different sexes?


### 1. Calculate the Pearson correlation coefficient between weight and age for females and males
### 1. Calculate the Pearson correlation coefficient between weight and age for females and males --- (2 pts)

```{r}
b |>
group_by(factor(sex)) |>
cor_test(weight, age)
```

### 2. Describe your observations
### 2. Describe your observations --- (2 pts)

> The relationship between weight and age is stronger for males than it is for females.