Added plot for relationship between features and labels

thanhquang1988 · Sep 16, 2023 · 8b7d50a
1 parent 435f1ed
Showing 1 changed file with 30 additions and 17 deletions.
diff --git a/2-Regression/4-Logistic/solution/R/lesson_4.Rmd b/2-Regression/4-Logistic/solution/R/lesson_4.Rmd
@@ -177,21 +177,6 @@ baked_pumpkins %>%
   slice_head(n = 5)
 ```
 
-Now let's compare the feature distributions for each label value using box plots. We'll begin by formatting the data to a *long* format to make it somewhat easier to make multiple `facets`.
-
-```{r pivot}
-# Pivot data to long format
-baked_pumpkins_long <- baked_pumpkins %>% 
-  pivot_longer(!color, names_to = "features", values_to = "values")
-
-
-# Print out restructured data
-baked_pumpkins_long %>% 
-  slice_head(n = 10)
-
-```
-
-
 Now, let's make a categorical plot showing the distribution of the predictors with respect to the outcome color!
 
 ```{r cat plot pumpkins-colors-variety}
@@ -208,6 +193,36 @@ ggplot(pumpkins, aes(y = Variety, fill = Color)) +
 
 Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.
 
+### **Analysing relationships between features and label**
+
+```{r}
+
+# Define the color palette
+palette <- c(ORANGE = "orange", WHITE = "wheat")
+
+# We need the encoded Item Size column to use it as the x-axis values in the plot
+pumpkins_select$item_size <- baked_pumpkins$item_size
+
+# Create the grouped box plot
+ggplot(pumpkins_select, aes(x = item_size, y = color, fill = color)) +
+  geom_boxplot() +
+  facet_grid(variety ~ ., scales = "free_x") +
+  scale_fill_manual(values = palette) +
+  labs(x = "Item Size", y = "") +
+  theme_minimal() +
+  theme(strip.text = element_text(size = 12)) +
+  theme(axis.text.x = element_text(size = 10)) +
+  theme(axis.title.x = element_text(size = 12)) +
+  theme(axis.title.y = element_blank()) +
+  theme(legend.position = "bottom") +
+  guides(fill = guide_legend(title = "Color")) +
+  theme(panel.spacing = unit(2.0, "lines")) +
+  theme(strip.text.y = element_text(size = 4, hjust = 0)) 
+
+```
+
+Let's now focus on a specific relationship: Item Size and Color!
+
 #### **Use a swarm plot**
 
 Color is a binary category (Orange or Not), so it is `categorical data`. There are various other ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).
@@ -230,8 +245,6 @@ baked_pumpkins %>%
 Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
 
 
-### **Analysing relationships between features and label**
-
 ## 3. Build your model
 
 Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
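
The hunk ends before the lesson's own splitting code, so the following is only a minimal sketch of the idea in base R, not the lesson's method. The data frame `toy_pumpkins`, the 80/20 proportion, and the seed are all hypothetical stand-ins chosen for illustration.

```r
# Hypothetical toy data standing in for the lesson's pumpkin data
set.seed(2056)
toy_pumpkins <- data.frame(
  item_size = sample(0:6, 100, replace = TRUE),
  color = factor(
    sample(c("ORANGE", "WHITE"), 100, replace = TRUE, prob = c(0.8, 0.2)),
    levels = c("ORANGE", "WHITE")
  )
)

# Randomly pick 80% of the row indices for the training set (assumed proportion)
n_rows <- nrow(toy_pumpkins)
train_idx <- sample(seq_len(n_rows), size = floor(0.8 * n_rows))

# Rows in train_idx train the classifier; the remaining rows are held out for testing
pumpkins_train <- toy_pumpkins[train_idx, ]
pumpkins_test <- toy_pumpkins[-train_idx, ]

# Report the size of each set
nrow(pumpkins_train)
nrow(pumpkins_test)
```

Because the test rows are never seen during training, performance measured on `pumpkins_test` gives a fairer estimate of how the classifier would behave on new pumpkins.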