Commit b12a49d

author
Susan Vanderplas
committed
Update with additional try it out
1 parent f67c738 commit b12a49d

2 files changed, +80 -2 lines changed

_freeze/part-wrangling/08-functional-prog/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

part-wrangling/08-functional-prog.qmd

Lines changed: 78 additions & 0 deletions
@@ -791,5 +791,83 @@ chickens.head()
:::


::: callout-tip
### Try It Out: Cleaning Chicken Data

::: panel-tabset

#### Problem

Unnest the chicken breed facts data and clean up the responses.
Which cleaning tasks are best suited to a functional programming approach?

#### R solution

```{r}
# Column names in breed_facts differ too much across rows to unnest directly:
# chickens_exp <- chickens |> unnest('breed_facts', names_sep='facts')

fix_names <- function(df) {
  if (!is.null(df)) {
    names(df) <- names(df) |>
      str_to_title() |>
      # Remove anything that isn't a letter, including spaces.
      # ("[^A-z]" would also keep [ \ ] ^ _ and `, which sit between Z and a.)
      str_remove_all("[^A-Za-z]") |>
      # Replacements are applied in order, so "BroodyS" is fixed before "Broody".
      str_replace_all(c(
        "CountryOfOrigin?" = "Origin", "Weights" = "Weight", "Tlc" = "TLC",
        "Albc" = "ALBC", "Apa" = "APA", "BroodyS" = "Broody",
        "Temperment" = "Temperament", "Broody" = "Broody_facts",
        "Purpose" = "Purpose_facts"
      )) |>
      str_remove_all("Shell|FarmSource|SourceFarm|Small|PoultryShow") |>
      str_replace_all("^$", "xxx") # replace blank names with xxx
    df
  } else {
    return(NULL)
  }
}
chickens_fix <- chickens |>
  mutate(breed_facts = map(breed_facts, fix_names))

# Test names
chickens_fix$breed_facts |> map(names) |> unlist() |> unique()
```

We've fixed some of the misspellings and duplications. The remaining Rooster, Pullet, and Cockerel names are most likely parsing artifacts from the Weight field, but that's the reality of working with data gathered from the internet.

```{r}
chickens_exp <- chickens_fix |> unnest("breed_facts")

head(chickens_exp[, c(1, 16:37)])
```
837+
838+
There's still quite a bit of cleaning left to do to get this data to be "pretty".
839+
840+
```{r}
841+
tidy_col <- function(x, text = "(?:\\(estimates only, see FAQ\\))|(?:^APA)|(?:^TLC)|EggSize|(?:Fertility Percentage)|(?:Purpose and Type)") {
842+
str_remove_all(x, "[\u0600-\u06FF]") |> # Remove non-ascii characters
843+
str_remove_all("[®™Â–]") |>
844+
str_remove_all(text) |>
845+
str_remove_all("[:\\.\\?!\\*]") |>
846+
str_replace_all("\u0094", "-") |>
847+
str_replace_all("-{1,}", "-") |>
848+
str_squish()
849+
}
850+
851+
tmp <- mutate(chickens_exp, across(Class:Purpose_facts, tidy_col))
852+
853+
head(select(tmp, 1, Class:Purpose_facts))
854+
```
855+
856+
If we consider the use of `across()` as a functional programming technique (which it is), then it is much easier to create a generic `tidy_col` function than to tidy each column individually. There are probably a few things we've missed, but the data looks decent for the amount of time we put in.
857+
858+
#### Python
859+
860+
```{python}
861+
import pandas as pd
862+
863+
```
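
As a possible starting point, the name-cleaning step from the R solution can be sketched in Python with only the standard library. This is a minimal sketch, not the finished solution: `REPLACEMENTS` holds just a few of the R rules, the helper names `fix_name` and `fix_names` are illustrative, and the sample records are made up to stand in for the nested `breed_facts` column.

```python
import re

# A few of the replacement rules from the R solution (illustrative subset),
# applied in order.
REPLACEMENTS = [
    (r"Weights", "Weight"),
    (r"Tlc", "TLC"),
    (r"Temperment", "Temperament"),
]


def fix_name(name: str) -> str:
    """Normalise one column name: title-case, keep letters only, apply fixes."""
    cleaned = re.sub(r"[^A-Za-z]", "", name.title())
    for pattern, replacement in REPLACEMENTS:
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned or "xxx"  # blank names become "xxx", as in the R version


def fix_names(record):
    """Rename the keys of one breed-facts record; pass None through unchanged."""
    if record is None:
        return None
    return {fix_name(key): value for key, value in record.items()}


# Made-up records standing in for the nested breed_facts column.
breed_facts = [
    {"egg color:": "Brown", "temperment": "Docile"},
    None,
    {"weights": "6 lbs"},
]

# map() applies the same function to every record, like purrr::map() in R.
fixed = list(map(fix_names, breed_facts))
```

Keeping `fix_name` separate from `fix_names` means the per-name logic can be checked on single strings before it is mapped over the whole column.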
864+
865+
XXX TODO
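
The `across(Class:Purpose_facts, tidy_col)` idea from the R solution, one generic function applied to many columns, might look roughly like this in plain Python. It is a sketch under assumptions: rows are represented as dictionaries rather than a pandas data frame, `tidy_value` covers only part of what `tidy_col` does, and the sample rows are invented.

```python
import re


def tidy_value(value):
    """Clean one cell: drop stray punctuation, collapse hyphens and whitespace."""
    if not isinstance(value, str):
        return value  # leave missing or non-text values untouched
    value = re.sub(r"[:.?!*]", "", value)  # drop stray punctuation
    value = re.sub(r"-+", "-", value)      # collapse repeated hyphens
    return " ".join(value.split())         # squish whitespace, like str_squish()


def tidy_columns(rows, columns):
    """Apply tidy_value to the chosen columns of every row, returning new rows."""
    return [
        {key: tidy_value(val) if key in columns else val for key, val in row.items()}
        for row in rows
    ]


# Invented rows; only the selected columns are cleaned.
rows = [
    {"Breed": "Breed A", "Class": "All  Other: ", "Origin": "Somewhere?"},
    {"Breed": "Breed B", "Class": "Large--Fowl", "Origin": None},
]
tidy = tidy_columns(rows, columns={"Class", "Origin"})
```

As in the R version, the cleaning rules live in one place, so adding a rule to `tidy_value` updates every selected column at once.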
866+
867+
:::
868+
869+
:::
870+
871+
794872

795873
## References
