Commit f67c738

Trying to reduce how much text the chunks put out
Author: Susan Vanderplas
1 parent 2f527a2 commit f67c738

7 files changed: +42 -29 lines changed

_freeze/part-wrangling/08-functional-prog/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.
Image previews not loaded (file sizes shown: 122 KB, 130 KB, 162 KB, 166 KB).

part-wrangling/08-functional-prog.qmd

Lines changed: 35 additions & 10 deletions
@@ -417,7 +417,7 @@ res_tbl <- map_df(res_json, as_tibble) %>%
 writeLines(toJSON(res_tbl, pretty = TRUE), con = "../data/Star_Trek.json")
 ```
 
-![The Movie Database](../images/wrangling/tmdb.svg){fig-alt="The Movie Database logo"}
+![The Movie Database](../images/wrangling/tmdb.svg){fig-alt="The Movie Database logo" width="50%"}
 In this section we'll work with some data gathered from TMDB (the movie database).
 I submitted a query for all movies that Patrick Stewart was involved with, and you can find the resulting JSON file [here](https://raw.githubusercontent.com/srvanderplas/stat-computing-r-python/main/data/Patrick_Stewart.json).
 

@@ -436,9 +436,16 @@ library(jsonlite)
 data_url <- "https://raw.githubusercontent.com/srvanderplas/stat-computing-r-python/main/data/Patrick_Stewart.json"
 
 ps_json <- fromJSON(data_url)
-head(ps_json)
 ```
 
+
+<details><summary>Exploring the output structure</summary>
+```{r}
+# head(ps_json) # This output is too long
+map(ps_json, head) # show the first 6 rows of each element in the list
+```
+</details>
+
 By default, fromJSON does a LOT of heavy lifting for us:
 
 1. Identifying the structure of the top-level data: cast, crew, and id information
@@ -451,28 +458,42 @@ It's hard to explain how *nice* this is to someone who hasn't had to parse this
 library(jsonlite)
 
 ps_messy <- fromJSON(data_url, simplifyVector = T, simplifyDataFrame = F)
+```
+
 
+<details><summary>Exploring the output structure (long version)</summary>
+```{r}
 # Top-level objects (show the first object in the list)
 ps_messy$cast[[1]]
 ps_messy$crew[[1]]
 ps_messy$id
 ```
+</details>
 
 Let's start with the cast list. Most objects seem to be single entries; the only thing that isn't is the `genre_ids` field. So let's see whether we can just convert each list entry to a data frame, and then deal with the `genre_ids` column afterwards.
 
 ```{r, error = T}
 cast_list <- ps_messy$cast
+```
+
 
+<details><summary>Data frame conversion</summary>
+```{r}
 as.data.frame(cast_list[[1]])
+```
+</details>
 
+```{r}
 map(cast_list, as.data.frame)
 ```
 
 Well, that didn't work, but the error message at least tells us what index is causing the problem: 6. Let's look at that data:
 
+<details><summary>Data frame conversion errors</summary>
 ```{r}
-cast_list[[6]]
+cast_list[[6]][1:5]
 ```
+</details>
 
 Ok, so `backdrop_path` is `NULL`, and `as.data.frame` can't handle the fact that some fields are defined (length 1) and others are NULL (length 0). We could possibly replace the NULL with NA first?
 

@@ -482,11 +503,12 @@ fix_nulls <- function(x) {
 }
 
 cast_list_fix <- map(cast_list, fix_nulls)
-cast_list_fix[[6]]
+
+cast_list_fix[[6]][1:5]
 
 map(cast_list_fix, as.data.frame)
 
-cast_list_fix[[8]]
+cast_list_fix[[8]][1:5]
 ```
 
 Ok, well, this time, we have an issue with position 8, and we have an empty list of genre_ids.
@@ -500,18 +522,18 @@ fix_nulls <- function(x) {
 }
 
 cast_list_fix <- map(cast_list, fix_nulls)
-cast_list_fix[[8]]
+cast_list_fix[[8]][1:5]
 
 cast_list_df <- map_df(cast_list_fix, as.data.frame)
-cast_list_df
+cast_list_df[1:10, 1:5]
 ```
 
 We still have too many rows for each entry because of the multiple `genre_ids`.
 But we can fix that with the `nest` command.
 
 ```{r}
 cast_list <- nest(cast_list_df, genre_ids = genre_ids )
-cast_list
+cast_list[1:10,c(1:4, 17)]
 ```
 
 Then, we'd have to apply this whole process to the crew list as well.
@@ -522,7 +544,7 @@ crew_list <- ps_messy$crew
 crew_list_fix <- map(crew_list, fix_nulls)
 crew_list_df <- map_df(crew_list_fix, as.data.frame)
 crew_list <- nest(crew_list_df, genre_ids = genre_ids )
-crew_list
+crew_list[1:5,c(1:4, 17)]
 ```
 
 Ok, so that actually worked, but only because the structure of the crew data is the same as the structure of the cast data.
@@ -560,6 +582,9 @@ If we read the [documentation for read_json](https://pandas.pydata.org/docs/refe
 ```{python}
 patrick_stewart = pd.read_json(data_url, typ='series', orient = 'records')
 
+# List the objects
+patrick_stewart.index
+
 # First item in the cast list
 patrick_stewart.cast[0]
 ```
@@ -595,7 +620,7 @@ ps_movies[['id', 'original_title', 'character', 'job']].sort_values(['id'])
 ::: callout-tip
 
 ### Try It Out: JSON File Parsing
-![The Movie Database API](../images/wrangling/tmdb.svg){fig-alt="The Movie Database logo"}
+![The Movie Database](../images/wrangling/tmdb.svg){fig-alt="The Movie Database logo" width="50%"}
 
 I used TMDB to find all movies resulting from the query "Star Trek" and stored the resulting JSON file [here](https://raw.githubusercontent.com/srvanderplas/stat-computing-r-python/main/data/Star_Trek.json).
 

renv/activate.R

Lines changed: 5 additions & 17 deletions
@@ -63,10 +63,6 @@ local({
 if (is.environment(x) || length(x)) x else y
 }
 
-`%??%` <- function(x, y) {
-if (is.null(x)) y else x
-}
-
 bootstrap <- function(version, library) {
 
 # attempt to download renv
@@ -87,22 +83,11 @@ local({
 
 renv_bootstrap_repos <- function() {
 
-# get CRAN repository
-cran <- getOption("renv.repos.cran", "https://cloud.r-project.org")
-
 # check for repos override
 repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA)
-if (!is.na(repos)) {
-
-# check for RSPM; if set, use a fallback repository for renv
-rspm <- Sys.getenv("RSPM", unset = NA)
-if (identical(rspm, repos))
-repos <- c(RSPM = rspm, CRAN = cran)
-
+if (!is.na(repos))
 return(repos)
 
-}
-
 # check for lockfile repositories
 repos <- tryCatch(renv_bootstrap_repos_lockfile(), error = identity)
 if (!inherits(repos, "error") && length(repos))
@@ -119,7 +104,10 @@ local({
 repos <- getOption("repos")
 
 # ensure @CRAN@ entries are resolved
-repos[repos == "@CRAN@"] <- cran
+repos[repos == "@CRAN@"] <- getOption(
+"renv.repos.cran",
+"https://cloud.r-project.org"
+)
 
 # add in renv.bootstrap.repos if set
 default <- c(FALLBACK = "https://cloud.r-project.org")
