-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework problems
example in "Getting Started"
#1431
base: main
Are you sure you want to change the base?
Conversation
@@ -142,15 +142,6 @@ mtcars_spec | |||
cols_condense(mtcars_spec) | |||
``` | |||
|
|||
|
|||
By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer true so it's been removed. Now, readr (via vroom) selects semi-randomized rows to select for guess, and always includes the last row for guessing. So increasing guess_max
won't necessarily resolve problems in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not reading the whole vignette at once, just these changed snippets. So you have a better sense of flow and what's currently included / dropped / moved.
But I wonder if, instead of deleting this text entirely, we should update it to make it correct. I.e. still make sure the user knows we default to using 1000 rows, which helps with speed but can lead to incorrect guesses. And that increasing that number might help, but the best solution is to specify the column types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rows are not selected in a random fashion. They're basically evenly spaced within the file and the last row is always included for guessing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but I felt like being overly specific would be more confusing. I can modify to say "interspersed through-out the file and always includes the first and last row" instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this isn't technically true since guess_max
doesn't guess column types using rows it will never read and once we complete #1436 then n_max = guess_max
. So if guess_max = 1500
then n_max = 1500
and it will guess using the first 1500 rows in the data. So spec_*()
guessing will never be interspersed throughout the file, it'll always use all of the data it is given to guess. Here is an example using the original challenge.csv
where the column y
is filled with NA
until row 1001.
# after we complete #1436
spec_csv(readr_example("challenge.csv"), guess_max = 1000) # n_max = 1000
#> cols(
#> x = col_double(),
#> y = col_logical()
#> )
# increasing guess_max does change the spec
spec_csv(readr_example("challenge.csv"), guess_max = 1001) # n_max = 1001
#> cols(
#> x = col_double(),
#> y = col_date(format = "")
#> )
# compared to how guessing works in read_csv
# since it's dispersed throughout the file, it guesses correctly
spec(read_csv(readr_example("challenge.csv"), guess_max = 1000, show_col_types = FALSE)) # n_max = all the data
#> cols(
#> x = col_double(),
#> y = col_date(format = "")
#> )
So, how specific should the description for guessing be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The statement should not be technically misleading (e.g. suggesting that it's random and might select different rows each time you call it), but otherwise it's fine to stay high-level. So, yes, something along these lines: "interspersed through-out the file and always includes the first and last row".
Recall that we did create an entire new vignette on column type guessing, in case it makes sense to link to that (or borrow test from that), for more details: https://readr.tidyverse.org/articles/column-types.html.
To zoom out, I think a guiding principle is to avoid making this text so specific that it's tightly coupled with detail work like #1436. What level of understanding and detail is useful for the typical user?
The observation above about n_max
and guess_max
in spec()
brings up an important question to think about over in #1436. Should spec_*()
produce the same column type guesses as read_*()
at default settings?
```{r} | ||
spec(df1) | ||
spec(df2) | ||
In this example, one of the dates is in an incorrect format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could use some garnishing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you have in mind?
vignettes/readr.Rmd
Outdated
df1 <- read_csv(readr_example("challenge.csv"), col_types = list( | ||
date = col_date(), | ||
value = col_double() | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this would still generate parsing errors even if we don't include/force col_types
so we could just have the following if we think it makes for a better narrative:
df1 <- read_csv(readr_example("challenge.csv"))
But I think it does drive the point across that supplying a col_spec
is safer.
Notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some comments inline. Overall, it's looking good: removing lots of incorrect content and updating it.
Zooming out:
Can you describe what changes you made to challenge.csv? And how does that change what we see in the rendered vignette? I can't tell that from reading this PR, i.e. I would have to fetch the PR and render the vignette.
One other concern: changing a dataset is potentially a breaking change. I.e. if someone, somewhere, uses the existing challenge.csv dataset, this will break their code. Did you do any research into this? For example, searching CRAN packages for usage of this dataset and just googling for "challenge.csv".
I could imagine this is basically an internal matter and it's OK to change, but we should do some due diligence. It could also show up when we run revdeps, but it's nicer to learn this sooner rather than later.
@@ -142,15 +142,6 @@ mtcars_spec | |||
cols_condense(mtcars_spec) | |||
``` | |||
|
|||
|
|||
By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not reading the whole vignette at once, just these changed snippets. So you have a better sense of flow and what's currently included / dropped / moved.
But I wonder if, instead of deleting this text entirely, we should update it to make it correct. I.e. still make sure the user knows we default to using 1000 rows, which helps with speed but can lead to incorrect guesses. And that increasing that number might help, but the best solution is to specify the column types.
```{r} | ||
spec(df1) | ||
spec(df2) | ||
In this example, one of the dates is in an incorrect format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you have in mind?
Co-authored-by: Jennifer (Jenny) Bryan <jenny.f.bryan@gmail.com>
@jennybc The file has a column of dates and a column of doubles. One of the dates is formatted as I've made changes to address most of your comments but I still need to look into whether other packages are using |
@jennybc I couldn't find any other packages that use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some inline comments.
High level comment about the new challenge.csv.
Does challenge.csv need to be 2000 lines long anymore?
Originally, I guess it needed to be over 1000 lines long to make a point about column type guessing for the y
column, which was missing for the first 1000 lines and then started to have dates in it. (That still doesn't fully explain why the file was 2000 lines long, but I guess that will remain a mystery.)
What is the purpose of the value
column in the new challenge.csv?
I notice that the original x
column was integers only for the first 1000 lines and then started to have doubles in it. So I think it must be or have been used to demonstrate something about integers vs doubles.
But I can't figure out the purpose of value
in the proposed challenge.csv, which doesn't seem to have this mix of integers and doubles.
Our modern approach to creating datasets in packages is to save the code that generates the dataset below data-raw/
. It helps with exercises like this one, i.e. it shows how to regenerate a dataset and (hopefully) leave some breadcrumbs re: why the data is the way it is.
This is explained in the Data chapter of R Packages and, as also described there, supported by functions like use_data_raw()
.
vignettes/readr.Rmd
Outdated
@@ -127,7 +127,7 @@ parse_number("$1,234") | |||
|
|||
There are two parsers that will never be guessed: `col_skip()` and `col_factor()`. You will always need to supply these explicitly. | |||
|
|||
You can see the specification that readr would generate for a column file by using `spec_csv()`, `spec_tsv()` and so on: | |||
You can see the specification that readr would generate for a file by using `spec_csv()`, `spec_tsv()`, and so on. | |||
|
|||
```{r} | |||
x <- spec_csv(readr_example("challenge.csv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make sense to reveal the spec? I.e. why are we assigning the result to x
?
(I have fetched the PR and rendered the vignette now. I assume you are also looking at the rendered result as you go.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how the example was when I started editing the file and I didn't see a reason to modify it since we aren't doing a total overhaul right now and the goal is to ensure that the problems()
example has actual problems. Also, the behavior of spec_*()
is currently in-flight #1436 so some of the content here might change.
@@ -127,7 +127,7 @@ parse_number("$1,234") | |||
|
|||
There are two parsers that will never be guessed: `col_skip()` and `col_factor()`. You will always need to supply these explicitly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it also true the col_integer()
will not be guessed? I spotted this since we are working here in the immediate neighborhood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think so. Here is where the guessing happens https://github.com/tidyverse/vroom/blob/0c2423eb3bba77b05ef9fda1830f1f798cc2fedd/src/guess_type.cc#L128-L146
@@ -142,15 +142,6 @@ mtcars_spec | |||
cols_condense(mtcars_spec) | |||
``` | |||
|
|||
|
|||
By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rows are not selected in a random fashion. They're basically evenly spaced within the file and the last row is always included for guessing.
vignettes/readr.Rmd
Outdated
@@ -142,14 +142,13 @@ mtcars_spec | |||
cols_condense(mtcars_spec) | |||
``` | |||
|
|||
|
|||
By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows: | |||
When guessing column types, readr selects rows to use for guessing semi-randomly. The default number of rows it selects is 1000. If the column type guess is incorrect, increasing `guess_max` may or may not improve the results. | |||
|
|||
```{r} | |||
x <- spec_csv(readr_example("challenge.csv"), guess_max = 1001) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, if we're trying to make a point about the spec, we should reveal it.
vignettes/readr.Rmd
Outdated
#> x = col_integer(), | ||
#> y = col_character() | ||
#> ) | ||
writeLines(read_lines(readr_example("challenge.csv"), skip = 1325, n_max = 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1325 feels a bit like a Magic Number here. It feels like we should probably get it programmatically, since we do have access to it and explain that we're inspecting the neighborhood right around the problem line.
vignettes/readr.Rmd
Outdated
``` | ||
|
||
You can also access it after the fact using `spec()`: | ||
We can work around this parsing error by importing the date column as character and converting it to a column of dates, after the fact. The lubridate package is a great option that we show here. But since readr doesn't depend on lubridate, we won't execute this code chunk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You actually could make this a static code chunk. That is, set it to eval = FALSE
, but inline the current result. It's not a perfect solution, but the pros would seem to outweigh the cons.
I'm talking about showing folks this:
> df1[1325:1327, ]
# A tibble: 3 × 2
date value
<date> <dbl>
1 2008-03-16 762
2 2008-02-17 206
3 2019-01-28 1333
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reiterate this:
I think you should add some brief comment ... that it's indicative of a general type of workaround, where you import as character and use a more specialized package (or custom function) to parse a tricky column.
vignettes/readr.Rmd
Outdated
```r | ||
read_csv("iris.csv", col_types = list( | ||
Sepal.Length = col_double(), | ||
Sepal.Width = col_double(), | ||
Petal.Length = col_double(), | ||
Petal.Width = col_double(), | ||
Species = col_factor(c("setosa", "versicolor", "virginica")) | ||
)) | ||
``` | ||
```{r, eval = FALSE} | ||
read_csv("iris.csv", col_types = list( | ||
Sepal.Length = col_double(), | ||
Sepal.Width = col_double(), | ||
Petal.Length = col_double(), | ||
Petal.Width = col_double(), | ||
Species = col_factor(c("setosa", "versicolor", "virginica")) | ||
)) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has messed up the intended indenting. Here and below.
Co-authored-by: Jennifer (Jenny) Bryan <jenny.f.bryan@gmail.com>
I created the new challenge.csv by modify pieces of the old challenge.csv so that's one reason why it's also 2000 lines long. It also made it easier to insert a parsing error into the file and makes the data feel more realistic.
The value column isn't being used to teach anything, its purpose is to make the data feel real since data with only one column would be odd. We could make this column also teach some other lesson about
I'll make a note to complete this TODO once we've landed on a dataset |
Given that the only column we're currently using for a concrete demo is a date column, I think we should just leave the legacy challenge.csv alone and create a new example file with exactly what you need. As I understand it, you want a single date column with at least one strategically positioned entry in a nonstandard format. This tricky entry should not cause the column to be guessed as character (so that dictates its position), but it should create a parsing problem. Make it minimal and sufficient to support what you want to show. Store the |
Fixes #1398
Modifies the
challenge.csv
file so that it actually reports some parsing problems. Also, updates some of the wording aroundspec_csv()
andguess_max
This PR doesn't aim to fix all of the issues with the "Getting Started" page and in fact we already have a PR open for that #1364. It only deals with making it better than before and fixing things that directly relate to
challenge.csv
.