Polish chapter 6 #117

Open
wants to merge 1 commit into
base: main
29 changes: 15 additions & 14 deletions args-hidden.Rmd
@@ -6,7 +6,7 @@ source("common.R")

## What's the problem?

Functions are easier to understand if the results depend only on the values of the inputs. If a function returns surprisingly different results with the same inputs, then we say it has __hidden arguments__. Hidden arguments make code harder to reason about, because to correctly predict the output you also need to know some other state.
Functions are easier to understand if the results depend only on the values of the inputs. If a function returns surprisingly different results with the same inputs, then we say it has __hidden arguments__. Hidden arguments make code harder to reason about, because to correctly predict the output you also need to know some other state(s).

Related:

@@ -15,16 +15,17 @@ Related:

## What are some examples?

One common source of hidden arguments is the use of global options. These can be useful to control display but, as discussed in Chapter \@ref(def-user)), should not affect computation:
One common source of hidden arguments is the use of global options. These can be useful to control display but, as discussed in Chapter \@ref(def-user), should not affect computation:

* The result of `data.frame(x = "a")$x` depends on the value of the global
`stringsAsFactors` option: if it's `TRUE` (the default) you get a factor;
if it's false, you get a character vector.
`stringsAsFactors` option: if it's `TRUE` (the default), you get a factor;
if it's `FALSE`, you get a character vector.

* `lm()`'s handling of missing values depends on the global option of
`na.action`. The default is `na.omit` which drops the missing values
prior to fitting the model (which is inconvenient because then the results
of `predict()` don't line up with the input data. `modelr::na.warn()`
of `predict()` don't line up with the input data.
[`modelr::na.warn()`](https://modelr.tidyverse.org/reference/na.warn.html)
provides an approach more in line with other base behaviours: it drops
missing values with a warning.)
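
The `na.action` dependence described above can be seen in a small sketch. The data frame `df` and its values are made up for illustration; only base R functions are used. The same `lm()` call yields predictions of different lengths depending on the global option:

```r
# Hypothetical data with one missing predictor value
df <- data.frame(x = c(1, 2, 3, 4, NA), y = c(2.1, 3.9, 6.2, 8.1, 10))

# With the default option, the incomplete row is silently dropped
old <- options(na.action = "na.omit")
fit_omit <- lm(y ~ x, data = df)
n_omit <- length(predict(fit_omit))  # 4: no longer lines up with nrow(df)

# Same call, different global option: predictions are padded with NA
options(na.action = "na.exclude")
fit_excl <- lm(y ~ x, data = df)
n_excl <- length(predict(fit_excl))  # 5: lines up with nrow(df)

# Restore the previous setting
options(old)
```

Nothing in the `lm()` call itself changes between the two fits, which is what makes the option a hidden argument.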

@@ -33,7 +34,7 @@ Another common source of hidden inputs is the system locale:
* `strptime()` relies on the names of weekdays and months in the current
locale. That means `strptime("1 Jan 2020", "%d %b %Y")` will work on
computers with an English locale, and fail elsewhere. This is particularly
troublesome for Europeans who frequently have colleagues who speak a
troublesome for Europeans who frequently have colleagues speaking a
different language.

* `as.POSIXct()` depends on the current timezone. The following code returns
@@ -43,7 +44,7 @@ Another common source of hidden inputs is the system locale:
as.POSIXct("2020-01-01 09:00")
```

* `toupper()` and `tolower()` depend on the current locale. It is faily
* `toupper()` and `tolower()` depend on the current locale. It is fairly
uncommon for this to cause problems because most languages either
use their own character set, or use the same rules for capitalisation as
English. However, this behaviour did cause a bug in ggplot2 because
@@ -63,7 +64,7 @@ Another common source of hidden inputs is the system locale:
order defined by the current locale. `factor()` uses `order()`, so the
results from factor depend implicitly on the current locale. (This is
not an imaginary problem as this
[SO question](https://stackoverflow.com/questions/39339489)) attests).
[SO question](https://stackoverflow.com/questions/39339489) attests).
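
As a rough illustration of the locale dependence of sorting and `factor()`, the sketch below contrasts the default sort with a locale-independent one. The vector `x` is invented for the example. It relies on the documented behaviour that `method = "radix"` uses C-locale ordering for character vectors:

```r
x <- c("b", "A", "a", "B")

# Default: ordering follows the collation rules of the current locale,
# so the result may differ from one machine to another
locale_sorted <- sort(x)

# method = "radix" uses C-locale ordering for character vectors,
# so this result does not depend on the system locale
c_sorted <- sort(x, method = "radix")
c_sorted
#> [1] "A" "B" "a" "b"

# Passing levels explicitly makes the factor independent of the locale
f <- factor(x, levels = c_sorted)
levels(f)
#> [1] "A" "B" "a" "b"
```

This is one possible way to pin the ordering down, not necessarily the approach the chapter recommends.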

Some functions depend on external settings, but not in a surprising way:

@@ -77,17 +78,17 @@ Some functions depend on external settings, but not in a surprising way:

* Random number generators like `runif()` peek at the value of the special
global variable `.Random.seed`. This is a little surprising, but if they
didn't have some global state every call to `runif()` would return the
didn't have some global state, every call to `runif()` would return the
same value.
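
The global state behind `runif()` can be made visible with `set.seed()`. This is a minimal sketch; the seed value 42 is arbitrary:

```r
set.seed(42)
first <- runif(3)

# The generator state has advanced, so the next draw differs
second <- runif(3)

# Resetting the seed replays the same sequence
set.seed(42)
replay <- runif(3)

identical(first, replay)
#> [1] TRUE
```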

## Why is it important?

Hidden arguments are bad because they make it much harder to predict the output of a fuction. The worst offender by far is the `stringsAsFactors` option which changes how a number of functions (including `data.frame()`, `as.data.frame()`, and `read.csv()`) treat character vectors. This exists mostly for historical reasons, as described in [*stringsAsFactors: An unauthorized biography*](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh)
by Thomas Lumley. )
Hidden arguments are bad because they make it much harder to predict the output of a function. The worst offender by far is the `stringsAsFactors` option which changes how a number of functions (including `data.frame()`, `as.data.frame()`, and `read.csv()`) treat character vectors. This exists mostly for historical reasons, as described in [*stringsAsFactors: An unauthorized biography*](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh)
by Thomas Lumley.

Allowing the system locale to affect the result of a function is a subtle source of bugs when sharing code between people who work in different countries. To be clear, these defaults on rarely cause problems because most languages that share the same writing system share (most of) the same collation rules. The main exceptions tend to be European languages which have varying rules for modified letters, e.g. in Norwegian, å comes at the end of the alphabet. However, when they do cause problems they will take a long time to track down: you're unlikely to expect that the coefficients of a linear model are different[^alpha-contrast] because your code is run in a different country!
Allowing the system locale to affect the result of a function is a subtle source of bugs when sharing code between people who work in different countries. To be clear, this rarely causes problems because most languages that share the same writing system also share (most of) the same collation rules. The main exceptions tend to be European languages which have varying rules for modified letters, e.g. in Norwegian, å comes at the end of the alphabet. However, when they do cause problems they will take a long time to track down: you're unlikely to expect that the coefficients of a linear model are different[^alpha-contrast] because your code is run in a different country!

[^alpha-contrast]: You'll get different coefficients for a categorical predictor if the ordering means that a different levels comes first in the alphabet. The predictions and other diagnostics won't be affected, but you're likely to be surprised that your coefficients are different.
[^alpha-contrast]: You'll get different coefficients for a categorical predictor if the ordering means that a different level comes first in the alphabet. The predictions and other diagnostics won't be affected, but you're likely to be surprised that your coefficients are different.
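
To make the footnote concrete, here is a small sketch with made-up data (the data frame `df` and the column names `g_ab` and `g_ba` are hypothetical). The two factors hold the same values and differ only in level order, standing in for two locales that sort the levels differently:

```r
df <- data.frame(
  g = c("a", "a", "b", "b"),
  y = c(1, 2, 4, 5)
)

# Same data, two level orderings
df$g_ab <- factor(df$g, levels = c("a", "b"))
df$g_ba <- factor(df$g, levels = c("b", "a"))

fit_ab <- lm(y ~ g_ab, data = df)
fit_ba <- lm(y ~ g_ba, data = df)

# The reference level changes, so the coefficients change:
# fit_ab has intercept 1.5 and slope 3; fit_ba has intercept 4.5 and slope -3
coef(fit_ab)
coef(fit_ba)

# The fitted values are the same either way
all.equal(fitted(fit_ab), fitted(fit_ba))
#> [1] TRUE
```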

## How can I remediate the problem?

@@ -104,7 +105,7 @@ as.POSIXct <- function(x, tz = "") {
as.POSIXct("2020-01-01 09:00")
```

The `tz` argument is present, but it's not obvious that `""` means take from the system timezone. Let's first make that explicit:
The `tz` argument is present, but it's not obvious that `""` means to take the system timezone. Let's first make that explicit:

```{r}
as.POSIXct <- function(x, tz = Sys.timezone()) {