doc: misc updates for missing.md (#42500)
mishmanners authored Oct 20, 2021
1 parent 6d0b192 commit 05515f4
Showing 1 changed file with 75 additions and 48 deletions.
123 changes: 75 additions & 48 deletions doc/src/manual/missing.md
@@ -1,7 +1,7 @@
# [Missing Values](@id missing)

Julia provides support for representing missing values in the statistical sense,
that is for situations where no value is available for a variable in an observation,
Julia provides support for representing missing values in the statistical sense.
This is for situations where no value is available for a variable in an observation,
but a valid value theoretically exists.
Missing values are represented via the [`missing`](@ref) object, which is the
singleton instance of the type [`Missing`](@ref). `missing` is equivalent to
@@ -15,7 +15,7 @@ and behaves like them in most situations.
operators and functions.
For these functions, uncertainty about the value of one of the operands
induces uncertainty about the result. In practice, this means a math operation
involving a `missing` value generally returns `missing`
involving a `missing` value generally returns `missing`:
```jldoctest
julia> missing + 1
missing
@@ -27,14 +27,16 @@ julia> abs(missing)
missing
```

As `missing` is a normal Julia object, this propagation rule only works
Since `missing` is a normal Julia object, this propagation rule only works
for functions which have opted in to implement this behavior. This can be
achieved either via a specific method defined for arguments of type `Missing`,
or simply by accepting arguments of this type, and passing them to functions
which propagate them (like standard math operators). Packages should consider
achieved by:

- adding a specific method defined for arguments of type `Missing`, or
- accepting arguments of this type and passing them to functions
  which propagate them (like standard math operators).

Packages should consider
whether it makes sense to propagate missing values when defining new functions,
and define methods appropriately if that is the case. Passing a `missing` value
to a function for which no method accepting arguments of type `Missing` is defined
and define methods appropriately if this is the case. Passing a `missing` value
to a function which does not have a method accepting arguments of type `Missing`
throws a [`MethodError`](@ref), just like for any other type.
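
The two ways of opting in, and the error raised when a function has not opted in, can be sketched as follows. The function names `celsius_to_kelvin`, `double_plus_one` and `half` are hypothetical and chosen only for illustration; this is a minimal sketch, not code taken from Base.

```julia
# Hypothetical function with an explicit method for arguments of type `Missing`.
celsius_to_kelvin(c::Real) = c + 273.15
celsius_to_kelvin(::Missing) = missing

# Hypothetical function that accepts any argument and relies on `*` and `+`,
# which already propagate `missing`.
double_plus_one(x) = 2x + 1

celsius_to_kelvin(missing)   # missing
double_plus_one(missing)     # missing

# Hypothetical function restricted to `Real`: no method accepts `Missing`,
# so calling it with `missing` throws a `MethodError`.
half(x::Real) = x / 2

try
    half(missing)
catch err
    println(err isa MethodError)   # prints "true"
end
```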

Functions that do not propagate `missing` values can be made to do so by wrapping
@@ -46,7 +48,8 @@ For example, `f(x)` becomes `passmissing(f)(x)`.

Standard equality and comparison operators follow the propagation rule presented
above: if any of the operands is `missing`, the result is `missing`.
Here are a few examples
Here are a few examples:

```jldoctest
julia> missing == 1
missing
@@ -66,9 +69,10 @@ be used to test whether a value is missing. To test whether `x` is `missing`,
use [`ismissing(x)`](@ref).

Special comparison operators [`isequal`](@ref) and [`===`](@ref) are exceptions
to the propagation rule: they always return a `Bool` value, even in the presence
to the propagation rule. They will always return a `Bool` value, even in the presence
of `missing` values, considering `missing` as equal to `missing` and as different
from any other value. They can therefore be used to test whether a value is `missing`
from any other value. They can therefore be used to test whether a value is `missing`:

```jldoctest
julia> missing === 1
false
@@ -85,7 +89,8 @@ true

The [`isless`](@ref) operator is another exception: `missing` is considered
as greater than any other value. This operator is used by [`sort`](@ref),
which therefore places `missing` values after all other values.
which therefore places `missing` values after all other values:

```jldoctest
julia> isless(1, missing)
true
@@ -100,18 +105,19 @@ false
## Logical operators

Logical (or boolean) operators [`|`](@ref), [`&`](@ref) and [`xor`](@ref) are
another special case, as they only propagate `missing` values when it is logically
required. For these operators, whether or not the result is uncertain depends
on the particular operation, following the well-established rules of
another special case since they only propagate `missing` values when it is logically
required. For these operators, whether or not the result is uncertain depends
on the particular operation. This follows the well-established rules of
[*three-valued logic*](https://en.wikipedia.org/wiki/Three-valued_logic) which are
also implemented by `NULL` in SQL and `NA` in R. This abstract definition actually
implemented by e.g. `NULL` in SQL and `NA` in R. This abstract definition
corresponds to a relatively natural behavior which is best explained
via concrete examples.

Let us illustrate this principle with the logical "or" operator [`|`](@ref).
Following the rules of boolean logic, if one of the operands is `true`,
the value of the other operand does not have an influence on the result,
which will always be `true`
which will always be `true`:

```jldoctest
julia> true | true
true
@@ -123,12 +129,13 @@ julia> false | true
true
```

Based on this observation, we can conclude that if one of the operands is `true`
Based on this observation, we can conclude that if one of the operands is `true`
and the other `missing`, we know that the result is `true` in spite of the
uncertainty about the actual value of one of the operands. If we had
been able to observe the actual value of the second operand, it could only be
`true` or `false`, and in both cases the result would be `true`. Therefore,
in this particular case, missingness does *not* propagate
in this particular case, missingness does *not* propagate:

```jldoctest
julia> true | missing
true
@@ -139,7 +146,8 @@ true

On the contrary, if one of the operands is `false`, the result could be either
`true` or `false` depending on the value of the other operand. Therefore,
if that operand is `missing`, the result has to be `missing` too
if that operand is `missing`, the result has to be `missing` too:

```jldoctest
julia> false | true
true
@@ -160,7 +168,8 @@ missing
The behavior of the logical "and" operator [`&`](@ref) is similar to that of the
`|` operator, with the difference that missingness does not propagate when
one of the operands is `false`. For example, when that is the case of the first
operand
operand:

```jldoctest
julia> false & false
false
@@ -173,7 +182,8 @@ false
```

On the other hand, missingness propagates when one of the operands is `true`,
for example the first one
for example the first one:

```jldoctest
julia> true & true
true
@@ -188,16 +198,17 @@ missing
Finally, the "exclusive or" logical operator [`xor`](@ref) always propagates
`missing` values, since both operands always have an effect on the result.
Also note that the negation operator [`!`](@ref) returns `missing` when the
operand is `missing` just like other unary operators.
operand is `missing`, just like other unary operators.
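
As a short illustration of these last two rules, the sketch below evaluates `xor` and `!` on `missing` and then prints all nine operand combinations for `|`, `&` and `xor` side by side. It uses only operators from Base; the printed layout is an arbitrary choice.

```julia
# `xor` propagates `missing` whatever the other operand is.
xor(true, missing)    # missing
xor(false, missing)   # missing

# The negation of an unknown value is unknown.
!missing              # missing

# Compare with `|` and `&`, which propagate `missing` only when logically required.
for a in (true, false, missing), b in (true, false, missing)
    println((a, b), " => |: ", a | b, ", &: ", a & b, ", xor: ", xor(a, b))
end
```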

## Control Flow and Short-Circuiting Operators

Control flow operators including [`if`](@ref), [`while`](@ref) and the
[ternary operator](@ref man-conditional-evaluation) `x ? y : z`
do not allow for missing values. This is because of the uncertainty about whether
the actual value would be `true` or `false` if we could observe it,
which implies that we do not know how the program should behave. A [`TypeError`](@ref)
is thrown as soon as a `missing` value is encountered in this context
the actual value would be `true` or `false` if we could observe it.
This implies that we do not know how the program should behave. A
[`TypeError`](@ref) is thrown as soon as a `missing` value is encountered in this context:

```jldoctest
julia> if missing
println("here")
@@ -208,7 +219,8 @@ ERROR: TypeError: non-boolean (Missing) used in boolean context
For the same reason, contrary to logical operators presented above,
the short-circuiting boolean operators [`&&`](@ref) and [`||`](@ref) do not
allow for `missing` values in situations where the value of the operand
determines whether the next operand is evaluated or not. For example
determines whether the next operand is evaluated or not. For example:

```jldoctest
julia> missing || false
ERROR: TypeError: non-boolean (Missing) used in boolean context
@@ -220,10 +232,11 @@ julia> true && missing && false
ERROR: TypeError: non-boolean (Missing) used in boolean context
```
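
Code that must branch on a value which may be `missing` can avoid this error by first testing it with a function that always returns a `Bool`, such as `ismissing` or `===`. The following is a minimal sketch; the helper name `describe` and the strings it returns are hypothetical.

```julia
# Hypothetical helper: branch safely on a value that may be `missing`.
function describe(flag)
    if ismissing(flag)      # `ismissing` always returns a `Bool`
        return "unknown"
    elseif flag             # safe: `flag` cannot be `missing` here
        return "yes"
    else
        return "no"
    end
end

describe(missing)   # "unknown"
describe(true)      # "yes"

# `===` also always returns a `Bool`, so `missing` takes the second branch here.
x = missing
x === true ? "yes" : "not known to be true"   # "not known to be true"
```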

On the other hand, no error is thrown when the result can be determined without
In contrast, no error is thrown when the result can be determined without
the `missing` values. This is the case when the code short-circuits
before evaluating the `missing` operand, and when the `missing` operand is the
last one
last one:

```jldoctest
julia> true && missing
missing
@@ -234,7 +247,8 @@ false

## Arrays With Missing Values

Arrays containing missing values can be created like other arrays
Arrays containing missing values can be created like other arrays:

```jldoctest
julia> [1, missing]
2-element Vector{Union{Missing, Int64}}:
@@ -243,7 +257,7 @@ julia> [1, missing]
```

As this example shows, the element type of such arrays is `Union{Missing, T}`,
with `T` the type of the non-missing values. This simply reflects the fact that
with `T` the type of the non-missing values. This reflects the fact that
array entries can be either of type `T` (here, `Int64`) or of type `Missing`.
This kind of array uses an efficient memory storage equivalent to an `Array{T}`
holding the actual values combined with an `Array{UInt8}` indicating the type
@@ -252,6 +266,7 @@ of the entry (i.e. whether it is `Missing` or `T`).
Arrays allowing for missing values can be constructed with the standard syntax.
Use `Array{Union{Missing, T}}(missing, dims)` to create arrays filled with
missing values:

```jldoctest
julia> Array{Union{Missing, String}}(missing, 2, 3)
2×3 Matrix{Union{Missing, String}}:
@@ -264,10 +279,12 @@ julia> Array{Union{Missing, String}}(missing, 2, 3)
`missing`, but this is not the correct way to obtain such an array.
Use a `missing` constructor as shown above instead.
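
A typical use of the `missing` constructor is to allocate an array first and fill in the known values afterwards. The sketch below assumes a vector of three integers; the variable name `v` and the values are arbitrary.

```julia
# Allocate a vector that allows missing values, with every entry `missing`.
v = Vector{Union{Missing, Int}}(missing, 3)

# Fill in the values that are known.
v[1] = 10
v[3] = 30

eltype(v)             # Union{Missing, Int64} on a 64-bit system
count(ismissing, v)   # 1
```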

An array allowing for `missing` values but which does not contain any such value
can be converted back to an array which does not allow for missing values using
An array with element type allowing `missing` entries (e.g. `Vector{Union{Missing, T}}`)
which does not contain any `missing` entries can be converted to an array type that does
not allow for `missing` entries (e.g. `Vector{T}`) using
[`convert`](@ref). If the array contains `missing` values, a `MethodError` is thrown
during conversion
during conversion:

```jldoctest
julia> x = Union{Missing, String}["a", "b"]
2-element Vector{Union{Missing, String}}:
@@ -287,23 +304,27 @@ julia> y = Union{Missing, String}[missing, "b"]
julia> convert(Array{String}, y)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type String
```
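
To avoid the `MethodError`, the array can be checked for `missing` entries before it is converted. The following is a minimal sketch: the helper name `narrow_if_complete` is hypothetical, and it assumes that returning the input unchanged is acceptable when conversion is not possible.

```julia
# Hypothetical helper: convert to a non-missing element type only when
# the vector contains no `missing` entries.
function narrow_if_complete(a::AbstractVector)
    any(ismissing, a) && return a                  # conversion would throw
    T = nonmissingtype(eltype(a))                  # e.g. String for Union{Missing, String}
    return convert(Vector{T}, a)
end

x = Union{Missing, String}["a", "b"]
y = Union{Missing, String}[missing, "b"]

typeof(narrow_if_complete(x))   # Vector{String}
typeof(narrow_if_complete(y))   # Vector{Union{Missing, String}}
```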

## Skipping Missing Values

Since `missing` values propagate with standard mathematical operators, reduction
functions return `missing` when called on arrays which contain missing values
functions return `missing` when called on arrays which contain missing values:

```jldoctest
julia> sum([1, missing])
missing
```

In this situation, use the [`skipmissing`](@ref) function to skip missing values
In this situation, use the [`skipmissing`](@ref) function to skip missing values:

```jldoctest
julia> sum(skipmissing([1, missing]))
1
```

This convenience function returns an iterator which filters out `missing` values
efficiently. It can therefore be used with any function which supports iterators
efficiently. It can therefore be used with any function which supports iterators:

```jldoctest skipmissing; setup = :(using Statistics)
julia> x = skipmissing([3, missing, 2, 1])
skipmissing(Union{Missing, Int64}[3, missing, 2, 1])
@@ -320,8 +341,9 @@ julia> mapreduce(sqrt, +, x)

Objects created by calling `skipmissing` on an array can be indexed using indices
from the parent array. Indices corresponding to missing values are not valid for
these objects and an error is thrown when trying to use them (they are also skipped
by `keys` and `eachindex`)
these objects, and an error is thrown when trying to use them (they are also skipped
by `keys` and `eachindex`):

```jldoctest skipmissing
julia> x[1]
3
@@ -332,9 +354,10 @@ ERROR: MissingException: the value at index (2,) is missing
```
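
The skipping behavior of `keys` and `eachindex` can be checked directly. This sketch rebuilds the same `skipmissing` object as above so that it runs on its own.

```julia
x = skipmissing([3, missing, 2, 1])

# Index 2 holds a `missing` value in the parent array, so it is skipped.
collect(eachindex(x))   # [1, 3, 4]
collect(keys(x))        # [1, 3, 4]

# Every index produced this way can be used safely.
for i in eachindex(x)
    println(i, " => ", x[i])
end
```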

This allows functions which operate on indices to work in combination with `skipmissing`.
This is notably the case for search and find functions, which return indices
valid for the object returned by `skipmissing` which are also the indices of the
matching entries *in the parent array*
This is notably the case for search and find functions. These functions return indices
that are valid for the object returned by `skipmissing` and are also the indices of the
matching entries *in the parent array*:

```jldoctest skipmissing
julia> findall(==(1), x)
1-element Vector{Int64}:
@@ -347,7 +370,8 @@ julia> argmax(x)
1
```

Use [`collect`](@ref) to extract non-`missing` values and store them in an array
Use [`collect`](@ref) to extract non-`missing` values and store them in an array:

```jldoctest skipmissing
julia> collect(x)
3-element Vector{Int64}:
@@ -362,9 +386,10 @@ The three-valued logic described above for logical operators is also used
by logical functions applied to arrays. Thus, array equality tests using
the [`==`](@ref) operator return `missing` whenever the result cannot be
determined without knowing the actual value of the `missing` entry. In practice,
this means that `missing` is returned if all non-missing values of the compared
this means `missing` is returned if all non-missing values of the compared
arrays are equal, but one or both arrays contain missing values (possibly at
different positions)
different positions):

```jldoctest
julia> [1, missing] == [2, missing]
false
Expand All @@ -377,7 +402,8 @@ missing
```

As for single values, use [`isequal`](@ref) to treat `missing` values as equal
to other `missing` values but different from non-missing values
to other `missing` values, but different from non-missing values:

```jldoctest
julia> isequal([1, missing], [1, missing])
true
@@ -387,7 +413,8 @@ false
```

Functions [`any`](@ref) and [`all`](@ref) also follow the rules of
three-valued logic, returning `missing` when the result cannot be determined
three-valued logic. Thus, they return `missing` when the result cannot be determined:

```jldoctest
julia> all([true, missing])
missing
