Merge branch 'StatsBase2021' into nl/weightedstats
nalimilan committed Sep 25, 2021
2 parents 850d3e6 + 1e5d2a8 commit 29b230f
Showing 46 changed files with 3,211 additions and 1,106 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/CompatHelper.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
name: CompatHelper
on:
  schedule:
    - cron: 0 0 * * *
  workflow_dispatch:
jobs:
  CompatHelper:
    runs-on: ubuntu-latest
    steps:
      - name: "Install CompatHelper"
        run: |
          import Pkg
          name = "CompatHelper"
          uuid = "aa819f21-2bde-4658-8897-bab36330d9b7"
          version = "2"
          Pkg.add(; name, uuid, version)
        shell: julia --color=yes {0}
      - name: "Run CompatHelper"
        run: |
          import CompatHelper
          CompatHelper.main()
        shell: julia --color=yes {0}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          COMPATHELPER_PRIV: ${{ secrets.DOCUMENTER_KEY }}
          # COMPATHELPER_PRIV: ${{ secrets.COMPATHELPER_PRIV }}
11 changes: 11 additions & 0 deletions .github/workflows/TagBot.yml
@@ -0,0 +1,11 @@
name: TagBot
on:
  schedule:
    - cron: 0 * * * *
jobs:
  TagBot:
    runs-on: ubuntu-latest
    steps:
      - uses: JuliaRegistries/TagBot@v1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
11 changes: 2 additions & 9 deletions .github/workflows/ci.yml
@@ -52,15 +52,8 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: julia-actions/setup-julia@v1
        with:
          version: '1'
      - run: |
          julia --project=docs -e '
            using Pkg
            Pkg.develop(PackageSpec(path=pwd()))
            Pkg.instantiate()'
      - run: julia --project=docs docs/make.jl
      - uses: julia-actions/julia-buildpkg@latest
      - uses: julia-actions/julia-docdeploy@latest
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
5 changes: 3 additions & 2 deletions Project.toml
@@ -1,14 +1,15 @@
name = "Statistics"
uuid = "20745b16-79ce-11e8-11f9-7d13ad32a3b2"
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[deps]
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[extras]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Random", "Test"]
test = ["Dates", "Random", "Test"]
26 changes: 11 additions & 15 deletions README.md
@@ -1,20 +1,16 @@
## StatsBase.jl
# Statistics.jl

*StatsBase.jl* is a Julia package that provides basic support for statistics. Particularly, it implements a variety of statistics-related functions, such as scalar statistics, high-order moment computation, counting, ranking, covariances, sampling, and empirical density estimation.
[![Build status](https://github.com/JuliaLang/Statistics.jl/workflows/CI/badge.svg)](https://github.com/JuliaLang/Statistics.jl/actions?query=workflow%3ACI+branch%3Amaster)

- **Current Release**:
[![StatsBase](http://pkg.julialang.org/badges/StatsBase_0.5.svg)](http://pkg.julialang.org/?pkg=StatsBase)
[![StatsBase](http://pkg.julialang.org/badges/StatsBase_0.6.svg)](http://pkg.julialang.org/?pkg=StatsBase)
- **Build & Testing Status:**
[![Build Status](https://travis-ci.org/JuliaStats/StatsBase.jl.svg?branch=master)](https://travis-ci.org/JuliaStats/StatsBase.jl)
[![Build status](https://ci.appveyor.com/api/projects/status/fsut3j3onulvws1w?svg=true)](https://ci.appveyor.com/project/nalimilan/statsbase-jl)
[![Coverage Status](https://coveralls.io/repos/JuliaStats/StatsBase.jl/badge.svg?branch=master)](https://coveralls.io/r/JuliaStats/StatsBase.jl?branch=master)
[![Coverage Status](http://codecov.io/github/JuliaStats/StatsBase.jl/coverage.svg?branch=master)](http://codecov.io/github/JuliaStats/StatsBase.jl?branch=master)
Development repository for the Statistics standard library (stdlib) that ships with Julia.

- **Documentation**: [![][docs-stable-img]][docs-stable-url] [![][docs-latest-img]][docs-latest-url]
#### Using the development version of Statistics.jl

[docs-latest-img]: https://img.shields.io/badge/docs-latest-blue.svg
[docs-latest-url]: http://JuliaStats.github.io/StatsBase.jl/latest/
To develop this package, follow these steps:
- Clone the repo anywhere.
- In line 2 of the `Project.toml` file (the line that begins with `uuid = ...`), modify the UUID, e.g. change the `107` to `207`.
- Change the current directory to the Statistics repo you just cloned and start Julia with `julia --project`.
- `import Statistics` will now load the files in the cloned repo instead of the Statistics stdlib.
- To test your changes, run `include("test/runtests.jl")`.

[docs-stable-img]: https://img.shields.io/badge/docs-stable-blue.svg
[docs-stable-url]: http://JuliaStats.github.io/StatsBase.jl/stable/
If you need to build Julia from source with a git checkout of Statistics, use `make DEPS_GIT=Statistics` when building Julia instead. The `Statistics` repo is then located in `stdlib/Statistics` and is initially created with a detached `HEAD`. If you're doing this from a pre-existing Julia repository, you may need to run `make clean` beforehand.
6 changes: 3 additions & 3 deletions docs/src/empirical.md
@@ -2,9 +2,9 @@

## Histograms

The `Histogram` type represents data that has been tabulated into intervals
(known as *bins*) along the real line, or in higher dimensions, over the real
plane.
```@docs
Histogram
```

Histograms can be fitted to data using the `fit` method.

2 changes: 1 addition & 1 deletion docs/src/index.md
@@ -13,4 +13,4 @@ corrections where necessary.
Pages = ["weights.md", "scalarstats.md", "cov.md", "robust.md", "ranking.jl",
"empirical.md"]
Depth = 2
```
```
2 changes: 1 addition & 1 deletion docs/src/scalarstats.md
@@ -71,4 +71,4 @@ modes

```@docs
describe
```
```
83 changes: 81 additions & 2 deletions docs/src/weights.md
@@ -64,15 +64,91 @@ w = ProbabilityWeights([0.2, 0.1, 0.3])
w = pweights([0.2, 0.1, 0.3])
```

### `UnitWeights`

Unit weights are a special case in which all observations are given a weight equal to `1`. Using such weights is equivalent to computing unweighted statistics.

This type can notably be used when implementing an algorithm so that only a weighted variant has to be written. The unweighted variant is then obtained by passing a `UnitWeights` object. This is very efficient since no weights vector is actually allocated.

```julia
w = uweights(3)
w = uweights(Float64, 3)
```
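
As a rough illustration of that design, the sketch below writes a weighted mean once and recovers the unweighted result by passing a constant-one vector that stores only its length. `OnesVector` and `weighted_mean` are hypothetical names made up for this example; they are not the package's `UnitWeights` implementation.

```julia
# Hypothetical stand-in for a unit-weights vector: it stores only its length,
# so no array of ones is ever allocated.
struct OnesVector <: AbstractVector{Int}
    len::Int
end
Base.size(w::OnesVector) = (w.len,)
Base.getindex(w::OnesVector, i::Int) = (checkbounds(w, i); 1)

# A weighted mean written once, for any weights vector.
function weighted_mean(x::AbstractVector, w::AbstractVector)
    length(x) == length(w) || throw(DimensionMismatch("Inconsistent array dimension."))
    return sum(xi * wi for (xi, wi) in zip(x, w)) / sum(w)
end

x = [1.0, 2.0, 6.0]
unweighted = weighted_mean(x, OnesVector(3))   # 3.0, same as the plain mean
weighted   = weighted_mean(x, [1, 1, 2])       # 3.75
```

Dispatching on the weights type, as the package does for `UnitWeights`, would additionally let the unweighted case skip the multiplications altogether.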

### `Weights`

The `Weights` type describes a generic weights vector which does not support all operations possible for `FrequencyWeights`, `AnalyticWeights` and `ProbabilityWeights`.
The `Weights` type describes a generic weights vector which does not support all operations possible for `FrequencyWeights`, `AnalyticWeights`, `ProbabilityWeights` and `UnitWeights`.

```julia
w = Weights([1., 2., 3.])
w = weights([1., 2., 3.])
```

### Exponential weights: `eweights`

Exponential weights are a common form of temporal weights which assign exponentially decreasing
weights to past observations.

If `t` is a vector of temporal indices then for each index `i` we compute the weight as:

``λ (1 - λ)^{1 - i}``

``λ`` is a smoothing factor or rate parameter such that ``0 < λ ≤ 1``.
As this value approaches 0, the resulting weights will be almost equal,
while values closer to 1 will put greater weight on the tail elements of the vector.

For example, the following call generates exponential weights for ten observations with ``λ = 0.3``.
```julia-repl
julia> eweights(1:10, 0.3)
10-element Weights{Float64,Float64,Array{Float64,1}}:
0.3
0.42857142857142855
0.6122448979591837
0.8746355685131197
1.249479383590171
1.7849705479859588
2.549957925694227
3.642797036706039
5.203995766722913
7.434279666747019
```
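
The values above can be checked directly against the formula. The following is a minimal sketch in plain Julia, with no package required; `exp_weights` is a hypothetical helper written for this example, not the package's `eweights`.

```julia
# Hypothetical helper: the weight for temporal index i is λ * (1 - λ)^(1 - i).
function exp_weights(n::Integer, λ::Real)
    0 < λ <= 1 || throw(ArgumentError("λ must satisfy 0 < λ ≤ 1"))
    return [λ * (1 - λ)^(1 - i) for i in 1:n]
end

w = exp_weights(10, 0.3)
first_weight = w[1]       # 0.3, since the exponent is zero for i = 1
ratio = w[2] / w[1]       # ≈ 1 / 0.7: each step multiplies the weight by 1 / (1 - λ)
```

Because consecutive weights differ by the constant factor `1 / (1 - λ)`, a larger `λ` concentrates more of the total weight on the last observations.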

Simply passing the number of observations `n` is equivalent to passing in `1:n`.

```julia-repl
julia> eweights(10, 0.3)
10-element Weights{Float64,Float64,Array{Float64,1}}:
0.3
0.42857142857142855
0.6122448979591837
0.8746355685131197
1.249479383590171
1.7849705479859588
2.549957925694227
3.642797036706039
5.203995766722913
7.434279666747019
```

Finally, you can construct exponential weights from an arbitrary subset of timestamps within a larger range.

```julia-repl
julia> t
2019-01-01T01:00:00:2 hours:2019-01-01T05:00:00
julia> r
2019-01-01T01:00:00:1 hour:2019-01-02T01:00:00
julia> eweights(t, r, 0.3)
3-element Weights{Float64,Float64,Array{Float64,1}}:
0.3
0.6122448979591837
1.249479383590171
```

NOTE: This is equivalent to `eweights(something.(indexin(t, r)), 0.3)`: for each value in `t`, `indexin` returns the index of that value in `r`.
Since `indexin` returns `nothing` when a value from `t` has no match in `r`, the result is wrapped in `something`, which throws an error rather than letting a `nothing` through.
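
The index lookup can be reproduced with the `Dates` standard library alone. This is a sketch under the assumption that `t` and `r` are the `DateTime` ranges printed above; the weights are computed by hand from the formula rather than by calling `eweights`.

```julia
using Dates

# Ranges assumed to match the `t` and `r` shown in the REPL example.
t = DateTime(2019, 1, 1, 1):Hour(2):DateTime(2019, 1, 1, 5)
r = DateTime(2019, 1, 1, 1):Hour(1):DateTime(2019, 1, 2, 1)

# Position of each element of t within r; `something` throws if any entry is `nothing`.
idx = something.(indexin(t, r))   # [1, 3, 5]

λ = 0.3
w = [λ * (1 - λ)^(1 - i) for i in idx]
```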

## Methods

`AbstractWeights` implements the following methods:
@@ -90,9 +166,12 @@ AbstractWeights
AnalyticWeights
FrequencyWeights
ProbabilityWeights
UnitWeights
Weights
aweights
fweights
pweights
eweights
uweights
weights
```
```
7 changes: 5 additions & 2 deletions perf/sampling.jl
@@ -6,7 +6,7 @@ using StatsBase

import StatsBase: direct_sample!, xmultinom_sample!
import StatsBase: knuths_sample!, fisher_yates_sample!, self_avoid_sample!
import StatsBase: seqsample_a!, seqsample_c!
import StatsBase: seqsample_a!, seqsample_c!, seqsample_d!

### generic sampling benchmarking

@@ -42,6 +42,9 @@ tsample!(s::Seq_A, a, x) = seqsample_a!(a, x)
mutable struct Seq_C <: NoRep end
tsample!(s::Seq_C, a, x) = seqsample_c!(a, x)

mutable struct Seq_D <: NoRep end
tsample!(s::Seq_D, a, x) = seqsample_d!(a, x)

mutable struct Sample_NoRep <: NoRep end
tsample!(s::Sample_NoRep, a, x) = sample!(a, x; replace=false, ordered=false)

@@ -87,6 +90,7 @@ const procs2 = Proc[ SampleProc{Knuths}(),
SampleProc{Sample_NoRep}(),
SampleProc{Seq_A}(),
SampleProc{Seq_C}(),
SampleProc{Seq_D}(),
SampleProc{Sample_NoRep_Ord}() ]

const cfgs2 = (Int, Int)[]
@@ -110,4 +114,3 @@ println("Sampling Without Replacement")
println("===================================")
show(rtable2; unit=:mps, cfghead="(n, k)")
println()

98 changes: 24 additions & 74 deletions src/Statistics.jl
@@ -18,8 +18,8 @@ export std, stdm, var, varm, mean!, mean,
# moments.jl
skewness, kurtosis,
# weights.jl
AbstractWeights, Weights, AnalyticWeights, FrequencyWeights, ProbabilityWeights,
weights, aweights, fweights, pweights,
AbstractWeights, Weights, AnalyticWeights, FrequencyWeights, ProbabilityWeights, UnitWeights,
weights, aweights, eweights, fweights, pweights, uweights,
# scalarstats.jl
geomean, harmmean, genmean, mode, modes, percentile, span, variation, sem, mad, mad!,
iqr, genvar, totalvar, entropy, renyientropy, crossentropy, kldivergence, describe,
@@ -264,6 +264,16 @@ _mean(::typeof(identity), A::AbstractArray, dims::Colon, w::AbstractArray) =
_mean(::typeof(identity), A::AbstractArray, dims, w::AbstractArray) =
_mean!(Base.reducedim_init(t -> (t*zero(eltype(w)))/2, Base.add_sum, A, dims), A, w)

function _mean(::typeof(identity), A::AbstractArray, dims, w::UnitWeights)
    size(A, dims) != length(w) && throw(DimensionMismatch("Inconsistent array dimension."))
    return mean(A, dims=dims)
end

function _mean(::typeof(identity), A::AbstractArray, dims::Colon, w::UnitWeights)
    length(A) != length(w) && throw(DimensionMismatch("Inconsistent array dimension."))
    return mean(A)
end

##### variances #####

# faster computation of real(conj(x)*y)
@@ -451,78 +461,6 @@ function _varm(A::AbstractArray{T}, m, corrected::Bool, dims::Colon,
varcorrection(w, corrected) * s
end

"""
    varcorrection(n::Integer, corrected=false)
Compute a bias correction factor for calculating `var`, `std` and `cov` with
`n` observations. Returns ``\\frac{1}{n - 1}`` when `corrected=true`
(i.e. [Bessel's correction](https://en.wikipedia.org/wiki/Bessel's_correction)),
otherwise returns ``\\frac{1}{n}`` (i.e. no correction).
"""
@inline varcorrection(n::Integer, corrected::Bool=false) = 1 / (n - Int(corrected))

"""
    varcorrection(w::Weights, corrected=false)
Returns ``\\frac{1}{\\sum w}`` when `corrected=false` and throws an `ArgumentError`
if `corrected=true`.
"""
@inline function varcorrection(w::Weights, corrected::Bool=false)
    corrected && throw(ArgumentError("Weights type does not support bias correction: " *
                                     "use FrequencyWeights, AnalyticWeights or ProbabilityWeights if applicable."))
    1 / w.sum
end

"""
    varcorrection(w::AnalyticWeights, corrected=false)
* `corrected=true`: ``\\frac{1}{\\sum w - \\sum {w^2} / \\sum w}``
* `corrected=false`: ``\\frac{1}{\\sum w}``
"""
@inline function varcorrection(w::AnalyticWeights, corrected::Bool=false)
    s = w.sum

    if corrected
        sum_sn = sum(x -> (x / s) ^ 2, w)
        1 / (s * (1 - sum_sn))
    else
        1 / s
    end
end

"""
    varcorrection(w::FrequencyWeights, corrected=false)
* `corrected=true`: ``\\frac{1}{\\sum{w} - 1}``
* `corrected=false`: ``\\frac{1}{\\sum w}``
"""
@inline function varcorrection(w::FrequencyWeights, corrected::Bool=false)
    s = w.sum

    if corrected
        1 / (s - 1)
    else
        1 / s
    end
end

"""
    varcorrection(w::ProbabilityWeights, corrected=false)
* `corrected=true`: ``\\frac{n}{(n - 1) \\sum w}`` where ``n`` equals `count(!iszero, w)`
* `corrected=false`: ``\\frac{1}{\\sum w}``
"""
@inline function varcorrection(w::ProbabilityWeights, corrected::Bool=false)
    s = w.sum

    if corrected
        n = count(!iszero, w)
        n / (s * (n - 1))
    else
        1 / s
    end
end

"""
var(itr; corrected::Bool=true, [weights::AbstractWeights], mean=nothing[, dims])
@@ -1425,6 +1363,18 @@ function _quantile(v::AbstractArray{V}, p, sorted::Bool, alpha::Real, beta::Real
return out
end

function _quantile(v::AbstractArray, p, sorted::Bool,
                   alpha::Real, beta::Real, w::UnitWeights)
    length(v) != length(w) && throw(DimensionMismatch("Inconsistent array dimension."))
    return quantile(v, p)
end

function _quantile(v::AbstractArray, p::Real, sorted::Bool,
                   alpha::Real, beta::Real, w::UnitWeights)
    length(v) != length(w) && throw(DimensionMismatch("Inconsistent array dimension."))
    return quantile(v, p)
end

_quantile(v::AbstractArray, p::Real, sorted::Bool, alpha::Real, beta::Real,
          w::AbstractArray) =
    _quantile(v, [p], sorted, alpha, beta, w)[1]