-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Import weighted stats and moments from StatsBase to Statistics #31395
Conversation
👍 to dumping them.
Unless we're going to support arbitrary moments (or just 5th or higher), I'd say leave it out.
👍 to supporting it. |
Out of curiosity, what's the advantage with having this in Statistics? I only ask because there are still some open issues with the weights type in StatsBase that we might want to address (e.g., n-dimensional weights, mutability of weights). |
One big one is better APIs, for example |
Oh, yeah, I guess that makes sense. Looks like github also supports transferring issues to another repo https://help.github.com/en/articles/transferring-an-issue-to-another-repository. I would prefer that we try to retain the history for the files that we're actively copying over. |
Another advantage is to eventually remove StatsBase, whose name is confusing since it's not more "base" than Statistics (quite the contrary actually).
I'm not sure that's really possible, as the code cannot be imported as-is and pass tests. So we would have to keep broken commits in history, which isn't great for bisecting. We can still refer to the StatsBase repo for history, though. |
This includes methods for mean, quantile, median, var, std, cov and cor, plus new functions skewness and kurtosis, and weight types. Code is copied from StatsBase with some cleanup where needed, in particular for dispatch, to move from `@nloops`/`@nrefs` to cartesian indexing and to be closer to the mapreducedim code. Weights are now passed via a keyword argument rather than by dispatching on AbstractWeights, so as to support any array where all weights types give the same result.
655603d
to
1f7b3d9
Compare
I've implemented I've also cleaned a few things and checked that performance hasn't regressed, so I think the PR is now ready for a serious review. This is a large piece of code, and the reductions are particularly tricky, so double-checking would really be appreciated. |
isempty(r) && return oftype((first(r) + last(r)) / 2, NaN) | ||
(first(r) + last(r)) / 2 | ||
end | ||
|
||
median(r::AbstractRange{<:Real}) = mean(r) | ||
_mean(A::AbstractArray, dims, weights::Nothing) = | ||
_mean!(Base.reducedim_init(t -> t/2, Base.add_sum, A, dims), A, nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code used +
instead of add_sum
, but I figured it would be more consistent with sum
. In practice I don't think it makes a difference for standard types because of the /2
which changes the type to floating point.
The Statistics module contains basic statistics functionality. | ||
The Statistics module contains basic statistics functionality: mean, median, quantiles, | ||
standard deviation, variance, skewness, kurtosis, correlation and covariance. | ||
Statistics can be weighted, and several weights types are distinguished to apply appropriate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should read "several weight types" instead of "several weights types". If you're referring to the actual type then " several AbstractWeights
types" maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "weight type"? The plural sounds more appropriate since weights only make sense as a series of values. Though "types of weights" might be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just sounded weird to me. I think "types of weights" is what sounds best.
|
||
""" | ||
var(itr; dims, corrected::Bool=true, mean=nothing) | ||
var(itr; corrected::Bool=true, [weights::AbstractWeights], [mean], [dims]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should support weights::AbstractArray
for consistency with other methods like mean
? We could always convert non-weight arrays with Weights(weights)
. Same applies below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we can't support arbitrary vectors since there's no varcorrection
for them. Weights
isn't supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but you should still be able to do var(itr; weights=..., corrected=false)
.
varcorrection(w::AnalyticWeights, corrected=false) | ||
|
||
* `corrected=true`: ``\\frac{1}{\\sum w - \\sum {w^2} / \\sum w}`` | ||
* `corrected=false`: ``\\frac{1}{\\sum w}`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to include links here as well (e.g., https://en.wikipedia.org/wiki/Weighted_arithmetic_mean)
|
||
wsum = sum(w) | ||
wsum == 0 && throw(ArgumentError("weight vector cannot sum to zero")) | ||
length(v) == length(w) || throw(ArgumentError("data and weight vectors must be the same size," * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need to split it over multiple lines then you might want to just use an if
?
x < 0 && throw(ArgumentError("weight vector cannot contain negative entries")) | ||
end | ||
|
||
isa(w, FrequencyWeights) && !(eltype(w) <: Integer) && any(!isinteger, w) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, maybe just use an if
block.
end | ||
|
||
Base.isequal(x::AbstractWeights, y::AbstractWeights) = false | ||
Base.:(==)(x::AbstractWeights, y::AbstractWeights) = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a newline at end of file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that's really possible, as the code cannot be imported as-is and pass tests. So we would have to keep broken commits in history, which isn't great for bisecting. We can still refer to the StatsBase repo for history, though.
How do you propose we refer to the StatsBase history? FWIW, I'd prefer that we retain history for this data type as a lot of changes/decisions were made while they were being developed. It is also helpful for github's suggested reviewers feature. Finally, I think this would be better to add after the stdlibs have been moved into separate packages as julia can just pin the Statistics version to a particular release, but folks can easily free/update to these new features regardless of which julia version they're using. It'll also be easier add features and bug fixes for these types without depending on the next julia release.
As I said, I don't know whether that's possible, and I'm not aware of a precedent. Do you have ideas?
What would be the advantage of waiting? We can keep exporting the types from StatsBase (and just re-exporting when Statistics provides them), so that it keeps working as it does on all Julia versions. Then once stdlibs can be versioned, people will be able to start depending only on a given Statistics version, and drop the StatsBase dependency. But waiting blocks any progress on the StatsBase front (this PR is just a small part of the needed work). |
cm3 = cm2 * z # empirical 3rd centered moment | ||
n = 1 | ||
y = iterate(x, s) | ||
while y !== nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this kind of loop is slower than what can be achieved for AbstractArray
using @inbounds @simd for i in eachindex(A)
. The only solution AFAICT is to add a special method for AbstractArray
-- but better do that in another PR as this one is quite large already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason that you can't use @inbounds @simd for i in eachindex(A)
because you're starting on the 2nd iterate in this loop?
|
||
!!! note | ||
- The weight vector is a light-weight wrapper of the input vector. | ||
The input vector is NOT copied during construction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input vector is NOT copied during construction. | |
The input vector is *not* copied during construction. |
# Return the NaN of the type that we would get, had this collection | ||
# contained any elements (this is consistent with var) | ||
z0 = zero(T) - zero(m) | ||
return (z0^3 + z0^3)/sqrt((z0^2+z0^2)^3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of initialization is probably overzealous, but I always find it hard to decide what simplifications are OK.
check_reducedims(R,A) | ||
reddims = size(R) .!= size(A) | ||
dim = something(findfirst(reddims), ndims(R)+1) | ||
if dim > N |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is quite ugly but I'm not sure what's the best solution. For unweighted sum, reducing over dim > N
is a no-op, so that's easy, but for the weighted sum it amounts to multiplying values by their corresponding weight. Maybe this should just be an error?
|
||
""" | ||
var(itr; dims, corrected::Bool=true, mean=nothing) | ||
var(itr; corrected::Bool=true, [weights::AbstractWeights], [mean], [dims]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we can't support arbitrary vectors since there's no varcorrection
for them. Weights
isn't supported.
The Statistics module contains basic statistics functionality. | ||
The Statistics module contains basic statistics functionality: mean, median, quantiles, | ||
standard deviation, variance, skewness, kurtosis, correlation and covariance. | ||
Statistics can be weighted, and several weights types are distinguished to apply appropriate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "weight type"? The plural sounds more appropriate since weights only make sense as a series of values. Though "types of weights" might be better.
Yes, we do this a reasonable amount at work.
NOTE: The reason I'd prefer to wait till Statistics is a separate repo is that it'll make the history cleaner because the
|
Let's continue this at JuliaStats/Statistics.jl#2. |
You can pull the history in and have it match up by using |
Can someone give me an update on what work needs to be done on this? Is it all waiting on https://github.com/JuliaLang/Statistics.jl/pull/2? Or are there design decisions that need to be made still? |
Continuation is at JuliaLang/Statistics.jl#2. |
This includes methods for
mean
,quantile
,median
,var
,std
,cov
andcor
, plus new functionsskewness
andkurtosis
, and weight types. Code is copied from StatsBase with some cleanup where needed, in particular for dispatch, to move from@nloops
/@nrefs
to cartesian indexing and tobe closer to the
mapreducedim
code. Weights are now passed via a keyword argument rather than by dispatching onAbstractWeights
, so as to support any array where all weights types give the same result.Still to address:
mean_and_var
andmean_and_std
? It's not too hard to dom = mean(...); s = std(..., mean=m)
manually. Also the names aren't great. UPDATE: dropped.moment
? It feels redundant withmean
,var
,skewness
andkurtosis
. UPDATE: dropped.sum
? For now I've keptwsum
internal (it is used bymean
), it can always be implemented later in Base. UPDATE: added.@nloop
/@nref
to cartesian indexing. UPDATE: performance is similar, and even sometimes faster.AbstractWeights
weights are supported).Closes https://github.com/JuliaLang/julia/issues/29974. See #27152 (comment) for an outline of a broader roadmap.