Fix: don't sample elements with weight 0 #983
graidl wants to merge 13 commits into JuliaStats:master from …
Conversation
Good catch. Could you add tests?
Concerning tests: There are already several tests for weighted sampling, and in particular if we keep the latter solution in which just …
src/sampling.jl
Outdated
    1 == firstindex(wv) ||
        throw(ArgumentError("non 1-based arrays are not supported"))
    wsum = sum(wv)
    wsum = foldl(+, wv) # instead of sum(wv) for avoiding numerical discrepancies with cw
sum(wv) simply retrieves the value that is stored in the AbstractWeights object. The problem with this solution is that it would go through the vector one additional time, making the function (more than) twice as slow AFAICT.
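A hedged illustration of this point (not part of the PR; implementation details may vary across StatsBase versions): Weights stores its total at construction time, so sum(wv) is a constant-time lookup, whereas foldl(+, wv) walks the whole vector again.

    using StatsBase

    w = Float32[0.3, 0.3, 0.4]
    wv = Weights(w)        # the total sum(w) is computed once and cached here

    sum(wv)                # constant time: returns the cached total
    foldl(+, wv)           # traverses all weights again, accumulating left to right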
Right! I didn't know that. Then I suggest staying with the first version, and I have now also included a test covering the original issue.
If we decide on the version with an …
Thanks. Unfortunately, according to Codecov the new branch isn't taken.
Indeed Codecov does not take the new branch, thanks for pointing this out. This surprised me, as the branch is definitely taken when running the code in a normal Julia session with default parameters. A more detailed analysis turned up the following: the sum yields 0.66241294 when computed in the test, while 0.662413 is obtained in the REPL, and this makes all the difference! Therefore, I tried to find a seed value that triggers the issue in the test environment and thus covers the new branch. However, despite trying a huge number of possible values, I failed. I also tried to find out why the sum calculation is actually different in the test environment than in the REPL, but failed to find the reason. The parameters with which Julia was started, as obtained by …

I am stuck now. If it really is that important to cover the new branch, I would need specific guidance on how to reproduce the behavior of a normal Julia session in the test environment with respect to the summation of a Float32 vector.
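For reference, a minimal way to check for this kind of summation discrepancy on a given machine (this sketch is not from the PR; whether the two results differ depends on the CPU, SIMD support, and Julia startup options):

    # Compare Julia's built-in sum (which may use pairwise/SIMD summation)
    # against a strictly sequential left-to-right accumulation over Float32.
    w = rand(Float32, 17)
    s_builtin = sum(w)
    s_sequential = foldl(+, w)
    s_builtin == s_sequential || @info "summation order matters" s_builtin s_sequential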
Ah that's weird. FWIW I also get 0.66241294 here. Have you tried running Julia with …?
It seems you are perfectly right! Starting Julia with …

On the other hand, a different, rather new PC with an AMD Ryzen 9 9950X supporting SIMD started without …

So, what shall we do? I guess it is hardly possible to enforce the usage of SIMD - if it is supported at all on the test server used when running the tests?
There are lots of different SIMD instruction sets. The best we can do is prepare tests for the different values you encountered, and maybe print a warning when the sum isn't equal to one of these values so that in the future we are reminded to adjust the test.
Ok, I added a corresponding check and warning as well as a comment explaining the situation.
Thanks. But this still doesn't cover the lines on GitHub CI AFAICT. Would you be able to find values which reproduce the problem there (with 0.66241294 IIUC)?
test/sampling.jl
Outdated
    # Without SIMD support, sum(w) == 0.66241294f0 and this test cannot check the
    # resolution of the issue.
    if sum(w) ∉ (0.662413f0, 0.66241294f0)
        @warn "So far unrecognized value for sum(w) encountered."
Suggested change:

    @warn "So far unrecognized value for sum(w) encountered. " *
        "Please update test so that it continues to cover special code path."
test/runtests.jl
Outdated
| # "weights", | ||
| # "moments", | ||
| # "scalarstats", | ||
| # "deviation", | ||
| # "cov", | ||
| # "counts", | ||
| # "ranking", | ||
| # "empirical", | ||
| # "hist", | ||
| # "rankcorr", | ||
| # "signalcorr", | ||
| # "misc", | ||
| # "pairwise", | ||
| # "reliability", | ||
| # "robust", |
| # "weights", | |
| # "moments", | |
| # "scalarstats", | |
| # "deviation", | |
| # "cov", | |
| # "counts", | |
| # "ranking", | |
| # "empirical", | |
| # "hist", | |
| # "rankcorr", | |
| # "signalcorr", | |
| # "misc", | |
| # "pairwise", | |
| # "reliability", | |
| # "robust", | |
| "weights", | |
| "moments", | |
| "scalarstats", | |
| "deviation", | |
| "cov", | |
| "counts", | |
| "ranking", | |
| "empirical", | |
| "hist", | |
| "rankcorr", | |
| "signalcorr", | |
| "misc", | |
| "pairwise", | |
| "reliability", | |
| "robust", |
No, as mentioned I was not able to find any case where the issue is triggered, and thus the branch handling it executed, in the case without SIMD support, even though I checked a huge number of cases (i.e. seed values). It seems that without SIMD, the issue does not manifest.
Ah got it. That makes sense actually. Have you tried with larger inputs?
Will try with larger inputs! And sorry for accidentally committing runtests.jl.
The current tests seem too complex and too brittle to me. Can't we just add a very simple artificial test for this branch, with a weight vector with an incorrectly large predefined sum (you can provide the sum when you construct the weight vector), such that after summing through all weights we're still below rand() * very_large_incorrect_sum and hit the new branch?
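A hedged sketch of this idea, mirroring the test that was eventually adopted below (the seed and expectation here are illustrative, not authoritative):

    using StatsBase, Random

    # The actual total of the weights is 3, but the stored sum is 10000, so the
    # cumulative sum of the weights almost never reaches rand(rng) * sum(wv)
    # and the new fallback branch is taken.
    wv = Weights([1, 2, 0, 0], 10000)
    rng = MersenneTwister(0)   # any seed with rand(rng) > 3/10000 works
    sample(rng, wv)            # with the fix: 2, the index of the last non-zero weight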
Ah yes good idea!
Makes sense! I adapted the test in this direction. Still, I wanted to keep the scenario that actually triggers the issue when SIMD support is available, and additionally, there is now also a trivial test case with integer weights.
    wsample(a::AbstractArray, w::AbstractVector{<:Real}, dims::Dims;
            replace::Bool=true, ordered::Bool=false) =
        wsample(default_rng(), a, w, dims; replace=replace, ordered=ordered)
    w = Float32[0.0437019, 0.04302464, 0.039748967, 0.040406376, 0.042578973,
        0.040906563, 0.039586294, 0.04302464, 0.042357873, 0.04302464, 0.039262936,
        0.040406376, 0.040406376, 0.041919112, 0.041484896, 0.04057242, 0.0]
    rng = StableRNG(889858990530)
    s = sample(rng, Weights(w, 0.662413f0))
    @test s == length(w) - 1
    @test sample(rng, Weights([1, 2, 0, 0], 10000)) == 2 # another more trivial test
Suggested change:

    # Artificial test with provided sum greater than actual sum
    @test sample(rng, Weights([1, 2, 0, 0], 10000)) == 2
This PR fixes an issue in sample(::AbstractRNG, ::AbstractWeights) as described in #982: when using floating-point weights, the last element could be chosen even if it has weight zero.

This could happen because sum(wv) is in general not exactly the value obtained by sequentially adding up all individual weights in wv, due to numerical imprecision. In this PR this corner case is caught and the index of the last non-zero weight is returned.
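For context, a minimal sketch of the direct weighted-sampling loop and the kind of guard this PR adds. This is not the exact code in src/sampling.jl; sample_index_sketch is a hypothetical name, and 1-based indexing is assumed (as checked by the real function):

    using Random, StatsBase

    function sample_index_sketch(rng::AbstractRNG, wv::AbstractWeights)
        t = rand(rng) * sum(wv)   # sum(wv) is the stored total, which may exceed
        n = length(wv)            # the sequentially accumulated sum for Float32
        i = 1
        cw = wv[1]
        while cw < t && i < n
            i += 1
            cw += wv[i]
        end
        # Guard in the spirit of this PR: never return a zero-weight index;
        # step back to the last non-zero weight if the accumulation fell short.
        while wv[i] == 0 && i > 1
            i -= 1
        end
        return i
    end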