Skip to content

ENH: Add CDF and SF of the Wilcoxon distribution#124

Draft
fbourgey wants to merge 3 commits into
scipy:mainfrom
fbourgey:wilcoxon
Draft

ENH: Add CDF and SF of the Wilcoxon distribution#124
fbourgey wants to merge 3 commits into
scipy:mainfrom
fbourgey:wilcoxon

Conversation

@fbourgey
Copy link
Copy Markdown
Member

Reference issue

Toward #98

What does this implement/fix?

  • Implement the CDF and SF of the Wilcoxon signed-rank distribution.
  • Port the exact distribution logic from scipy.stats._wilcoxon.py
  • Precompute a CDF table from the PMF so repeated CDF/SF queries for a given n are efficient.

Comment thread include/xsf/stats.h
Comment on lines +303 to +319
inline std::vector<double> wilcoxon_pmf(int n) {
// PMF of the Wilcoxon signed-rank statistic.
// The returned vector has length n*(n+1)/2 + 1, and the k-th element contains
// the probability of observing a test statistic equal to k for n pairs of
// observations.
std::vector<double> pmf = {1.0};
for (int k = 1; k < n + 1; k++) {
std::vector<double> tmp(k * (k + 1) / 2 + 1, 0.0);
int m = static_cast<int>(pmf.size());
for (int i = 0; i < m; i++) {
tmp[i] += 0.5 * pmf[i];
tmp[i + k] += 0.5 * pmf[i];
}
pmf = std::move(tmp);
}
return pmf;
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread include/xsf/stats.h
Comment on lines +360 to +366
inline double wilcoxon_cdf(double k, const std::vector<double> &cdf_table, int n) {
// CDF of the Wilcoxon signed-rank statistic for real k and n pairs of observations.
if (std::isnan(k) || n < 0) {
return std::numeric_limits<double>::quiet_NaN();
}
return wilcoxon_cdf(static_cast<int>(k), cdf_table, n);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am blind but I do not understand how this function would be called inside SciPy. Should we first generate the CDF table and then call this method? That would be highly unregular pattern for our special functions. I would expect there to be a simple wilcoxon_cdf function. Is this difficult here for performance reasons?

Copy link
Copy Markdown
Member

@steppi steppi May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dschmitz89, I've worked out a pattern where the results of computations which are based on only some of a ufunc's args can be reused across iterations over the other args. See for instance this stateful kernel for for the poisson binomial pmf in SciPy

https://github.com/scipy/scipy/blob/851124cf2893187a07ade72d757938cc145b79fd/scipy/special/_gufuncs.cpp#L72-L107

/* gufunc kernels which use internal caches.
 *
 * Such caches live only during the course of a single call to the ufunc
 * in order to eliminate redundant calculations. This does not concern
 * user-facing caching across ufunc calls.
 *
 * In order for such caches to serve their purpose, the iteration order over the
 * course of the ufunc loops must be chosen so that elements (or core slices) of
 * the inputs that go into the cached computation vary most slowly and elements
 * of the inputs for which the cached result can be reused vary most quickly.
 * This can be accomplished by wrapping the gufunc with `_with_cache_optimization`
 * from ./_ufunc_tools.py. See the docstring of `_with_cache_optimization` for
 * more information.
 */


template <typename T_1d>
struct _poisson_binom_pmf_kernel {
    using T = typename T_1d::value_type;


    std::vector<T> dist;
    T* last_p_ptr = nullptr;
    T operator()(long long int k, T_1d p) {
        if (!last_p_ptr) {
            dist.resize(p.extent(0) + 1);
        }
        if (p.data_handle() != last_p_ptr) {
            /* If p is stepped through in the outer loops and k in the inner loops,
             * this will yield a cache hit for each inner iteration. The cache is
             * overwritten unconditionally whenever this steps to a new slice of p. */
            xsf::poisson_binom_pmf_all(p, _as_mdspan(dist));
            last_p_ptr = p.data_handle();
        }


        return xsf::take_from_pmf(_as_mdspan(dist), k);
    }
};

what @fbourgey has here is not quite right though. To ensure that these xsf kernels will also work in CuPy, they should be using template arguments meant to be instantiated with mdspans, rather than using std::vector. The kernels for generating the pmf and cdf tables should follow the same pattern used for the corresponding Poisson Binom kernels

xsf/include/xsf/stats.h

Lines 538 to 578 in f7b85f5

template <typename InputMat, typename OutputMat>
XSF_HOST_DEVICE inline void poisson_binom_pmf_all(InputMat p, OutputMat res) {
/* Outputs the full pmf of a Poisson-Binomial distribution
*
* p should be an std::mdspan view of a 1d array of length n containing
* the success probabilities for the n Bernoulli trials. res should be
* a std::mdspan view of a 1d array of of length n + 1. It is up to
* the caller to pass valid values for p and res.
*
* Upon completion, res(k) will contain the probability of observing k
* successes for k from 0 to n.
*/
auto n = p.extent(0);
auto out_size = res.extent(0);
if (out_size != n + 1) {
set_error("_poisson_binom_pmf_all", SF_ERROR_MEMORY, "out.shape[-1] must be p.shape[-1] + 1");
return;
}
detail::poisson_binom_pmf_all_impl(p, res);
}
template <typename InputMat, typename OutputMat>
XSF_HOST_DEVICE inline void poisson_binom_cdf_all(InputMat p, OutputMat res) {
/* Outputs the full cdf of a Poisson-Binomial distribution */
using T = typename OutputMat::value_type;
auto n = p.extent(0);
auto out_size = res.extent(0);
if (out_size != n + 1) {
set_error("_poisson_binom_cdf_all", SF_ERROR_MEMORY, "out.shape[-1] must be p.shape[-1] + 1");
return;
}
detail::poisson_binom_pmf_all_impl(p, res);
for (decltype(n) i = 1; i < n; i++) {
res(i) = std::min(res(i) + res(i - 1), T(1.0));
}
res(n) = T(1.0);
}

Also @fbourgey, you should be using xsf::take_from_discrete_cdf and xsf::take_from_pmf

xsf/include/xsf/stats.h

Lines 580 to 601 in f7b85f5

template <typename InputMat>
XSF_HOST_DEVICE inline typename InputMat::value_type take_from_pmf(InputMat pmf, long long int k) {
using T = typename InputMat::value_type;
auto size = pmf.extent(0);
if ((k < 0) || (k >= static_cast<long long int>(size))) {
return T(0.0);
}
return pmf(k);
}
template <typename InputMat>
XSF_HOST_DEVICE inline typename InputMat::value_type take_from_discrete_cdf(InputMat cdf, long long int k) {
using T = typename InputMat::value_type;
auto size = cdf.extent(0);
if (k < 0) {
return T(0.0);
}
if (k >= static_cast<long long int>(size) - 1) {
return T(1.0);
}
return cdf(k);
}

not writing distribution specific functions to do this work.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all. I need to update this PR following what @steppi implemented for poisson_binom_pmf. Will try to do it asap. This PR was submitted before we added this pattern. Changing to draft for now.

@fbourgey fbourgey marked this pull request as draft May 20, 2026 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants