ENH: Add CDF and SF of the Wilcoxon distribution#124
Conversation
| inline std::vector<double> wilcoxon_pmf(int n) { | ||
| // PMF of the Wilcoxon signed-rank statistic. | ||
| // The returned vector has length n*(n+1)/2 + 1, and the k-th element contains | ||
| // the probability of observing a test statistic equal to k for n pairs of | ||
| // observations. | ||
| std::vector<double> pmf = {1.0}; | ||
| for (int k = 1; k < n + 1; k++) { | ||
| std::vector<double> tmp(k * (k + 1) / 2 + 1, 0.0); | ||
| int m = static_cast<int>(pmf.size()); | ||
| for (int i = 0; i < m; i++) { | ||
| tmp[i] += 0.5 * pmf[i]; | ||
| tmp[i + k] += 0.5 * pmf[i]; | ||
| } | ||
| pmf = std::move(tmp); | ||
| } | ||
| return pmf; | ||
| } |
There was a problem hiding this comment.
| inline double wilcoxon_cdf(double k, const std::vector<double> &cdf_table, int n) { | ||
| // CDF of the Wilcoxon signed-rank statistic for real k and n pairs of observations. | ||
| if (std::isnan(k) || n < 0) { | ||
| return std::numeric_limits<double>::quiet_NaN(); | ||
| } | ||
| return wilcoxon_cdf(static_cast<int>(k), cdf_table, n); | ||
| } |
There was a problem hiding this comment.
Maybe I am blind but I do not understand how this function would be called inside SciPy. Should we first generate the CDF table and then call this method? That would be highly unregular pattern for our special functions. I would expect there to be a simple wilcoxon_cdf function. Is this difficult here for performance reasons?
There was a problem hiding this comment.
@dschmitz89, I've worked out a pattern where the results of computations which are based on only some of a ufunc's args can be reused across iterations over the other args. See for instance this stateful kernel for for the poisson binomial pmf in SciPy
/* gufunc kernels which use internal caches.
*
* Such caches live only during the course of a single call to the ufunc
* in order to eliminate redundant calculations. This does not concern
* user-facing caching across ufunc calls.
*
* In order for such caches to serve their purpose, the iteration order over the
* course of the ufunc loops must be chosen so that elements (or core slices) of
* the inputs that go into the cached computation vary most slowly and elements
* of the inputs for which the cached result can be reused vary most quickly.
* This can be accomplished by wrapping the gufunc with `_with_cache_optimization`
* from ./_ufunc_tools.py. See the docstring of `_with_cache_optimization` for
* more information.
*/
template <typename T_1d>
struct _poisson_binom_pmf_kernel {
using T = typename T_1d::value_type;
std::vector<T> dist;
T* last_p_ptr = nullptr;
T operator()(long long int k, T_1d p) {
if (!last_p_ptr) {
dist.resize(p.extent(0) + 1);
}
if (p.data_handle() != last_p_ptr) {
/* If p is stepped through in the outer loops and k in the inner loops,
* this will yield a cache hit for each inner iteration. The cache is
* overwritten unconditionally whenever this steps to a new slice of p. */
xsf::poisson_binom_pmf_all(p, _as_mdspan(dist));
last_p_ptr = p.data_handle();
}
return xsf::take_from_pmf(_as_mdspan(dist), k);
}
};what @fbourgey has here is not quite right though. To ensure that these xsf kernels will also work in CuPy, they should be using template arguments meant to be instantiated with mdspans, rather than using std::vector. The kernels for generating the pmf and cdf tables should follow the same pattern used for the corresponding Poisson Binom kernels
Lines 538 to 578 in f7b85f5
Also @fbourgey, you should be using xsf::take_from_discrete_cdf and xsf::take_from_pmf
Lines 580 to 601 in f7b85f5
not writing distribution specific functions to do this work.
There was a problem hiding this comment.
Thanks all. I need to update this PR following what @steppi implemented for poisson_binom_pmf. Will try to do it asap. This PR was submitted before we added this pattern. Changing to draft for now.
Reference issue
Toward #98
What does this implement/fix?
nare efficient.