WIP: more accurate KSDist function #254

bicycle1885 · 2014-06-28T16:13:15Z

This is a work-in-progress implementation of the Simard and L'Ecuyer (2011) meta-algorithm: http://www.jstatsoft.org/v39/i11/paper.
The current commit contains debug code and duplicated functions in order to check the differences between the current and the proposed implementation.

I've used KolmogorovSmirnovDist.c as a reference, which is an implementation of the meta-algorithm. http://simul.iro.umontreal.ca/ksdir/
For some inputs, there seem to be serious discrepancies between the current implementation and the reference.
For example, when n = 5 and x = 2.256281e-01, the CDFs are:

reference	current	proposed
8.852118e-02	9.711784e-02	9.711784e-02

In this case the value is computed by the cdf_durbin function, which the proposed implementation reuses.

As another example, when n = 1000000 and x = -3.366834e-02, the CDFs are completely different:

reference	current	proposed
1.000000e+00	0.000000e+00	1.000000e+00

My questions are:

1. In the first example, is the current implementation wrong?
2. In the second example, returning zero looks sensible; am I right?
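
On the second question: the KS statistic D_n always lies in [1/(2n), 1], so its CDF is exactly 0 for x <= 1/(2n) (in particular for any negative x) and exactly 1 for x >= 1. Simard and L'Ecuyer (2011) also list closed forms near both ends of the support. The sketch below encodes only those cases. The name ks_cdf_closed_form is hypothetical; it is not part of Distributions.jl or of this branch, and it returns nothing in the interior region, where a real algorithm (Durbin, Pomeranz, ...) is needed.

```julia
# Hypothetical helper, not part of Distributions.jl: closed-form values of
# P(D_n <= x) in the regions where no recursion is needed.
# Returns `nothing` when x falls in the interior region.
function ks_cdf_closed_form(n::Integer, x::Real)
    n >= 1 || throw(ArgumentError("n must be positive"))
    if x <= 1 / (2n)
        # D_n can never be smaller than 1/(2n), so negative x gives 0.
        return 0.0
    elseif x >= 1
        # D_n can never exceed 1.
        return 1.0
    elseif x <= 1 / n
        # n! * (2x - 1/n)^n, accumulated factor by factor so that
        # neither n! nor the power overflows or underflows early.
        t = 2x - 1 / n
        p = 1.0
        for k in 1:n
            p *= k * t
        end
        return p
    elseif x >= 1 - 1 / n
        # Upper-tail closed form: 1 - 2 * (1 - x)^n.
        return 1 - 2 * (1 - x)^n
    else
        return nothing
    end
end
```

With the inputs from the second example (n = 1000000, x = -3.366834e-02) this returns 0.0. For n = 5 and x = 2.256281e-01 it returns nothing, because that point lies in the interior and so the first question cannot be settled this way.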

Here is the complete log of comparison: out.txt
(CAUTION - it's 696KB)

Here is a quick way to check and reproduce it:

# check out my branch
curl -O -O http://simul.iro.umontreal.ca/ksdir/KolmogorovSmirnovDist.{c,h}
clang -fPIC -shared -dynamiclib KolmogorovSmirnovDist.c -o libksd.dylib
julia check.jl > out.txt

check.jl:

using Distributions

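# Thin wrapper around KScdf(int n, double x) from the reference C library.
# The C `int` argument is declared as Cint rather than Int (which is 64-bit on most platforms).
function _cdf(d, x)
    ccall((:KScdf, "libksd"), Cdouble, (Cint, Cdouble), d.n, x)
end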

begin
    ns = vcat([1:50], [100, 200, 500, 1000, 2000, 10_000, 50_000, 100_000, 500_000, 1_000_000])
    xs = linspace(-0.1, 1.1, 200)
    n_all = 0
    n_eq1 = 0
    n_eq2 = 0
    println("n\tx\treference\tcurrent\tproposed\tcurrent_ok\tproposed_ok")
    for n in ns
        ksd = Distributions.KSDist(n)
        for x in xs
            v = _cdf(ksd, x)
            v1 = Distributions.cdf(ksd, x)
            v2 = Distributions.cdf2(ksd, x)
            eq1 = isapprox(v, v1)
            eq2 = isapprox(v, v2)
            @printf "%d\t%12.6e\t%12.6e\t%12.6e\t%12.6e\t%s\t%s\n" n x v v1 v2 (eq1 ? "o" : "x") (eq2 ? "o" : "x")

            n_all += 1
            if eq1
                n_eq1 += 1
            end
            if eq2
                n_eq2 += 1
            end
        end
    end

    println(STDERR, "1: $n_eq1/$n_all")
    println(STDERR, "2: $n_eq2/$n_all")
end

Thank you.

johnmyleswhite · 2014-06-28T16:16:47Z

This looks like really nice work. Thanks so much for working on this problem. When you say reference, did you check against output or did you read the source code? If the latter, what's the license of the reference code?

coveralls · 2014-06-28T16:17:40Z

Coverage remained the same when pulling 5a20d3d on bicycle1885:ksdist into 9b897a2 on JuliaStats:master.

bicycle1885 · 2014-06-28T16:24:44Z

The reference is C code, so I wrote a thin wrapper in Julia to get the reference values.
The final pull request will not contain the reference code, but I'm not sure whether the license (GNU GPL v3) of the reference program propagates to my translated program.

johnmyleswhite · 2014-06-28T16:25:55Z

If you translated the C to Julia, then the license propagates. If you just called the C code, the license will not propagate.

bicycle1885 · 2014-06-28T16:29:14Z

So I should contact the original author and ask for permission to distribute my port under the MIT license.

bicycle1885 · 2014-06-28T22:45:47Z

I've prepended the GNU GPL v3 license notice to the source code file.

coveralls · 2014-06-28T22:47:26Z

Coverage remained the same when pulling c9bc19e on bicycle1885:ksdist into 9b897a2 on JuliaStats:master.

simonbyrne · 2014-06-30T09:51:50Z

Thanks for looking at this: I was scared off it previously. At the moment we're trying to keep everything in the package under permissive MIT/BSD-style licences. It would be great if you could contact the author for permission to use such a licence for the port.

bicycle1885 · 2014-07-01T03:08:33Z

I've sent an email to the author via Gmail, but it was rejected by the recipient's server...

I'll try again later from another email address.

JLTastet · 2018-07-04T12:41:45Z

@bicycle1885 Sorry for the necrobump, but did you make any progress or get a reply from the author in the past few years?

I am asking because I came across a discrepancy when comparing the distribution to Monte Carlo generated data. I ran a KS test many times on samples generated under the null hypothesis, then histogrammed the values of the test statistic and the corresponding p-values.

See figures below (for N=5). The PDF was computed from the CDF using a centered finite difference.
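
A check of this kind can be sketched as follows. This is an illustrative sketch and not the script used for the figures: the function names, the seed, the number of trials and the step size h are all assumptions. It draws uniform samples (the null hypothesis for a fully specified continuous distribution), computes the two-sided KS statistic for each, and estimates the CDF at a point as the fraction of statistics not exceeding it. centered_diff shows how a PDF can be approximated from any CDF function.

```julia
using Random

# Two-sided KS statistic of a sample against the Uniform(0, 1) CDF.
function ks_statistic_uniform(u::AbstractVector{<:Real})
    n = length(u)
    s = sort(u)
    d = 0.0
    for i in 1:n
        # Largest gap between the empirical CDF and F(x) = x,
        # taken just after and just before each order statistic.
        d = max(d, i / n - s[i], s[i] - (i - 1) / n)
    end
    return d
end

# Monte Carlo estimate of P(D_n <= x) under the null hypothesis.
# `trials` and the default seed are arbitrary choices for this sketch.
function mc_ks_cdf(n::Integer, x::Real; trials::Integer = 200_000,
                   rng = MersenneTwister(1234))
    hits = 0
    for _ in 1:trials
        hits += ks_statistic_uniform(rand(rng, n)) <= x
    end
    return hits / trials
end

# Centered finite difference: approximates F'(x) from a CDF-like function F.
centered_diff(F, x; h = 1e-3) = (F(x + h) - F(x - h)) / (2h)

# Example: estimate the CDF at the point discussed earlier in this thread.
p_hat = mc_ks_cdf(5, 2.256281e-01)

# With Distributions.jl loaded, the PDF could be approximated as, e.g.
#   centered_diff(x -> cdf(KSDist(5), x), 0.3)
```

With 200,000 trials the standard error of p_hat is well below 1e-3, which is small enough to tell apart values that differ in the second significant digit.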

nalimilan · 2018-07-05T12:35:32Z

src/univariate/ksdist.jl

@@ -1,5 +1,37 @@
+# This program is a Julia port of KolmogorovSmirnovDist.c, which is 
+# distributed under the GNU GPL v3 license.


Note that GPL3 isn't compatible with the license of this package, which is MIT. So I'm afraid that code cannot live in Distributions.jl.

EDIT: I've just realized this had already been mentioned, I thought this was a new PR.

I had a look at the paper to see if there was any pseudo-code which could have been the base for a clean-room implementation, but there does not seem to be anything except for a quick description of the Pomeranz recursion formula.

If this issue becomes a blocker for my current project, I could try to implement it from the primary sources, but I cannot promise anything right now.

I also had a quick look at what was available in other languages:

R has ks.test, but it is GPLv3 as well.

SciPy is BSD3-licensed and implements scipy.stats.kstest. I don't know if its computation of the CDF is any better, but if it is then we could probably use it here.

matbesancon · 2019-02-15T15:42:16Z

Hi @bicycle1885, are there any plans to integrate this feature? The main problem seems to be that the original code cannot be ported. Is there some other way this could be done?

WIP: more accurate KSDist function

5a20d3d

add the license term

c9bc19e

lindahua added new-distr and removed new-distr labels Jul 19, 2014

nalimilan reviewed Jul 5, 2018

matbesancon added the wont-merge label (PR will not be merged, kept for archiving purpose) Feb 28, 2019

devmotion mentioned this pull request Nov 25, 2021

Polya-Gamma distribution #1440

Open
