Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: more accurate KSDist function #254

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

bicycle1885
Copy link
Contributor

This is a work-in-progress implementation of Simard and L'Ecuyer (2011) meta-algorithm http://www.jstatsoft.org/v39/i11/paper.
The current commit contains debug code and duplicated functions in order to check the differences between the current and the proposed implementation.

I've used KolmogorovSmirnovDist.c as a reference, which is an implementation of the meta-algorithm. http://simul.iro.umontreal.ca/ksdir/
At some inputs, there seems to be serious dispersion between the current one and the reference.
For example when n = 5 and x = 2.256281e-01 the CDFs are:

reference current proposed
8.852118e-02 9.711784e-02 9.711784e-02

In this case, the cdf_durbin function is in charge, which is reused in the proposed implementation.

For other example when n = 1000000 and x = -3.366834e-02 the CDFs are completely different:

reference current proposed
1.000000e+00 0.000000e+00 1.000000e+00

What I'm wondering are:

  • in the first example, is the current implementation wrong?
  • in the second example, returning zero looks to be sensible, am I right?

Here is the complete log of comparison: out.txt
(CAUTION - it's 696KB)

And here is the way for quick checking and reproducing it:

# check out my branch
curl -O -O http://simul.iro.umontreal.ca/ksdir/KolmogorovSmirnovDist.{c,h}
clang -fPIC -shared -dynamiclib KolmogorovSmirnovDist.c -o libksd.dylib
julia check.jl > out.txt

check.jl:

using Distributions

function _cdf(d, x)
    ccall((:KScdf, "libksd"), Float64, (Int, Float64), d.n, x)
end

begin
    ns = vcat([1:50], [100, 200, 500, 1000, 2000, 10_000, 50_000, 100_000, 500_000, 1_000_000])
    xs = linspace(-0.1, 1.1, 200)
    n_all = 0
    n_eq1 = 0
    n_eq2 = 0
    println("n\tx\treference\tcurrent\tproposed\tcurrent_ok\tproposed_ok")
    for n in ns
        ksd = Distributions.KSDist(n)
        for x in xs
            v = _cdf(ksd, x)
            v1 = Distributions.cdf(ksd, x)
            v2 = Distributions.cdf2(ksd, x)
            eq1 = isapprox(v, v1)
            eq2 = isapprox(v, v2)
            @printf "%d\t%12.6e\t%12.6e\t%12.6e\t%12.6e\t%s\t%s\n" n x v v1 v2 (eq1 ? "o" : "x") (eq2 ? "o" : "x")

            n_all += 1
            if eq1
                n_eq1 += 1
            end
            if eq2
                n_eq2 += 1
            end
        end
    end

    println(STDERR, "1: $n_eq1/$n_all")
    println(STDERR, "2: $n_eq2/$n_all")
end

Thank you.

@johnmyleswhite
Copy link
Member

This looks like really nice work. Thanks so much for working on this problem. When you say reference, did you check against output or did you read the source code? If the latter, what's the license of the reference code?

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 5a20d3d on bicycle1885:ksdist into 9b897a2 on JuliaStats:master.

@bicycle1885
Copy link
Contributor Author

The reference is a C code, so I wrote a thin wrapper in Julia to get the reference values.
The final pull request will not contain the reference code, but I'm not sure whether the license (GNU GPL v3) of the reference program propagates to my translated program

@johnmyleswhite
Copy link
Member

If you translated the C to Julia, then the license propagates. If you just called the C code, the license will not propagate.

@bicycle1885
Copy link
Contributor Author

So it would be nice for me to contact the original author and get the permission to distribute my port under the MIT license.

@bicycle1885
Copy link
Contributor Author

I've prepended the license term of GNU GPL v3 to the source code file.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling c9bc19e on bicycle1885:ksdist into 9b897a2 on JuliaStats:master.

@simonbyrne
Copy link
Member

Thanks for looking at this: I was scared off it previously. At the moment we're trying to keep everything in the package under permissive MIT/BSD-style licences. It would be great if you could contact the author for permission to use such a licence for the port.

@bicycle1885
Copy link
Contributor Author

I've sent an email to the author via Gmail but it was rejected by the recipient server...

I'll try later with other email address.

@JLTastet
Copy link

JLTastet commented Jul 4, 2018

@bicycle1885 Sorry for the necrobump, but did you make any progress or got a reply from the author in the past few years ?

I am asking this because I came across a discrepancy when comparing the distribution to Monte-Carlo generated data. I ran a KS test many times for samples generated under the null hypothesis, then histogrammed the test statistics values and the corresponding p-values.

See figures below (for N=5). The PDF was computed from the CDF using centered finite difference.

kolmogorov_smirnov
ks_pvalues

@@ -1,5 +1,37 @@
# This program is a Julia port of KolmogorovSmirnovDist.c, which is
# distributed under the GNU GPL v3 license.
Copy link
Member

@nalimilan nalimilan Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that GPL3 isn't compatible with the license of this package, which is MIT. So I'm afraid that code cannot live in Distributions.jl.

EDIT: I've just realized this had already been mentioned, I thought this was a new PR.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at the paper to see if there was any pseudo-code which could have been the base for a clean-room implementation, but there does not seem to be anything except for a quick description of the Pomeranz recursion formula.

If this issue becomes a blocker for my current project, I could try to implement it from the primary sources, but I cannot promise anything right now.

I also had a quick look at what was available in other languages:

  • R has ks.test, but it is GPLv3 as well.
  • SciPy is BSD3-licensed and implements scipy.stats.kstest. I don't know if its computation of the CDF is any better, but if it is then we could probably use it here.

@matbesancon
Copy link
Member

hi @bicycle1885, are there some plans for this feature to be integrated? The main problem seems to be the impossibility to port the original code. Some other way this could be done?

@matbesancon matbesancon added the wont-merge PR will not be merged, kept for archiving purpose label Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wont-merge PR will not be merged, kept for archiving purpose
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants