-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: more accurate KSDist function #254
base: master
Are you sure you want to change the base?
Conversation
This looks like really nice work. Thanks so much for working on this problem. When you say reference, did you check against output or did you read the source code? If the latter, what's the license of the reference code? |
The reference is a C code, so I wrote a thin wrapper in Julia to get the reference values. |
If you translated the C to Julia, then the license propagates. If you just called the C code, the license will not propagate. |
So it would be nice for me to contact the original author and get the permission to distribute my port under the MIT license. |
I've prepended the license term of GNU GPL v3 to the source code file. |
Thanks for looking at this: I was scared off it previously. At the moment we're trying to keep everything in the package under permissive MIT/BSD-style licences. It would be great if you could contact the author for permission to use such a licence for the port. |
I've sent an email to the author via Gmail but it was rejected by the recipient server... I'll try later with other email address. |
@bicycle1885 Sorry for the necrobump, but did you make any progress or got a reply from the author in the past few years ? I am asking this because I came across a discrepancy when comparing the distribution to Monte-Carlo generated data. I ran a KS test many times for samples generated under the null hypothesis, then histogrammed the test statistics values and the corresponding p-values. See figures below (for N=5). The PDF was computed from the CDF using centered finite difference. |
@@ -1,5 +1,37 @@ | |||
# This program is a Julia port of KolmogorovSmirnovDist.c, which is | |||
# distributed under the GNU GPL v3 license. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that GPL3 isn't compatible with the license of this package, which is MIT. So I'm afraid that code cannot live in Distributions.jl.
EDIT: I've just realized this had already been mentioned, I thought this was a new PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at the paper to see if there was any pseudo-code which could have been the base for a clean-room implementation, but there does not seem to be anything except for a quick description of the Pomeranz recursion formula.
If this issue becomes a blocker for my current project, I could try to implement it from the primary sources, but I cannot promise anything right now.
I also had a quick look at what was available in other languages:
- R has
ks.test
, but it is GPLv3 as well. - SciPy is BSD3-licensed and implements
scipy.stats.kstest
. I don't know if its computation of the CDF is any better, but if it is then we could probably use it here.
hi @bicycle1885, are there some plans for this feature to be integrated? The main problem seems to be the impossibility to port the original code. Some other way this could be done? |
This is a work-in-progress implementation of Simard and L'Ecuyer (2011) meta-algorithm http://www.jstatsoft.org/v39/i11/paper.
The current commit contains debug code and duplicated functions in order to check the differences between the current and the proposed implementation.
I've used KolmogorovSmirnovDist.c as a reference, which is an implementation of the meta-algorithm. http://simul.iro.umontreal.ca/ksdir/
At some inputs, there seems to be serious dispersion between the current one and the reference.
For example when n = 5 and x = 2.256281e-01 the CDFs are:
In this case, the
cdf_durbin
function is in charge, which is reused in the proposed implementation.For other example when n = 1000000 and x = -3.366834e-02 the CDFs are completely different:
What I'm wondering are:
Here is the complete log of comparison: out.txt
(CAUTION - it's 696KB)
And here is the way for quick checking and reproducing it:
check.jl:
Thank you.