-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crumble output is not always sorted #9
Comments
That's a bug then. It's meant to keep track and maintain the sort order. I'll need to investigate. Maybe it's failing to detect the change of chromosome correctly. Does it also happen when not threading? That's likely as the threads are I think only decode and encode and not the algorithm. Is this a public data set? |
I can unfortunately not share the data, since it is not public. But this only happens in rare cases, since I have run hundreds of samples through crumble without issues, and this is only the second time it has happened. I can confirm that this also happens without multithreading. |
There's nothing hugely glaring from the source, but I'm wondering if the chromosome argument at
https://github.com/jkbonfield/crumble/blob/develop/snp_score.c#L1802 should be For the read which is clearly out of position - the first on chr Y - is there anything obvious about that that could trip up the code that tries to maintain sort order? Eg is it the same coordinate as the previous or next sequence in X listed in that BAM file? Are there deep pileups around this point with everything having the same coordinate? I'm assuming given how rare this is that it's a specific corner case that my code isn't coping with. So far I've failed to reproduce it. If I can't, it may help if I can get some fabricated data that seems to trigger the bug, but not knowing your data makes that tricky. Are you amenable to doing some debugging at your end? I'm thinking this is some coordinate based thing, so setting all sequence bases to N and all qualities to e.g. 9, all flags, mqual, tlen, pnext, rnext to 0, all cigars to something like "sequence-length M" and ditching all auxiliary tags should in theory still trigger it. (I could produce a script to do this.) If that still fails, then it's a starting point towards identifying a test case. However such data may still leak something confidential (mapping coords alone give structural info and maybe data on large indels). |
Update: by generating random data and thrashing it I've reproduced the bug, so fix hopefully incoming soon. Thanks for the bug report. Edit: indeed it was that suspicious line I commented on above I think. More random testing ongoing, but looks better. |
Can you please let me know if this (develop branch) also fixes things for you? If it does I'll move this to master and make a new release. For what it's worth, my little test script was this:
I just ran it in a loop and looked for differences in output. |
I am having a hard time compiling this from the repository. Or actually, the compilation seems to be working, but the program just stalls upon execution. The Compiling from your released tar-ball works fine, but I cannot make the compiled binary from either the master or the develop branch work |
Can I ask you to upload a release tar-ball in this thread, so I can check if the bug is fixed? |
Attaching a tarball. Note this is just from "make dist" in develop, but hasn't had the version numbers changed so it'll still report to be 0.8.2. |
Hmm, reopened as you're right. My fix has other consequences! More work to do then. Sorry. |
Yes, it's the same for this tar-ball. It's just looping |
The fix should have been to replace INT_MAX with tid+1 not tid, but I'm checking more thoroughly this time. My apololgies. |
Okay, looking forward to checking the fix |
It'll be a while to do my longer test, but my change is trivial:
|
I've pushed a new commit to develop now. I tested this on a more complete data file (6Gb of data covering all chromosomes) and also tested it once more on my randomly generated SAM files. My apologies for insufficient testing earlier. Can you please verify this latest one fixes the problem for you? |
Everything works just as expected with the new commit. Thank you for excellent support! |
Thanks for the bug report and assistance in validating it. Given there are no new changes imminently incoming for Crumble (and hopefully no new bug reports just around the corner), I made a new release (0.8.3) with this bug fix in. |
Running crumble 0.8.1 with the following command outputs a single record from chrY prematurely. Is there a way to avoid this?
I am running crumble in a scatter-gather fashion, so only the last five chromosomes are included in this chunk. A workaround could therefore be to run each chromosome by itself, but I would like to avoid this, if there is a proper solution.
The text was updated successfully, but these errors were encountered: