Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crumble output is not always sorted #9

Closed
mhtorp opened this issue Jun 14, 2019 · 16 comments
Closed

Crumble output is not always sorted #9

mhtorp opened this issue Jun 14, 2019 · 16 comments
Labels

Comments

@mhtorp
Copy link

mhtorp commented Jun 14, 2019

Running crumble 0.8.1 with the following command outputs a single record from chrY prematurely. Is there a way to avoid this?

$ crumble -9 -O BAM,nthreads=4 $in.bam $out.bam
$ samtools view $out.bam | cut -f3 | uniq -c
1566620 20
 637787 21
1541659 22
2365879 X
      1 Y
      4 X
   1386 Y

I am running crumble in a scatter-gather fashion, so only the last five chromosomes are included in this chunk. A workaround could therefore be to run each chromosome by itself, but I would like to avoid this, if there is a proper solution.

@jkbonfield
Copy link
Owner

That's a bug then. It's meant to keep track and maintain the sort order. I'll need to investigate. Maybe it's failing to detect the change of chromosome correctly.

Does it also happen when not threading? That's likely as the threads are I think only decode and encode and not the algorithm.

Is this a public data set?

@jkbonfield jkbonfield added the bug label Jun 14, 2019
@mhtorp
Copy link
Author

mhtorp commented Jun 14, 2019

I can unfortunately not share the data, since it is not public. But this only happens in rare cases, since I have run hundreds of samples through crumble without issues, and this is only the second time it has happened.

I can confirm that this also happens without multithreading.

@jkbonfield
Copy link
Owner

jkbonfield commented Jun 14, 2019

There's nothing hugely glaring from the source, but I'm wondering if the chromosome argument at

	if (last_flush_before != left_most
	    && flush_bam_list(&cd, p, b_hist, INT_MAX, left_most, out, header) < 0)
            return -1;

https://github.com/jkbonfield/crumble/blob/develop/snp_score.c#L1802

should be tid instead of INT_MAX.

For the read which is clearly out of position - the first on chr Y - is there anything obvious about that that could trip up the code that tries to maintain sort order? Eg is it the same coordinate as the previous or next sequence in X listed in that BAM file? Are there deep pileups around this point with everything having the same coordinate? I'm assuming given how rare this is that it's a specific corner case that my code isn't coping with.

So far I've failed to reproduce it. If I can't, it may help if I can get some fabricated data that seems to trigger the bug, but not knowing your data makes that tricky. Are you amenable to doing some debugging at your end? I'm thinking this is some coordinate based thing, so setting all sequence bases to N and all qualities to e.g. 9, all flags, mqual, tlen, pnext, rnext to 0, all cigars to something like "sequence-length M" and ditching all auxiliary tags should in theory still trigger it. (I could produce a script to do this.) If that still fails, then it's a starting point towards identifying a test case. However such data may still leak something confidential (mapping coords alone give structural info and maybe data on large indels).

@jkbonfield
Copy link
Owner

jkbonfield commented Jun 17, 2019

Update: by generating random data and thrashing it I've reproduced the bug, so fix hopefully incoming soon. Thanks for the bug report.

Edit: indeed it was that suspicious line I commented on above I think. More random testing ongoing, but looks better.

@jkbonfield
Copy link
Owner

Can you please let me know if this (develop branch) also fixes things for you? If it does I'll move this to master and make a new release.

For what it's worth, my little test script was this:

#!/usr/bin/perl -w

use strict;

srand(shift(@ARGV));

my $len=10;
print "\@HD\n";
print "\@SQ\tSN:Chr1\tLN:$len\n";
print "\@SQ\tSN:Chr2\tLN:$len\n";
print "\@SQ\tSN:Chr3\tLN:$len\n";

for (my $i = 1; $i <= 4; $i++) {
    my $pos = 1;
    for (my $j = 0; $j < 30; $j++) {
	$pos += int(rand()*($len/2-1));
	last if ($pos+1 > $len);
	if ($i == 4) {
	    print "x.$i.$j\t4\t*\t0\t0\t*\t*\t0\t0\tAA\t99\n";
	} else {
	    if (rand() < 0.5) {
		print "x.$i.$j\t4\tChr$i\t$pos\t0\t*\t*\t0\t0\tAA\t99\n";
	    } else {
		print "x.$i.$j\t0\tChr$i\t$pos\t0\t2M\t*\t0\t0\tAA\t99\n";
	    }
	}
    }
}

I just ran it in a loop and looked for differences in output.

@mhtorp
Copy link
Author

mhtorp commented Jun 17, 2019

I am having a hard time compiling this from the repository. Or actually, the compilation seems to be working, but the program just stalls upon execution. The ./crumble -h screen is working fine, but ./crumble -9 -O BAM,nthreads=4 $in.bam $out.bam is hanging in a seemingly infinite loop...

Compiling from your released tar-ball works fine, but I cannot make the compiled binary from either the master or the develop branch work

@mhtorp
Copy link
Author

mhtorp commented Jun 17, 2019

Can I ask you to upload a release tar-ball in this thread, so I can check if the bug is fixed?
Thank you for helping out, and very nice job with identifying the issue so fast!

@jkbonfield
Copy link
Owner

Attaching a tarball. Note this is just from "make dist" in develop, but hasn't had the version numbers changed so it'll still report to be 0.8.2.

crumble-0.8.2+.tar.gz

@jkbonfield jkbonfield reopened this Jun 17, 2019
@jkbonfield
Copy link
Owner

Hmm, reopened as you're right. My fix has other consequences! More work to do then. Sorry.

@mhtorp
Copy link
Author

mhtorp commented Jun 17, 2019

Yes, it's the same for this tar-ball. It's just looping

@jkbonfield
Copy link
Owner

The fix should have been to replace INT_MAX with tid+1 not tid, but I'm checking more thoroughly this time. My apololgies.

@mhtorp
Copy link
Author

mhtorp commented Jun 17, 2019

Okay, looking forward to checking the fix

@jkbonfield
Copy link
Owner

It'll be a while to do my longer test, but my change is trivial:

diff --git a/snp_score.c b/snp_score.c
index 034cde5..b211a0f 100644
--- a/snp_score.c
+++ b/snp_score.c
@@ -1799,7 +1799,7 @@ int transcode(cram_lossy_params *p, samFile *in, samFile *out,
 
        // Flush history (preserving sort order), periodically only.
        if (last_flush_before != left_most
-           && flush_bam_list(&cd, p, b_hist, tid, left_most, out, header) < 0)
+           && flush_bam_list(&cd, p, b_hist, tid+1, left_most, out, header) < 0)
            return -1;
 
        last_flush_before = left_most;

@jkbonfield
Copy link
Owner

I've pushed a new commit to develop now.

I tested this on a more complete data file (6Gb of data covering all chromosomes) and also tested it once more on my randomly generated SAM files.

My apologies for insufficient testing earlier. Can you please verify this latest one fixes the problem for you?

@mhtorp
Copy link
Author

mhtorp commented Jun 18, 2019

Everything works just as expected with the new commit. Thank you for excellent support!

@mhtorp mhtorp closed this as completed Jun 18, 2019
@jkbonfield
Copy link
Owner

Thanks for the bug report and assistance in validating it.

Given there are no new changes imminently incoming for Crumble (and hopefully no new bug reports just around the corner), I made a new release (0.8.3) with this bug fix in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants