Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a better error message when the reference is too long for indexing #1651

Open
lbergelson opened this issue Feb 13, 2023 · 6 comments
Open
Labels
long reference Bugs and issues related to long reference sequences.

Comments

@lbergelson
Copy link
Member

There have been two seperate issues this week with people running into a confusing indexing error when using a long reference. We should give a better error. We should ideally also identify this problem upfront instead of crashing late.

See broadinstitute/gatk#8192 and this gatk forum post

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770
        at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:142)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
        at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
        at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:233)
        at org.broadinstitute.hellbender.utils.variant.writers.GVCFWriter.close(GVCFWriter.java:71)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.closeTool(HaplotypeCaller.java:277)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1101)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
@hjt1129
Copy link

hjt1129 commented Feb 14, 2023

hi, we have also meet this problem. So one question is, if we split the large chromosome in reference genome, the final location information of SNP on splitted chromosome will be different with the gff annotation file of previous vision reference genome?

@lbergelson
Copy link
Member Author

@hjt1129 Yes, it's not a perfect solution. It definitely complicates annotation and comparison. You would need to either reverse the splitting before comparing with the gff or perform the same split on the gff in order to compare.

We'd like to support CSI for vcf but we haven't had the time to do so.

@jd3234
Copy link

jd3234 commented May 17, 2023

I have also run into this problem trying to use VariantQC for a VCF based on a wheat genome. These very large plant genomes require csi indices. Any chance to support this from your side? Plant world is catching up with sequencing ;)

@WimSpee
Copy link

WimSpee commented Aug 16, 2023

@lbergelson The --create-output-variant-index false workaround unfortunately does not work.
Still same Index 32770 out of bounds for length 32770 error.

This seems to be because GATK v4.4.0.0 HaplotypeCaller requires a BAI index file next to the input CRAM file.
When no CRAI index file is next to the input CRAM file this error is given.

org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=285212672
***********************************************************************

A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.
Please index all input files:

BAI index is not possible because of the chromosome sizes.

I am not sure if an index is really needed for the input CRAM file for HaplotypeCaller to do its work.
If it would be possible for HaplotypeCaller to do it work without an input CRAM index, that would be a preferable workaround for us, compared to splitting the chromosomes. Next to of course CRAI support :)

Thank you.
Full command and error

gatk --java-options "-Xmx20g" HaplotypeCaller --create-output-variant-index false -R reference.fa -I input.cram -O output.gvcf.gz -ERC GVCF

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770
        at htsjdk.samtools.CRAMBAIIndexer$CRAMBAIIndexBuilder.processBAIEntry(CRAMBAIIndexer.java:391)
        at htsjdk.samtools.CRAMBAIIndexer$CRAMBAIIndexBuilder.access$200(CRAMBAIIndexer.java:254)
        at htsjdk.samtools.CRAMBAIIndexer.processBAIEntry(CRAMBAIIndexer.java:161)
        at htsjdk.samtools.cram.CRAIIndex.openCraiFileAsBaiStream(CRAIIndex.java:89)
        at htsjdk.samtools.SamIndexes.asBaiSeekableStreamOrNull(SamIndexes.java:94)
        at htsjdk.samtools.CRAMFileReader.initWithStreams(CRAMFileReader.java:203)
        at htsjdk.samtools.CRAMFileReader.<init>(CRAMFileReader.java:194)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:432)
        at htsjdk.samtools.SamReaderFactory.open(SamReaderFactory.java:106)
        at org.broadinstitute.hellbender.engine.ReadsPathDataSource.<init>(ReadsPathDataSource.java:245)
        at org.broadinstitute.hellbender.engine.ReadsPathDataSource.<init>(ReadsPathDataSource.java:181)
        at org.broadinstitute.hellbender.engine.GATKTool.initializeReads(GATKTool.java:453)
        at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:724)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:79)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:147)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

@lbergelson
Copy link
Member Author

@WimSpee I'm sorry. I gave you bad advice. I thought HaplotypeCaller would run without an index if you didn't give it any -L argument, most GATK tools do. It looks like it does some complicated internal sharding though, which ends up using an index. I don't see any easy work around here. It would be great to add CRAI/CSI support but I don't have any timeline for it.

@WimSpee
Copy link

WimSpee commented Dec 18, 2024

@lbergelson Just to let you know, and other people finding this issue, the workaround you suggested now works in GATK 4.6.1.0.

Adding --create-output-variant-index false now enables GATK 4.6.1.0. HaplotypeCaller to create a gVCF file for a BAM file of a species having 2Gb chromosomes.

Maybe this can also be added to the HaplotypeCaller documentation web page. Since the error still isn't very informative.

GATK 4.6.1.0. GenomicsDbImport (and thus GenotypeGVCFs) can't accept those gVCF files, because they require a TBI index, which can't be created, for those files.

A CSI index can be created for the gVCF files that GATK 4.6.1.0. HaplotypeCaller created (via e.g. bcftools index)

Multi-sample variant calling is then possible e.g. via an HTSLIB based tool like GLnexus.

Thank you , and the (other) GATK developers for making this workaround work for people working on species with 500Mb+ chromosomes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
long reference Bugs and issues related to long reference sequences.
Projects
None yet
Development

No branches or pull requests

4 participants