Funannotate exits in predict: bedtools intersect error #522

@felipe797

Hi Jon!
First of all, thanks for the pipeline, it's been really useful to me!

So, I ran into this problem in predict, where it's quitting after a bedtools intersect error. From logfile:

[09:39 AM]: Generating protein fasta files from 22,573 EVM models
[09:39 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[09:39 AM]: CMD ERROR: bedtools intersect -f 0.9 -a /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/evm.round1.gff3 -b /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/repeatmasker.bed
[09:39 AM]: (None, b'ERROR: Received illegal bin number -1 from getBin call.\nMaximum values is: 2396745\nThis typically means that your coordinates are\nnegative or too large to represent in the data\nstructure bedtools uses to find intersections.ERROR: Unable to add record to tree.\n')

I'm running the new versions of both funannotate (v1.8.2) and bedtools (v2.29.2).
After some troubleshooting, I found that bedtools intersect can have problems with large scaffolds, and their readthedocs recommends the following:

Note

If you are trying to intersect very large files and are having trouble with excessive memory usage, please presort your data by chromosome and then by start position (e.g., sort -k1,1 -k2,2n in.bed > in.sorted.bed for BED files) and then use the -sorted option. This invokes a memory-efficient algorithm designed for large files. This algorithm has been substantially improved in recent (>=2.18.0) releases.

I sorted both my repeatmasker.bed and evm.round1.gff3 and ran bedtools intersect manually on the sorted files, using the same command funannotate predict did, and the same error popped up:

bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed

ERROR: Received illegal bin number -1 from getBin call.
Maximum values is: 2396745
This typically means that your coordinates are
negative or too large to represent in the data
structure bedtools uses to find intersections.ERROR: Unable to add record to tree.
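Since the message points at negative or oversized coordinates, one way to narrow things down is to scan both files for suspicious records before calling bedtools. Below is a minimal sketch in Python; the function name `find_suspect_records` and the `max_coord` parameter are hypothetical, and the actual upper limit bedtools enforces is not stated in the log, so `max_coord` is left for the user to choose.

```python
def find_suspect_records(path, start_col, end_col, max_coord=None):
    """Yield (line_number, reason) for records with implausible coordinates.

    start_col and end_col are 0-based column indexes:
    (1, 2) for BED, (3, 4) for GFF3.
    max_coord is an optional, user-chosen upper bound (an assumption,
    not a value taken from bedtools itself).
    """
    with open(path) as handle:
        for lineno, line in enumerate(handle, 1):
            if line.startswith('##FASTA'):
                break  # an embedded FASTA section carries no coordinates
            if not line.strip() or line.startswith('#'):
                continue  # skip blank lines and comments/directives
            cols = line.rstrip('\n').split('\t')
            try:
                start, end = int(cols[start_col]), int(cols[end_col])
            except (IndexError, ValueError):
                yield lineno, 'missing or non-integer coordinate'
                continue
            if start < 0 or end < 0:
                yield lineno, 'negative coordinate'
            elif end < start:
                yield lineno, 'end before start'
            elif max_coord is not None and end > max_coord:
                yield lineno, 'coordinate above max_coord'
```

For example, `list(find_suspect_records('repeatmasker.bed', 1, 2))` would list any BED lines with negative or inverted intervals, and `find_suspect_records('evm.round1.gff3', 3, 4)` does the same for the GFF3 file.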

But the same command ran smoothly on the sorted files once I added the -sorted flag:

bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed -sorted

I'm not sure, but one suggestion would be a conditional statement that checks whether the input files are sorted by chromosome and start position and, if so, runs the command with -sorted.

In funannotate/library.py, line 5656:

def validate_tRNA(input, genes, gaps, output):
    cmd = ['bedtools', 'intersect', '-v', '-a', input, '-b', genes]
    if gaps:
        cmd.append(gaps)
    runSubprocess2(cmd, '.', log, output)
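The conditional check suggested above could look roughly like the sketch below. This is not funannotate code: `is_position_sorted` and `build_intersect_cmd` are hypothetical helpers, and the check assumes the files were sorted the way the bedtools note describes (chromosome name in plain lexicographic order, then numeric start), with both files using the same chromosome order.

```python
def is_position_sorted(path, start_col):
    """Return True if records are ordered by chromosome name, then start.

    start_col is the 0-based start column: 1 for BED, 3 for GFF3.
    Chromosome names are compared as plain strings, which matches
    `sort -k1,1` only under a C/byte-order locale (an assumption).
    """
    prev = None
    with open(path) as handle:
        for line in handle:
            if line.startswith('##FASTA'):
                break  # stop at an embedded FASTA section
            if not line.strip() or line.startswith('#'):
                continue  # skip blank lines and comments/directives
            cols = line.rstrip('\n').split('\t')
            key = (cols[0], int(cols[start_col]))
            if prev is not None and key < prev:
                return False
            prev = key
    return True


def build_intersect_cmd(a_gff3, b_bed, fraction='0.9'):
    """Build the intersect command, adding -sorted only when both inputs qualify."""
    cmd = ['bedtools', 'intersect', '-f', fraction, '-a', a_gff3, '-b', b_bed]
    if is_position_sorted(a_gff3, 3) and is_position_sorted(b_bed, 1):
        cmd.append('-sorted')
    return cmd
```

With sorted inputs, `build_intersect_cmd('evm.round1.gff3', 'repeatmasker.sorted.bed')` returns the same command as the manual run above with `-sorted` appended; with unsorted inputs it falls back to the plain command.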

Let me know. Meanwhile, can you think of an alternative way to get around this without editing source? Thank you so much in advance!
