Java OutOfBoundsException on some genomes #8

Open
BrigidaGallone opened this issue Feb 24, 2023 · 1 comment

Comments

@BrigidaGallone

Hello,

I am using the latest version of the pipeline (v1.0.3) and I am testing a large group of genomes from NCBI (highly variable in quality) using the profile module with both PRO and BUSCO.
For some genomes I got a Java exception as follows:
[Screenshot 2023-02-24 at 11 45 09]

In this case, the genome GCA_017580835.1 completed PRO profiling without errors, but BUSCO profiling failed.
A few other genomes failed with the same error during PRO profiling:
GCA_900068945.1
GCA_900068915.1
GCA_900069095.1
GCA_900068965.1
GCA_900068985.1
GCA_018221805.1
GCA_900068975.1
GCA_900068955.1

What does the error mean, and do you have any idea what is causing it?

Thanks a lot for your help and the amazing pipeline!

Best,
Brigida

@endixk
Member

endixk commented Feb 28, 2023

Dear Brigida,

Thank you for reaching out!
Based on the error message you provided, it appears that the result file of the fastBlockSearch run is corrupted or improperly formatted.
You can run the pipeline single-threaded with the --dev option to identify the problematic sub-command.

I attempted to reproduce the error using the assemblies associated with the accession numbers you provided.
However, I was able to successfully run both BUSCO profiling of GCA_017580835.1 and PRO profiling of GCA_900068945.1 without any errors on my system.
If possible, please provide the assembly files that caused the issue for further investigation.

One common feature I could find among these assemblies is that they contain a large number of extremely short DNA contigs.
This significantly slowed down the run on my system, and may be the cause of the error you reported.
My hypothesis is that removing FASTA entries shorter than a given length threshold (e.g., 1,000 bp) before profiling may resolve this issue.
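
A minimal sketch of such a length filter is below, in case it helps with testing the hypothesis. It is not part of the pipeline: the function names, the 1,000 bp default, and the file names in the usage comment are all illustrative assumptions, and it expects plain (uncompressed) FASTA input.

```python
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short_contigs(
    lines: Iterable[str], min_length: int = 1000
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose sequence has at least min_length bases."""
    for header, seq in read_fasta(lines):
        if len(seq) >= min_length:
            yield header, seq


def write_fasta(
    records: Iterable[Tuple[str, str]], handle: TextIO, width: int = 80
) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Example usage (hypothetical file names):
# with open("assembly.fna") as src, open("assembly.min1000.fna", "w") as dst:
#     write_fasta(filter_short_contigs(src, min_length=1000), dst)
```

The filtered file can then be profiled in place of the original assembly to check whether the exception still occurs.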
