Incredibly small assembly #54

Closed
wbrewer5 opened this issue Sep 30, 2021 · 8 comments

Comments

@wbrewer5

I am trying to assemble a fungal genome (possibly with some bacterial hitchhikers) from 12.16 GB of zipped Nanopore reads and 12.5 GB of zipped Illumina reads. The current issue is that my first assembly using Wengan is only 651 kb in length. I will post my code and output below. Which log files are helpful in finding the issue?

wengan.pl -x ontraw \
  -a M \
  -s pand2_fwd.fastq.gz,pand2_rev.fastq.gz \
  -l pandora_clean_nanopore.fastq.gz \
  -p pandora \
  -t 8 \
  -g 110
pandora.liger.log

@adigenova
Owner

Hi,
Since the assembly finished, one possible explanation for the small assembly size is contamination in the short-read data. Have you checked whether the assembled contigs come from a contaminant with a smaller genome? Can you also estimate the genome size with GenomeScope 2.0 to get an idea of whether the Illumina reads are contaminated?

Best,
Alex

@wbrewer5
Author

wbrewer5 commented Oct 5, 2021

The largest contig is 94 kb and the next largest is in the 20 kb range. There are bacteria inside the fungal sample, but I would expect at least 3 Mb for those genomes. GenomeScope 2.0 estimates 24 Mb, which matches my haslr assembly. I have not used this genome size estimator before, so I will post the output for the Illumina forward reads to see what you think.

@wbrewer5
Author

wbrewer5 commented Oct 5, 2021

property                 min              max
Homozygous (aa)          94.7525%         95.8394%
Heterozygous (ab)        4.16056%         5.24754%
Genome Haploid Length    24,171,583 bp    25,342,343 bp
Genome Repeat Length     20,465,289 bp    21,456,534 bp
Genome Unique Length     3,706,294 bp     3,885,809 bp
Model Fit                26.256%          85.5258%
Read Error Rate          8.30334%         8.30334%
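
For readers who have not used a k-mer based size estimator before, the basic idea behind the "Genome Haploid Length" row can be sketched as total k-mers divided by the depth of the main coverage peak. This is only a naive illustration, not GenomeScope's actual model (which fits a mixture distribution and accounts for heterozygosity and repeats). The function name, the `min_depth` cutoff, and the toy histogram are all assumptions made for the example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Naive k-mer genome size estimate.

    histogram: iterable of (depth, count) pairs, e.g. parsed from a
    k-mer histogram file. Depths below min_depth are treated as
    sequencing errors and ignored (the cutoff is an assumption).
    """
    solid = [(depth, count) for depth, count in histogram if depth >= min_depth]
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers approximates the coverage peak.
    peak_depth = max(solid, key=lambda pair: pair[1])[0]
    # Total k-mer occurrences divided by peak coverage gives the size.
    total_kmers = sum(depth * count for depth, count in solid)
    return total_kmers / peak_depth


# Toy histogram (made-up numbers): an error spike at low depth
# and a coverage peak at depth 20.
toy = [(1, 1000), (2, 200), (19, 50), (20, 100), (21, 50)]
size = estimate_genome_size(toy)
```

A low model fit or a very high read error rate in the real output would be a reason to treat any such estimate with caution.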

@adigenova
Owner

Hi,
It seems that the genome size estimate from the short reads is correct. Can you share the stats of the short-read assembly?
I recommend running WenganD, which usually generates better assemblies than WenganA and WenganM.
Best,
Alex

@wbrewer5
Author

wbrewer5 commented Oct 7, 2021

pandora.minia.41.log
pandora.minia.81.log
pandora.minia.121.log

I do not have access to a computer with enough memory to run WenganD at the moment. Our compute cluster is moving to a new scheduler soon and we are being advised to wait until after the transition before requesting new software.

@adigenova
Owner

The logs show that minia generated 4.8 million contigs with a total assembly length of 1.8 Gb, so the minia assembly is extremely fragmented (average contig length of about 400 bp). By default, Wengan discards contigs shorter than 500 bp, and only contigs larger than 2 kb are used to build the assembly backbone. You can lower the 2 kb threshold (-M 2000); the minimum recommended value is 1 kb (-M 1000), but I think that will not be enough if you assemble the short reads with minia. Because most contigs are discarded by these length constraints, you end up with a much shorter assembly. My recommendation is to try WenganA or WenganD. As your genome is not large, WenganD might be able to finish on a machine with 50-60 GB of RAM.
Best,

Alex
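
To check how much of a short-read assembly would survive the length cutoffs described above, a small script along these lines could help. It is a minimal sketch, not part of Wengan: the function names are hypothetical, the 500 bp and 2000 bp defaults are taken from this comment, and treating the cutoffs as inclusive (`>=`) is an assumption.

```python
def fasta_lengths(lines):
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths, min_len=500, backbone_len=2000):
    """Summarize contig lengths against the two cutoffs (inclusive here)."""
    lengths = list(lengths)
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths) if lengths else 0.0,
        "bp_at_least_min_len": sum(n for n in lengths if n >= min_len),
        "bp_at_least_backbone_len": sum(n for n in lengths if n >= backbone_len),
    }


# Mean contig length from the figures quoted in the comment.
mean_from_logs = 1.8e9 / 4.8e6  # bases / contigs

# Toy example with made-up contig lengths.
stats = summarize([300, 400, 600, 2500])
```

For a real contig file, the lengths could be obtained with `summarize(fasta_lengths(open(path)))`, where `path` is whatever FASTA file the short-read assembler wrote.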

@wbrewer5
Author

wbrewer5 commented Nov 3, 2021

I switched to my institution's compute cluster to use WenganD. Are you familiar with this error message?

gzip: stdout: Broken pipe

Below is my submission script.

/lustre/haven/user/wbrewer5/wengan/wengan.pl -x ontraw -a D \
  -s /lustre/haven/user/wbrewer5/pandora/assembly/fastq/zipped/Pand2_fwd.fastq.gz,/lustre/haven/user/wbrewer5/pandora/assembly/fastq/zipped/Pand2_rev.fastq.gz \
  -l /lustre/haven/user/wbrewer5/pandora/assembly/fastq/zipped/pandora_clean_nanopore.fastq.gz \
  -p pandora \
  -t 24 \
  -g 20

@adigenova
Owner

Well, the message is not very informative. Perhaps the job was killed?
Best,
Alex
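
A "Broken pipe" from gzip generally means that the process reading its output exited first, which fits a killed job. One way to rule out the input files themselves is to stream each one to the end, as in the sketch below. This is only a diagnostic idea and an assumption on my part, not something Wengan provides; the function name is hypothetical.

```python
import gzip
import zlib


def count_fastq_records(path):
    """Stream a gzipped FASTQ to the end.

    Returns (records, ok). ok is False if decompression fails
    (for example a truncated file) or the line count is not a
    multiple of four.
    """
    lines = 0
    try:
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                lines += 1
    except (EOFError, OSError, zlib.error):
        # Truncated or corrupt gzip stream.
        return lines // 4, False
    return lines // 4, lines % 4 == 0
```

If every input file reads cleanly, the scheduler's own job log (memory or walltime limits) would be the next place to look.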
