-
Notifications
You must be signed in to change notification settings - Fork 52
Example 5
We will demonstrate the fungus mitogenome assembly using a public real dataset, SRR5803928. It's a WGS dataset of Lactarius trivialis, a common Basidiomycota fungi species.
Let's download 150M bases of reads from Genbank. You have to install sra-tools (https://github.com/ncbi/sra-tools) before you use fastq-dump
.
fastq-dump --origfmt --split-files -X 500000 --gzip SRR5803928
Alternatively, you can download a reduced dataset from GetOrganelleGallery using wget
:
wget https://github.com/Kinggerm/GetOrganelleGallery/blob/master/Test/reads/SRR5803928_1.fastq.gz
wget https://github.com/Kinggerm/GetOrganelleGallery/blob/master/Test/reads/SRR5803928_2.fastq.gz
# use gitlab if above links fail, e.g. https://gitlab.com/Kinggerm/GetOrganelleGallery/-/raw/master/Test/reads/SRR5201683_1.fq.gz
# then check the integrity of downloaded file using md5sum:
md5sum SRR5803928*.fastq.gz
# 34ffbeb14d0b824b98b04929e828d3f9 SRR5803928_1.fastq.gz
# d9228260fc3516de38284b252a26ead7 SRR5803928_2.fastq.gz
# Please re-download the reads if md5 values unmatched
Let's conduct the assembly (Memory: ~2.5G; Duration: ~140 sec):
get_organelle_from_reads.py -1 SRR5803928_1.fastq.gz -2 SRR5803928_2.fastq.gz -F fungus_mt -o SRR5803928-mitogenome
It should finish with main output files in the output directory SRR5803928-mitogenome
:
-
get_org.log.txt the log file, click to expand
GetOrganelle v1.7.0-beta5 # file names changed according to 1.7.4-pre
get_organelle_from_reads.py assembles organelle genomes from genome skimming data.
Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.
Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) [GCC 7.3.0]
PYTHON LIBS: numpy 1.18.1; sympy 1.5.1; scipy 1.4.1; psutil 5.7.0
DEPENDENCIES: Bowtie2 2.3.5.1; SPAdes 3.12.0; Blast 2.9.0; Bandage 0.8.1
SEED DB: fungus_mt 0.0.0
LABEL DB: fungus_mt 0.0.0
WORKING DIR: /home/data1
/root/.pyenv/versions/miniconda3-4.3.30/bin/get_organelle_from_reads.py -1 SRR5803928_1.fastq.gz -2 SRR5803928_2.fastq.gz -F fungus_mt -o SRR5803928-mitogenome
2020-06-05 05:02:38,138 - INFO: Pre-reading fastq ...
2020-06-05 05:02:38,138 - INFO: Estimating reads to use ... (to use all reads, set '--reduce-reads-for-coverage inf')
2020-06-05 05:02:38,882 - INFO: Estimating reads to use finished.
2020-06-05 05:02:38,882 - INFO: Unzipping reads file: SRR5803928_1.fastq.gz (41478203 bytes)
2020-06-05 05:02:39,681 - INFO: Unzipping reads file: SRR5803928_2.fastq.gz (42614236 bytes)
2020-06-05 05:02:40,518 - INFO: Counting read qualities ...
2020-06-05 05:02:40,656 - INFO: Identified quality encoding format = Illumina 1.8+
2020-06-05 05:02:40,656 - WARNING: Max quality score 'K'(75:42) in your fastq file exceeds the usual range (33, 74)
2020-06-05 05:02:40,657 - INFO: Trimming bases with qualities (0.00%): 33..33 !
2020-06-05 05:02:40,716 - INFO: Mean error rate = 0.0084
2020-06-05 05:02:40,717 - INFO: Counting read lengths ...
2020-06-05 05:02:41,936 - INFO: Mean = 150.0 bp, maximum = 150 bp.
2020-06-05 05:02:41,937 - INFO: Reads used = 500000+500000
2020-06-05 05:02:41,937 - INFO: Pre-reading fastq finished.
2020-06-05 05:02:41,937 - INFO: Making seed reads ...
2020-06-05 05:02:41,938 - INFO: Seed bowtie2 index existed!
2020-06-05 05:02:41,938 - INFO: Mapping reads to seed bowtie2 index ...
2020-06-05 05:03:01,648 - INFO: Mapping finished.
2020-06-05 05:03:01,648 - INFO: Seed reads made: SRR5803928-mitogenome/seed/fungus_mt.initial.fq (9946358 bytes)
2020-06-05 05:03:01,648 - INFO: Making seed reads finished.
2020-06-05 05:03:01,648 - INFO: Checking seed reads and parameters ...
2020-06-05 05:03:01,649 - INFO: The automatically-estimated parameter(s) do not ensure the best choice(s).
2020-06-05 05:03:01,649 - INFO: If the result graph is not a circular organelle genome,
2020-06-05 05:03:01,649 - INFO: you could adjust the value(s) of '-w'/'-R' for another new run.
2020-06-05 05:03:03,693 - INFO: Pre-assembling mapped reads ...
2020-06-05 05:03:08,260 - INFO: Pre-assembling mapped reads finished.
2020-06-05 05:03:08,261 - INFO: Estimated fungus_mt-hitting base-coverage = 209.44
2020-06-05 05:03:08,261 - INFO: Estimated word size(s): 92
2020-06-05 05:03:08,261 - INFO: Setting '-w 92'
2020-06-05 05:03:08,262 - INFO: Setting '--max-extending-len inf'
2020-06-05 05:03:08,315 - INFO: Checking seed reads and parameters finished.
2020-06-05 05:03:08,315 - INFO: Making read index ...
2020-06-05 05:03:14,750 - INFO: Mem 0.494 G, 871458 candidates in all 1000000 reads
2020-06-05 05:03:14,752 - INFO: Pre-grouping reads ...
2020-06-05 05:03:14,752 - INFO: Setting '--pre-w 92'
2020-06-05 05:03:14,834 - INFO: Mem 0.456 G, 101017/101017 used/duplicated
2020-06-05 05:03:25,807 - INFO: Mem 2.383 G, 7195 groups made.
2020-06-05 05:03:25,867 - INFO: Making read index finished.
2020-06-05 05:03:25,867 - INFO: Extending ...
2020-06-05 05:03:25,867 - INFO: Adding initial words ...
2020-06-05 05:03:26,691 - INFO: AW 464340
2020-06-05 05:03:33,315 - INFO: Round 1: 871458/871458 AI 48129 AW 849994 Mem 1.565
2020-06-05 05:03:38,938 - INFO: Round 2: 871458/871458 AI 49213 AW 885114 Mem 1.565
2020-06-05 05:03:44,642 - INFO: Round 3: 871458/871458 AI 49219 AW 885334 Mem 1.565
2020-06-05 05:03:50,276 - INFO: Round 4: 871458/871458 AI 49219 AW 885334 Mem 1.565
2020-06-05 05:03:50,276 - INFO: No more reads found and terminated ...
2020-06-05 05:03:51,827 - INFO: Extending finished.
2020-06-05 05:03:51,857 - INFO: Separating extended fastq file ...
2020-06-05 05:03:52,134 - INFO: Setting '-k 21,55,85,115'
2020-06-05 05:03:52,134 - INFO: Assembling using SPAdes ...
2020-06-05 05:03:52,151 - INFO: spades.py -t 1 -1 SRR5803928-mitogenome/extended_1_paired.fq -2 SRR5803928-mitogenome/extended_2_paired.fq --s1 SRR5803928-mitogenome/extended_1_unpaired.fq --s2 SRR5803928-mitogenome/extended_2_unpaired.fq -k 21,55,85,115 -o SRR5803928-mitogenome/extended_spades
2020-06-05 05:04:51,891 - INFO: Insert size = 362.691, deviation = 14.9387, left quantile = 344, right quantile = 382
2020-06-05 05:04:51,891 - INFO: Assembling finished.
2020-06-05 05:04:53,119 - INFO: Slimming SRR5803928-mitogenome/extended_spades/K115/assembly_graph.fastg finished!
2020-06-05 05:04:54,110 - INFO: Slimming SRR5803928-mitogenome/extended_spades/K85/assembly_graph.fastg finished!
2020-06-05 05:04:55,148 - INFO: Slimming SRR5803928-mitogenome/extended_spades/K55/assembly_graph.fastg finished!
2020-06-05 05:04:55,148 - INFO: Slimming assembly graphs finished.
2020-06-05 05:04:55,148 - INFO: Extracting fungus_mt from the assemblies ...
2020-06-05 05:04:55,149 - INFO: Disentangling SRR5803928-mitogenome/extended_spades/K115/assembly_graph.fastg.extend-fungus_mt.fastg as a circular genome ...
2020-06-05 05:04:55,154 - INFO: Average fungus_mt kmer-coverage = 67.3
2020-06-05 05:04:55,154 - INFO: Average fungus_mt base-coverage = 280.3
2020-06-05 05:04:55,154 - INFO: Writing output ...
2020-06-05 05:04:55,173 - INFO: Writing PATH1 of complete fungus_mt to SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.1.path_sequence.fasta
2020-06-05 05:04:55,173 - INFO: Writing GRAPH to SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.selected_graph.gfa
2020-06-05 05:04:55,298 - INFO: Writing GRAPH image to SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.selected_graph.png
2020-06-05 05:04:55,299 - INFO: Result status of fungus_mt: circular genome
2020-06-05 05:04:55,346 - INFO: Please check the produced assembly image or manually visualize SRR5803928-mitogenome/extended_K115.assembly_graph.fastg.extend-fungus_mt.fastg using Bandage to confirm the final result.
2020-06-05 05:04:55,347 - INFO: Writing output finished.
2020-06-05 05:04:55,347 - INFO: Extracting fungus_mt from the assemblies finished.
Total cost 137.70 s
Thank you!
-
extended_K115.assembly_graph.fastg
-
extended_K115.assembly_graph.fastg.extend-fungus_mt.fastg
-
extended_K115.assembly_graph.fastg.extend-fungus_mt.csv
EDGE database loci loci_gene_sequential loci_sequential details 34718 fungus_mt ai4,ai5,ai7,atp6,atp8,atp9,bi2,bi3,bi4,cob,cox1,cox2,cox3,cytb,nad1,nad2,nad3,nad4,nad4L,nad5,nad6,rrnL,rrnS cox3>>atp9>>nad4>>cox1>>ai7>>ai5>>ai4>>cox1>>atp6>>nad3>>nad2>>rrnS>>nad1>>nad6>>rrnL>>cob>>cytb>>bi4>>bi3>>bi2>>cox2>>atp8>>nad5>>nad4L>>nad4>>34718 cox3>>atp9>>nad4>>cox1>>ai7>>ai5>>ai4>>cox1>>atp6>>nad3>>nad2>>rrnS>>nad1>>nad6>>rrnL>>cob>>cytb>>bi4>>bi3>>bi2>>cox2>>atp8>>nad5>>nad4L>>nad4>>34718 cox3(170-978,fungus_mt)>>atp9(2991-3212,fungus_mt)>>nad4(5057-6508,fungus_mt)>>cox1(7597-8905,fungus_mt)>>ai7(8079-8905,fungus_mt)>>ai5(8228-8905,fungus_mt)>>ai4(8323-8905,fungus_mt)>>cox1(11551-11829,fungus_mt)>>atp6(12338-13108,fungus_mt)>>nad3(17498-17872,fungus_mt)>>nad2(17872-19521,fungus_mt)>>rrnS(20417-22232,fungus_mt)>>nad1(22984-24000,fungus_mt)>>nad6(25463-26074,fungus_mt)>>rrnL(29778-32912,fungus_mt)>>cob(36281-37435,fungus_mt)>>cytb(36551-37411,fungus_mt)>>bi4(36689-37355,fungus_mt)>>bi3(36928-37355,fungus_mt)>>bi2(37066-37355,fungus_mt)>>cox2(38347-39095,fungus_mt)>>atp8(39561-39713,fungus_mt)>>nad5(39970-41964,fungus_mt)>>nad4L(41970-42236,fungus_mt)>>nad4(41985-42236,fungus_mt)>>34718
note that the loci information is based on blast-hit rather than thorough annotation.
-
fungus_mt.K115.complete.graph1.1.path_sequence.fasta assembled genome sequence
>34718+(circular) ATTATTTATAAAGAAATTAATAAACAAAAATTAAAAAGGGATAACAATT .. (42,366 bp)
-
fungus_mt.K115.complete.graph1.selected_graph.gfa assembly graph
S 34718 AAAAAAAAAAAAACATATAAAATATTACTTGCTAATTACA .. (42,481 bp, including a 115-bp overlap) L 34718 - 34718 - 115M
-
fungus_mt.K115.complete.graph1.selected_graph.png
This file will be generated when
Bandage
was added to the $PATH (optional). Bandage-visualized
Besides, you will find lots of temporary files that you can delete them after a successful assembly
- seed subfolder of seed reads and parameter estimation temp files
- fungus_mt.initial.sam mapping alignment
- fungus_mt.initial.fq seed reads used for read extending
- fungus_mt.initial.fq.spades used for parameter estimation, draft assembly
- extended_1_paired.fq target organelle associated reads - paired forward
- extended_1_unpaired.fq target organelle associated reads - unpaired forward
- extended_2_paired.fq target organelle associated reads - paired reverse
- extended_2_unpaired.fq target organelle associated reads - unpaired reverse
- extended_spades subfolder of SPAdes assemblies
- scaffolds.paths
- assembly_graph_with_scaffolds.gfa
- assembly_graph.fastg
- spades.log**
- K115
- assembly_graph.fastg.extend-fungus_mt.fastg
- assembly_graph.fastg.extend-fungus_mt.csv
- (other files)
- K85
- assembly_graph.fastg.extend-fungus_mt.fastg
- assembly_graph.fastg.extend-fungus_mt.csv
- (other files)
- K55
- assembly_graph.fastg.extend-fungus_mt.fastg
- assembly_graph.fastg.extend-fungus_mt.csv
- (other files)
- (other files) K21, input_dataset.yaml, contigs.fasta, scaffolds.fasta, etc.