Example 5

Kinggerm edited this page Apr 6, 2021 · 18 revisions

We will demonstrate fungal mitogenome assembly using a real public dataset, SRR5803928. It is a WGS dataset of Lactarius trivialis, a common Basidiomycota fungus species.

Let's download 150M bases of reads from the NCBI Sequence Read Archive (SRA). You have to install sra-tools (https://github.com/ncbi/sra-tools) before you can use fastq-dump.

fastq-dump --origfmt --split-files -X 500000 --gzip SRR5803928
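The 150M figure can be reproduced from the options of the fastq-dump call above. The short sketch below assumes paired-end reads of 150 bp each, which is the read length reported in the log further down this page; the variable names are illustrative only.

```shell
# Expected yield of the fastq-dump call above.
spots=500000      # -X 500000: maximum number of spots (read pairs) to dump
mates=2           # --split-files: forward and reverse mates go to separate files
read_len=150      # assumption: read length in bp, as reported in get_org.log.txt

total_bases=$(( spots * mates * read_len ))
echo "$total_bases"   # prints 150000000, i.e. 150M bases
```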

Alternatively, you can download a reduced dataset from GetOrganelleGallery using wget:

wget https://github.com/Kinggerm/GetOrganelleGallery/raw/master/Test/reads/SRR5803928_1.fastq.gz
wget https://github.com/Kinggerm/GetOrganelleGallery/raw/master/Test/reads/SRR5803928_2.fastq.gz
# use gitlab if above links fail, e.g. https://gitlab.com/Kinggerm/GetOrganelleGallery/-/raw/master/Test/reads/SRR5201683_1.fq.gz

# then check the integrity of the downloaded files using md5sum:
md5sum SRR5803928*.fastq.gz
# 34ffbeb14d0b824b98b04929e828d3f9  SRR5803928_1.fastq.gz
# d9228260fc3516de38284b252a26ead7  SRR5803928_2.fastq.gz
# Please re-download the reads if the md5 values do not match
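If you prefer not to compare the digests by eye, the check can be scripted. The following is a minimal sketch, not part of GetOrganelle: `verify_md5` is a hypothetical helper name, and it assumes that `md5sum` (GNU coreutils) is available, as in the commands above.

```shell
# verify_md5 FILE EXPECTED_MD5
# Returns 0 when the md5 digest of FILE equals EXPECTED_MD5, 1 otherwise.
verify_md5() {
    if [ ! -f "$1" ]; then
        echo "missing file: $1" >&2
        return 1
    fi
    # md5sum prints "<digest>  <file name>"; keep only the digest.
    actual_md5=$(md5sum "$1" | cut -d ' ' -f 1)
    if [ "$actual_md5" = "$2" ]; then
        echo "OK        $1"
    else
        echo "MISMATCH  $1 (got $actual_md5); please re-download" >&2
        return 1
    fi
}

# Usage with the digests listed above:
#   verify_md5 SRR5803928_1.fastq.gz 34ffbeb14d0b824b98b04929e828d3f9
#   verify_md5 SRR5803928_2.fastq.gz d9228260fc3516de38284b252a26ead7
```

Because the function returns a non-zero status on a mismatch, it can be chained with `&&` in front of the assembly command so that the run only starts on intact files.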

Run

Let's conduct the assembly (Memory: ~2.5G; Duration: ~140 sec):

get_organelle_from_reads.py -1 SRR5803928_1.fastq.gz -2 SRR5803928_2.fastq.gz -F fungus_mt -o SRR5803928-mitogenome
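Before launching the run, it may help to confirm that the script and its dependencies can be found in the $PATH. The sketch below is a generic pre-flight check; `require_tools` is a hypothetical helper, and the executable names in the usage comment (`bowtie2`, `spades.py`, `blastn`) are assumptions based on the dependencies listed in the log, so adjust them to your installation.

```shell
# require_tools NAME...
# Reports every executable that is absent from $PATH; returns 1 if any is missing.
require_tools() {
    missing_count=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found in PATH: $tool" >&2
            missing_count=$(( missing_count + 1 ))
        fi
    done
    [ "$missing_count" -eq 0 ]
}

# Usage (executable names are assumptions, see above):
#   require_tools get_organelle_from_reads.py bowtie2 spades.py blastn &&
#       get_organelle_from_reads.py -1 SRR5803928_1.fastq.gz -2 SRR5803928_2.fastq.gz \
#           -F fungus_mt -o SRR5803928-mitogenome
```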

Output

The run should finish with the following main output files in the output directory SRR5803928-mitogenome:

  • get_org.log.txt      the log file

    GetOrganelle v1.7.0-beta5 # file names changed according to 1.7.4-pre

    get_organelle_from_reads.py assembles organelle genomes from genome skimming data.
    Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.

    Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) [GCC 7.3.0]
    PYTHON LIBS: numpy 1.18.1; sympy 1.5.1; scipy 1.4.1; psutil 5.7.0
    DEPENDENCIES: Bowtie2 2.3.5.1; SPAdes 3.12.0; Blast 2.9.0; Bandage 0.8.1
    SEED DB: fungus_mt 0.0.0
    LABEL DB: fungus_mt 0.0.0
    WORKING DIR: /home/data1
    /root/.pyenv/versions/miniconda3-4.3.30/bin/get_organelle_from_reads.py -1 SRR5803928_1.fastq.gz -2 SRR5803928_2.fastq.gz -F fungus_mt -o SRR5803928-mitogenome

    2020-06-05 05:02:38,138 - INFO: Pre-reading fastq ...
    2020-06-05 05:02:38,138 - INFO: Estimating reads to use ... (to use all reads, set '--reduce-reads-for-coverage inf')
    2020-06-05 05:02:38,882 - INFO: Estimating reads to use finished.
    2020-06-05 05:02:38,882 - INFO: Unzipping reads file: SRR5803928_1.fastq.gz (41478203 bytes)
    2020-06-05 05:02:39,681 - INFO: Unzipping reads file: SRR5803928_2.fastq.gz (42614236 bytes)
    2020-06-05 05:02:40,518 - INFO: Counting read qualities ...
    2020-06-05 05:02:40,656 - INFO: Identified quality encoding format = Illumina 1.8+
    2020-06-05 05:02:40,656 - WARNING: Max quality score 'K'(75:42) in your fastq file exceeds the usual range (33, 74)
    2020-06-05 05:02:40,657 - INFO: Trimming bases with qualities (0.00%): 33..33 !
    2020-06-05 05:02:40,716 - INFO: Mean error rate = 0.0084
    2020-06-05 05:02:40,717 - INFO: Counting read lengths ...
    2020-06-05 05:02:41,936 - INFO: Mean = 150.0 bp, maximum = 150 bp.
    2020-06-05 05:02:41,937 - INFO: Reads used = 500000+500000
    2020-06-05 05:02:41,937 - INFO: Pre-reading fastq finished.

    2020-06-05 05:02:41,937 - INFO: Making seed reads ...
    2020-06-05 05:02:41,938 - INFO: Seed bowtie2 index existed!
    2020-06-05 05:02:41,938 - INFO: Mapping reads to seed bowtie2 index ...
    2020-06-05 05:03:01,648 - INFO: Mapping finished.
    2020-06-05 05:03:01,648 - INFO: Seed reads made: SRR5803928-mitogenome/seed/fungus_mt.initial.fq (9946358 bytes)
    2020-06-05 05:03:01,648 - INFO: Making seed reads finished.

    2020-06-05 05:03:01,648 - INFO: Checking seed reads and parameters ...
    2020-06-05 05:03:01,649 - INFO: The automatically-estimated parameter(s) do not ensure the best choice(s).
    2020-06-05 05:03:01,649 - INFO: If the result graph is not a circular organelle genome,
    2020-06-05 05:03:01,649 - INFO: you could adjust the value(s) of '-w'/'-R' for another new run.
    2020-06-05 05:03:03,693 - INFO: Pre-assembling mapped reads ...
    2020-06-05 05:03:08,260 - INFO: Pre-assembling mapped reads finished.
    2020-06-05 05:03:08,261 - INFO: Estimated fungus_mt-hitting base-coverage = 209.44
    2020-06-05 05:03:08,261 - INFO: Estimated word size(s): 92
    2020-06-05 05:03:08,261 - INFO: Setting '-w 92'
    2020-06-05 05:03:08,262 - INFO: Setting '--max-extending-len inf'
    2020-06-05 05:03:08,315 - INFO: Checking seed reads and parameters finished.

    2020-06-05 05:03:08,315 - INFO: Making read index ...
    2020-06-05 05:03:14,750 - INFO: Mem 0.494 G, 871458 candidates in all 1000000 reads
    2020-06-05 05:03:14,752 - INFO: Pre-grouping reads ...
    2020-06-05 05:03:14,752 - INFO: Setting '--pre-w 92'
    2020-06-05 05:03:14,834 - INFO: Mem 0.456 G, 101017/101017 used/duplicated
    2020-06-05 05:03:25,807 - INFO: Mem 2.383 G, 7195 groups made.
    2020-06-05 05:03:25,867 - INFO: Making read index finished.

    2020-06-05 05:03:25,867 - INFO: Extending ...
    2020-06-05 05:03:25,867 - INFO: Adding initial words ...
    2020-06-05 05:03:26,691 - INFO: AW 464340
    2020-06-05 05:03:33,315 - INFO: Round 1: 871458/871458 AI 48129 AW 849994 Mem 1.565
    2020-06-05 05:03:38,938 - INFO: Round 2: 871458/871458 AI 49213 AW 885114 Mem 1.565
    2020-06-05 05:03:44,642 - INFO: Round 3: 871458/871458 AI 49219 AW 885334 Mem 1.565
    2020-06-05 05:03:50,276 - INFO: Round 4: 871458/871458 AI 49219 AW 885334 Mem 1.565
    2020-06-05 05:03:50,276 - INFO: No more reads found and terminated ...
    2020-06-05 05:03:51,827 - INFO: Extending finished.

    2020-06-05 05:03:51,857 - INFO: Separating extended fastq file ...
    2020-06-05 05:03:52,134 - INFO: Setting '-k 21,55,85,115'
    2020-06-05 05:03:52,134 - INFO: Assembling using SPAdes ...
    2020-06-05 05:03:52,151 - INFO: spades.py -t 1 -1 SRR5803928-mitogenome/extended_1_paired.fq -2 SRR5803928-mitogenome/extended_2_paired.fq --s1 SRR5803928-mitogenome/extended_1_unpaired.fq --s2 SRR5803928-mitogenome/extended_2_unpaired.fq -k 21,55,85,115 -o SRR5803928-mitogenome/extended_spades
    2020-06-05 05:04:51,891 - INFO: Insert size = 362.691, deviation = 14.9387, left quantile = 344, right quantile = 382
    2020-06-05 05:04:51,891 - INFO: Assembling finished.

    2020-06-05 05:04:53,119 - INFO: Slimming SRR5803928-mitogenome/extended_spades/K115/assembly_graph.fastg finished!
    2020-06-05 05:04:54,110 - INFO: Slimming SRR5803928-mitogenome/extended_spades/K85/assembly_graph.fastg finished!
    2020-06-05 05:04:55,148 - INFO: Slimming SRR5803928-mitogenome/extended_spades/K55/assembly_graph.fastg finished!
    2020-06-05 05:04:55,148 - INFO: Slimming assembly graphs finished.

    2020-06-05 05:04:55,148 - INFO: Extracting fungus_mt from the assemblies ...
    2020-06-05 05:04:55,149 - INFO: Disentangling SRR5803928-mitogenome/extended_spades/K115/assembly_graph.fastg.extend-fungus_mt.fastg as a circular genome ...
    2020-06-05 05:04:55,154 - INFO: Average fungus_mt kmer-coverage = 67.3
    2020-06-05 05:04:55,154 - INFO: Average fungus_mt base-coverage = 280.3
    2020-06-05 05:04:55,154 - INFO: Writing output ...
    2020-06-05 05:04:55,173 - INFO: Writing PATH1 of complete fungus_mt to SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.1.path_sequence.fasta
    2020-06-05 05:04:55,173 - INFO: Writing GRAPH to SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.selected_graph.gfa
    2020-06-05 05:04:55,298 - INFO: Writing GRAPH image to SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.selected_graph.png
    2020-06-05 05:04:55,299 - INFO: Result status of fungus_mt: circular genome
    2020-06-05 05:04:55,346 - INFO: Please check the produced assembly image or manually visualize SRR5803928-mitogenome/extended_K115.assembly_graph.fastg.extend-fungus_mt.fastg using Bandage to confirm the final result.
    2020-06-05 05:04:55,347 - INFO: Writing output finished.
    2020-06-05 05:04:55,347 - INFO: Extracting fungus_mt from the assemblies finished.


    Total cost 137.70 s
    Thank you!
  • extended_K115.assembly_graph.fastg     

  • extended_K115.assembly_graph.fastg.extend-fungus_mt.fastg     

  • extended_K115.assembly_graph.fastg.extend-fungus_mt.csv     

    EDGE	database	loci	loci_gene_sequential	loci_sequential	details
    34718	fungus_mt	ai4,ai5,ai7,atp6,atp8,atp9,bi2,bi3,bi4,cob,cox1,cox2,cox3,cytb,nad1,nad2,nad3,nad4,nad4L,nad5,nad6,rrnL,rrnS	cox3>>atp9>>nad4>>cox1>>ai7>>ai5>>ai4>>cox1>>atp6>>nad3>>nad2>>rrnS>>nad1>>nad6>>rrnL>>cob>>cytb>>bi4>>bi3>>bi2>>cox2>>atp8>>nad5>>nad4L>>nad4>>34718	cox3>>atp9>>nad4>>cox1>>ai7>>ai5>>ai4>>cox1>>atp6>>nad3>>nad2>>rrnS>>nad1>>nad6>>rrnL>>cob>>cytb>>bi4>>bi3>>bi2>>cox2>>atp8>>nad5>>nad4L>>nad4>>34718	cox3(170-978,fungus_mt)>>atp9(2991-3212,fungus_mt)>>nad4(5057-6508,fungus_mt)>>cox1(7597-8905,fungus_mt)>>ai7(8079-8905,fungus_mt)>>ai5(8228-8905,fungus_mt)>>ai4(8323-8905,fungus_mt)>>cox1(11551-11829,fungus_mt)>>atp6(12338-13108,fungus_mt)>>nad3(17498-17872,fungus_mt)>>nad2(17872-19521,fungus_mt)>>rrnS(20417-22232,fungus_mt)>>nad1(22984-24000,fungus_mt)>>nad6(25463-26074,fungus_mt)>>rrnL(29778-32912,fungus_mt)>>cob(36281-37435,fungus_mt)>>cytb(36551-37411,fungus_mt)>>bi4(36689-37355,fungus_mt)>>bi3(36928-37355,fungus_mt)>>bi2(37066-37355,fungus_mt)>>cox2(38347-39095,fungus_mt)>>atp8(39561-39713,fungus_mt)>>nad5(39970-41964,fungus_mt)>>nad4L(41970-42236,fungus_mt)>>nad4(41985-42236,fungus_mt)>>34718
    

    Note that the loci information is based on BLAST hits rather than on thorough annotation.

  • fungus_mt.K115.complete.graph1.1.path_sequence.fasta      assembled genome sequence

    >34718+(circular)
    ATTATTTATAAAGAAATTAATAAACAAAAATTAAAAAGGGATAACAATT .. (42,366 bp)
    
  • fungus_mt.K115.complete.graph1.selected_graph.gfa      assembly graph

    S	34718	AAAAAAAAAAAAACATATAAAATATTACTTGCTAATTACA .. (42,481 bp, including a 115-bp overlap)
    L	34718	-	34718	-	115M
    
  • fungus_mt.K115.complete.graph1.selected_graph.png     

    This file is generated only if Bandage is available in the $PATH (optional).
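The key facts in the outputs listed above can also be checked from the command line. The sketch below uses three hypothetical helpers built on standard `sed` and `awk`: one extracts the "Result status" line from the log, one reports sequence lengths in a FASTA file, and one derives the circular length from a GFA file. The last helper assumes the simple case shown here, a single segment linked to itself, where the circular length is the segment length minus the overlap (42,481 - 115 = 42,366 bp).

```shell
# result_status LOG: print the text after "Result status of <type>: ".
result_status() {
    sed -n 's/.*Result status of [^:]*: //p' "$1"
}

# fasta_lengths FASTA: print "<name><TAB><length>" for every record.
fasta_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($0, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

# gfa_circular_length GFA: segment length minus the self-link overlap.
# Assumption: exactly one S line and one L line with an overlap such as "115M".
gfa_circular_length() {
    awk -F '\t' '$1 == "S" { seg_len = length($3) }
                 $1 == "L" { overlap = $6; sub(/M$/, "", overlap) }
                 END { print seg_len - overlap }' "$1"
}

# Usage on the files of this example:
#   result_status SRR5803928-mitogenome/get_org.log.txt
#   fasta_lengths SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.1.path_sequence.fasta
#   gfa_circular_length SRR5803928-mitogenome/fungus_mt.K115.complete.graph1.selected_graph.gfa
```

For more complex graphs with several segments, this arithmetic no longer applies, and the graph should be inspected in Bandage as the log recommends.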

Besides, you will find many temporary files, which you can delete after a successful assembly:
  • seed      subfolder of seed reads and parameter estimation temp files
    • fungus_mt.initial.sam      mapping alignment
    • fungus_mt.initial.fq      seed reads used for read extending
    • fungus_mt.initial.fq.spades      used for parameter estimation, draft assembly
  • extended_1_paired.fq      target organelle associated reads - paired forward
  • extended_1_unpaired.fq      target organelle associated reads - unpaired forward
  • extended_2_paired.fq      target organelle associated reads - paired reverse
  • extended_2_unpaired.fq      target organelle associated reads - unpaired reverse
  • extended_spades      subfolder of SPAdes assemblies
    • scaffolds.paths
    • assembly_graph_with_scaffolds.gfa
    • assembly_graph.fastg
    • spades.log
    • K115
      • assembly_graph.fastg.extend-fungus_mt.fastg
      • assembly_graph.fastg.extend-fungus_mt.csv
      • (other files)
    • K85
      • assembly_graph.fastg.extend-fungus_mt.fastg
      • assembly_graph.fastg.extend-fungus_mt.csv
      • (other files)
    • K55
      • assembly_graph.fastg.extend-fungus_mt.fastg
      • assembly_graph.fastg.extend-fungus_mt.csv
      • (other files)
    • (other files)      K21, input_dataset.yaml, contigs.fasta, scaffolds.fasta, etc.
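The temporary files listed above can be removed with a small script once the assembly has succeeded. The following is a cautious sketch rather than an official GetOrganelle command: `clean_tmp_files` is a hypothetical helper, and it treats the presence of a `*.path_sequence.fasta` file as the sign of a successful run, which is an assumption.

```shell
# clean_tmp_files OUTDIR
# Deletes the seed and extended_spades subfolders and the extended_*.fq reads,
# but only if OUTDIR already contains a final *.path_sequence.fasta file.
clean_tmp_files() {
    outdir=$1
    if [ -z "$outdir" ] || [ ! -d "$outdir" ]; then
        echo "not a directory: $outdir" >&2
        return 1
    fi
    if ! ls "$outdir"/*.path_sequence.fasta >/dev/null 2>&1; then
        echo "no *.path_sequence.fasta in $outdir; nothing deleted" >&2
        return 1
    fi
    # Note: this also removes the K85 and K55 graphs inside extended_spades.
    rm -rf "$outdir/seed" "$outdir/extended_spades"
    rm -f "$outdir"/extended_*_paired.fq "$outdir"/extended_*_unpaired.fq
}

# Usage:
#   clean_tmp_files SRR5803928-mitogenome
```

Keep the temporary files if you plan to inspect the other k-mer graphs or to rerun the later steps of the pipeline.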