Update README.md
GaetanBenoitDev authored Jul 31, 2024
1 parent 50042c3 commit 377a57a
Showing 1 changed file with 29 additions and 41 deletions.
70 changes: 29 additions & 41 deletions README.md
@@ -73,86 +73,74 @@ make -j 3
## Usage

```sh
./metaMDBG asm outputDir reads... {OPTIONS}

outputDir # Output dir for contigs and temporary files
reads... # Read filename(s) (separated by space)
-t # Number of cores [3]

Examples:
./metaMDBG asm ./path/to/assemblyDir reads.fastq.gz -t 4 #single-sample assembly
./metaMDBG asm ./path/to/assemblyDir reads_A.fastq.gz reads_B.fastq.gz reads_C.fastq.gz -t 4 #co-assembly
Usage: metaMDBG asm {OPTIONS}

Basic options:
--out-dir Output dir for contigs and temporary files
--in-hifi PacBio HiFi read filename(s) (separated by space)
--in-ont Nanopore R10.4+ read filename(s) (separated by space)
--threads Number of cores [1]

# Nanopore assembly
metaMDBG asm --out-dir ./outputDir/ --in-ont reads.fastq.gz --threads 4
# HiFi assembly
metaMDBG asm --out-dir ./outputDir/ --in-hifi reads.fastq.gz --threads 4
# Multiple sample co-assembly
metaMDBG asm --out-dir ./outputDir/ --in-ont reads_A.fastq.gz reads_B.fastq.gz reads_C.fastq.gz --threads 4
```

MetaMDBG will generate polished contigs in outputDir ("contigs.fasta.gz").

## Input data (PacBio and Nanopore)

- MetaMDBG has been developed and extensively tested using **PacBio HiFi** data.
- MetaMDBG will not work on raw **Nanopore** reads. Error rates are improving quickly, so it might work on duplex data in the future. Currently, you have to polish the reads first, for example with [VeChat](https://github.com/HaploKit/vechat) (using Nanopore reads only) or [Ratatosk](https://github.com/DecodeGenetics/Ratatosk) (using Nanopore + Illumina short reads).

## Contig information
Contig information, such as whether a contig is circular or not, is contained in the contig headers of the resulting assembly file.
Examples:

```sh
>ctg1_13x_l
>ctg112 length=7013 coverage=6 circular=yes
ACGTAGCTTATAGCGAGTATCG...
>ctg2_678x_c
>ctg37 length=1988 coverage=3 circular=no
ATTATTGATTAGGGCTATGCAT...
>ctg3_14x_rc
>ctg82 length=3824 coverage=13 circular=no
AATTCCGGCGGCGTATTATTAC...
```
Headers are composed of 3 fields separated by underscores.
* Field 1: the name of the contig
* Field 2: estimated coverage for this contig (obtained through read mapping)
* Field 3: can be "l" (linear), "c" (circular) or "rc" (rescued circular)

Long circular contigs are likely to be complete. Rescued circular contigs are likely to be complete as well, but this is not guaranteed, so we recommend validating them.
Headers are composed of several fields separated by spaces.
* **ctgID**: the name of the contig
* **length**: the length of the contig in bp
* **coverage**: an estimated read coverage for the contig
* **circular**: whether the contig is circular (yes or no)
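
The space-separated `key=value` layout can be filtered with standard tools. The following is a minimal sketch, assuming headers follow the `>ctgID length=... coverage=... circular=yes|no` layout shown above; the file name `example_contigs.fasta` and the variable `circular_contigs` are made up for illustration, and the sequences are placeholders.

```sh
# Build a tiny example file using the header layout shown above
# (hypothetical file name, placeholder sequences).
cat > example_contigs.fasta <<'EOF'
>ctg112 length=7013 coverage=6 circular=yes
ACGTAGCTTATAGCGAGTATCG
>ctg37 length=1988 coverage=3 circular=no
ATTATTGATTAGGGCTATGCAT
>ctg82 length=3824 coverage=13 circular=no
AATTCCGGCGGCGTATTATTAC
EOF

# Print the ID and length (tab-separated) of every contig flagged circular=yes.
circular_contigs=$(awk '/^>/ {
    id = substr($1, 2)              # strip the leading ">"
    len = ""; circ = ""
    for (i = 2; i <= NF; i++) {     # remaining fields are key=value pairs
        split($i, kv, "=")
        if (kv[1] == "length")   len  = kv[2]
        if (kv[1] == "circular") circ = kv[2]
    }
    if (circ == "yes") print id "\t" len
}' example_contigs.fasta)
echo "$circular_contigs"
```

On a real assembly, the same `awk` program could read from `gzip -dc ./outputDir/contigs.fasta.gz` instead of the example file.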

## Advanced usage

```sh
# Set minimizer length to 16 and use only 0.2% of total k-mers for assembly.
./metaMDBG asm ./outputDir reads.fastq.gz -k 16 -d 0.002

# Stop assembly when reaching a k-mer length of 5000 bps.
./metaMDBG asm ./outputDir reads.fastq.gz -m 5000
```

## Generating an assembly graph

After a successful run of metaMDBG, an assembly graph (.gfa) can be generated with the following command.
```sh
./metaMDBG gfa assemblyDir k --contigpath --readpath
metaMDBG gfa --assembly-dir ./assemblyDir/ --k 21 --contigpath --readpath --threads 4
```

Assembly dir must be a metaMDBG output dir (the one containing the contig file "contigs.fasta.gz"). The k parameter corresponds to the level of resolution of the graph: lower k values will produce graphs with high connectivity but shorter unitigs, while higher k values will produce more fragmented graphs with longer unitigs. The two optional parameters --contigpath and --readpath generate the paths of contigs and reads in the graph, respectively.
Assembly dir must be a metaMDBG output dir (the one containing the contig file "contigs.fasta.gz"). The --k parameter corresponds to the level of resolution of the graph: lower k values will produce graphs with high connectivity but shorter unitigs, while higher k values will produce more fragmented graphs with longer unitigs. The two optional parameters --contigpath and --readpath generate the paths of contigs and reads in the graph, respectively.

First, display the available k values and their corresponding sequence lengths in bp (these sequence lengths are equivalent to the k-mer size that would be used in a traditional de Bruijn graph).
```sh
./metaMDBG gfa ./assemblyDir 0
metaMDBG gfa --assembly-dir ./assemblyDir/ --k 0
```

Then, choose a k value and produce the graph (optionally add parameters --contigpath and/or --readpath).
```sh
./metaMDBG gfa ./assemblyDir 21
metaMDBG gfa --assembly-dir ./assemblyDir/ --k 21
```

MetaMDBG will generate the assembly graph in the GFA format in assemblyDir (e.g. "assemblyGraph_k21_4013bps.gfa").

Note 1) Unitig sequences in the gfa file are not polished; they have the same error rate as the original reads. Note 2) Generating the unitig sequences requires a pass over the original reads that produced the assembly. If you have moved the original readsets, you will need to edit the file ./assemblyDir/tmp/input.txt with the new paths.
Note 1) Unitig sequences in the gfa file are not polished; they have the same error rate as the original reads. Note 2) Generating the unitig sequences requires a pass over the original reads that produced the assembly. If you have moved the original readsets, you will need to edit the file ./assemblyDir/tmp/input.txt with the new paths. Note 3) In nanopore mode, the read paths are not very accurate because of the high error rate; we recommend using a dedicated aligner instead, such as GraphAligner.
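
A quick sanity check on a generated graph is to count its unitigs and links. The sketch below assumes the file follows the GFA 1 layout (tab-separated records, `S` lines carrying the sequence in the third column, `L` lines describing links); the file `example.gfa` and its contents are invented for illustration and are not metaMDBG output.

```sh
# Write a tiny hypothetical GFA 1 file: one header, two segments, one link.
printf 'H\tVN:Z:1.0\nS\tutg1\tACGT\nS\tutg2\tGGCA\nL\tutg1\t+\tutg2\t+\t0M\n' > example.gfa

# Count segments (S) and links (L), and sum the segment sequence lengths.
gfa_summary=$(awk -F'\t' '
    $1 == "S" { segments++; total_bp += length($3) }
    $1 == "L" { links++ }
    END { printf "segments=%d links=%d total_bp=%d", segments, links, total_bp }
' example.gfa)
echo "$gfa_summary"
```

Pointing the same `awk` program at a graph such as "assemblyGraph_k21_4013bps.gfa" gives a rough idea of how fragmented the graph is at the chosen k.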

## Low-memory contig polisher
The metaMDBG contig polisher can be used on any set of contigs. You may be interested in this standalone tool if you have memory issues with existing correction software. Note that the correction method is the same as [Racon](https://github.com/isovic/racon).
```sh
./metaMDBG polish contigs tmpDir reads...

Examples:
./metaMDBG polish assembly.fasta.gz ./tmpDir reads.fastq.gz -t 4 #Basic usage
./metaMDBG polish assembly.fasta.gz ./tmpDir reads_1.fastq.gz reads_2.fastq.gz -t 4 #Multiple read sets
./metaMDBG polish assembly.fasta.gz ./tmpDir reads_1.fastq.gz reads_2.fastq.gz -t 4 -n 20 #Change maximum read coverage used for correction (here 20x)
```

## Results

Assembly quality and performance on three HiFi PacBio metagenomics samples (using 16 cores).
@@ -163,7 +151,7 @@ Assembly quality and performances on three HiFi PacBio metagenomics samples (usi
| Anaerobic Digester | ERR10905742 | 64.7 | 13 | 7 | 62 | 130 |
| Sheep rumen | SRR14289618 | 206.4 | 108 | 22 | 266 | 447 |

Near-complete: ≥95% completeness and ≤5% contamination (assessed by checkM). Binning was performed with metabat2.
Near-complete: ≥90% completeness and ≤5% contamination (assessed by checkM). Binning was performed with metabat2.

## License
