Commit 990ac5a

Update README.md
1 parent f46aaac commit 990ac5a

1 file changed

+20
-16
lines changed

README.md

Lines changed: 20 additions & 16 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ mg_classifier assigns a taxonomy to 16S or 18S sequences produced by PCR amplifi
88

99
### Dependencies
1010
All necessary scripts are included in the `scripts/` subfolder:
11-
- To cluster and classify sequences it relies on [vsearch](https://github.com/torognes/vsearch).
11+
- To cluster and classify sequences it relies on [vsearch](https://github.com/torognes/vsearch) 2.5.0.
1212
- If you are processing many samples (the usual case), the script also needs a small Perl script called `compile_classifications.pl`.
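
The version requirement above can be checked before running anything. The following is a minimal sketch, not part of mg_classifier: the helper names `version_ge` and `extract_version` are hypothetical, and it assumes that `vsearch --version` prints a banner whose first line contains a token such as `v2.5.0`.

```shell
#!/bin/sh
# Return success (0) when version $1 is greater than or equal to version $2.
# Versions are compared field by field as MAJOR.MINOR.PATCH integers.
version_ge() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        split(have, h, "."); split(need, n, ".")
        for (i = 1; i <= 3; i++) {
            if ((h[i] + 0) > (n[i] + 0)) exit 0
            if ((h[i] + 0) < (n[i] + 0)) exit 1
        }
        exit 0
    }'
}

# Pull the first MAJOR.MINOR.PATCH token out of a line of text.
extract_version() {
    printf '%s\n' "$1" |
        sed -n 's/^[^0-9]*\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p'
}

# Only query vsearch when it is actually installed.
if command -v vsearch >/dev/null 2>&1; then
    # Assumption: the banner may go to stdout or stderr, so both are captured.
    banner=$(vsearch --version 2>&1 | head -n 1)
    found=$(extract_version "$banner")
    if version_ge "$found" 2.5.0; then
        echo "vsearch $found is recent enough"
    else
        echo "vsearch $found is older than 2.5.0" >&2
    fi
fi
```

If the banner format differs on your system, only `extract_version` should need adjusting.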
1313

1414
### Download and install
@@ -20,7 +20,7 @@ The best way to get mg_classifier is to clone this repository directly to your l
2020

2121
The script will download a big file (~445 MB) with the databases into the appropriate folder and uncompress it. It will also make symbolic links (in `/usr/local/bin`) to the necessary scripts.
2222

23-
For the fast processing of data, mg_classifier has to have the database formatted in a certain way.
23+
For the fast processing of data, mg_classifier has to have the database formatted in a certain way and preferably converted to udb-type.
2424

2525
### Database format
2626
The databases have to have the following format:
@@ -29,7 +29,7 @@ The databases have to have the following format:
2929
>accession:domain;phylum;class;order;family;genus;species
3030
agtcgggcttaggtaaaaa
3131
```
32-
We have preformated some databases that are publicly available and can be also downloaded from [figshare](https://figshare.com/account/home#/projects/20254).
32+
We have preformatted some databases that are publicly available and can also be downloaded from [figshare](https://figshare.com/account/home#/projects/20254). The databases are provided both in fasta format and in udb format, the latter for faster performance. A udb file is a database file that contains sequences and a k-mer index for those sequences.
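
Before using a custom database, it may help to confirm that every header follows the `>accession:domain;phylum;class;order;family;genus;species` layout. The sketch below is an illustration only: the function name `check_headers` is hypothetical, and it assumes the accession itself contains no colon.

```shell
#!/bin/sh
# Print every FASTA header that lacks an accession followed by exactly seven
# semicolon-separated ranks; exit status is 1 if any such header is found.
check_headers() {
    awk '/^>/ {
        line = substr($0, 2)                 # drop the leading ">"
        colon = index(line, ":")             # first colon ends the accession
        ranks = (colon > 0) ? split(substr(line, colon + 1), parts, ";") : 0
        if (colon < 2 || ranks != 7) { print NR ": " $0; bad++ }
    }
    END { exit (bad > 0) }' "$1"
}
```

Calling `check_headers my_database.fasta` (a placeholder file name) prints the line number and text of each offending header and returns a non-zero status if there are any.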
3333

3434
16S rRNA
3535
- [SILVA](10.6084/m9.figshare.4814062) ver. 128 (see original [source](https://www.arb-silva.de))
@@ -39,7 +39,9 @@ We have preformated some databases that are publicly available and can be also d
3939
- [Protist](10.6084/m9.figshare.4814056) PR2 ver. 4.5 (see original [source](https://figshare.com/articles/PR2_rRNA_gene_database/3803709))
4040
- SILVA ver. 128 (only eukaryotic sequences, see original [source](https://www.arb-silva.de))
4141

42-
We recommend getting the [EzBioCloud](http://www.ezbiocloud.net/resources/pipelines) curated database, but since it is not publicly available (although it is free for academia), we cannot distributed it. If you get it, then you´ll have to formatted accordingly. You can use our script [db_reformatter.sh](https://github.com/GenomicaMicrob/db_reformatter).
42+
We recommend getting the [EzBioCloud](http://www.ezbiocloud.net/resources/pipelines) curated database, but since it is not publicly available (although it is free for academia), we cannot distribute it. If you get it, you'll have to format it accordingly. You can use our script [db_reformatter.sh](https://github.com/GenomicaMicrob/db_reformatter). To convert it to a udb file, you'll need vsearch 2.5.0:
43+
44+
`vsearch --makeudb_usearch EzBioCloud_v1.5.fasta --output EzBioCloud_v1.5.udb`
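
If several fasta databases need converting, the same command can be wrapped in a small helper. This is a sketch under assumptions: `udb_name` and `make_udb` are hypothetical names, the `DRY_RUN` variable is invented here for previewing, and the udb file is written with vsearch's `--output` option next to the fasta file.

```shell
#!/bin/sh
# Derive the udb file name from a fasta file name (db.fasta -> db.udb).
udb_name() {
    printf '%s.udb\n' "${1%.*}"
}

# Convert one fasta database to udb; with DRY_RUN=1 only print the command.
make_udb() {
    out=$(udb_name "$1")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "vsearch --makeudb_usearch $1 --output $out"
    else
        vsearch --makeudb_usearch "$1" --output "$out"
    fi
}
```

A loop such as `for db in *.fasta; do make_udb "$db"; done` would then convert every fasta file in the current folder; setting `DRY_RUN=1` first shows the commands without running them.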
4345

4446
### Usage
4547
Go to a folder where you have all your clean multifasta files of your samples and type:
@@ -50,18 +52,20 @@ If you want to classify only one or some files, you can type them:
5052

5153
`$ ./mg_classifier.sh file1.fasta file2.fna`
5254

53-
Afterwards, it will present a menu where you can select a database to use. Since it is super fast, you probably don't need to close the terminal, but i case you do, it will continue working as long as you do NOT cancel the process with `Crtl Z`. So, if you want to exit but leave it running, just close the terminal window.
55+
Afterwards, it will present a menu where you can select a database to use. Since it is super fast, you probably don't need to close the terminal, but in case you do, it will continue working as long as you do NOT cancel the process with `Ctrl Z`. So, if you want to exit but leave it running, just close the terminal window.
5456

5557
You can get help by typing:
5658

5759
`$ ./mg_classifier.sh -h`
5860

5961
### Output
60-
mg_classifier will produce four files:
62+
mg_classifier will produce six files:
6163
- **otus.tsv** A file containing the taxonomy and numbers of sequences per sample.
6264
- **bacteria.spf** or **eukaryota.spf** A file that can be opened directly by [STAMP](http://kiwi.cs.dal.ca/Software/STAMP); the name of the file depends on whether 16S or 18S was chosen.
63-
- **OTUs_taxon.tsv** Information of how many hits per taxonomic level per sample.
64-
- **mgclassifier.log** A log file.
65+
- **samples-tax.tsv** Similar to otus.tsv but the sample data are first and the taxonomy in the last column.
66+
- **OTUs_summary.tsv** Information of how many hits per taxonomic level per sample.
67+
- **mg_classifier.report** A report file of the results.
68+
- **mg_classifier.log** A log file.
6569

6670
All files are tab-separated, so they can also be opened with Excel.
6771

@@ -82,22 +86,22 @@ Four samples were classified; the first column has the percentage of similitud o
8286
Threshold values for delimiting a bacterial taxon were taken from [Yarza et al. 2014. Nat Rev Microbiol 12:635-645](http://www.nature.com/nrmicro/journal/v12/n9/full/nrmicro3330.html). For 18S, the species threshold was set to 99.0 %, but other taxon thresholds were kept the same as for bacteria. This still has to be fine-tuned, so use it with caution.
8387

8488
### How fast?
85-
22,267 16S V4 sequences (mean length 228.7 bases, 5.18 million bases) in 4 files were classified in a 64 core 128 GB RAM Dell PowerEdge R810 server with the following results:
89+
22,267 16S V4 sequences (mean length 228.7 bases, 5.18 million bases) in 4 files were classified on a 64-core, 128 GB RAM Dell PowerEdge R810 server with an SSD, with the following results:
8690

87-
| Database | DB size (MB) | Time (min) |
88-
| --- | ---: | ---:|
89-
| EzBioCloud_v1.5 | 96.6 | **0:22** |
90-
| SILVA_v128 | 303.4 | **1:11** |
91-
| RDP_v11.5 | 3,200.0 | **12:11** |
92-
| RDP_v11.5 V3-V4 | 1,066.5 | **3:16** |
91+
| Database | DB size (MB) | Time (min:s), fasta db | Time (min:s), udb db |
92+
| --- | ---: | ---:| ---:|
93+
| EzBioCloud_v1.5 | 96.6 | **0:22** | **0:07** |
94+
| SILVA_v128 | 303.4 | **1:11** | **0:11** |
95+
| RDP_v11.5 | 3,200.0 | **12:11** | **2:29** |
96+
| RDP_v11.5 V3-V4 | 1,066.5 | **3:16** | **0:34** |
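
The speed-up from fasta to udb can be worked out from the table by converting each `min:s` value to seconds and dividing. The helpers below are a hypothetical sketch for that arithmetic only; they are not part of mg_classifier.

```shell
#!/bin/sh
# Convert a "min:s" time such as 12:11 into seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN { split(t, p, ":"); print p[1] * 60 + p[2] }'
}

# Ratio of the fasta time ($1) to the udb time ($2), to one decimal place.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

speedup 0:22 0:07    # EzBioCloud_v1.5
speedup 1:11 0:11    # SILVA_v128
speedup 12:11 2:29   # RDP_v11.5
speedup 3:16 0:34    # RDP_v11.5 V3-V4
```

With the times reported above, this prints 3.1, 6.5, 4.9 and 5.8, so on this server the udb databases were roughly three to six times faster.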
9397

9498
Time depends on many factors, most notably:
9599
- **Size of the database**.
96100
- Number of CPUs, since each sample (fasta file) runs on one CPU.
97101
- Sequences per sample.
98102
- Mean size of the sequences.
99103
- The number of samples matters less, as samples are processed simultaneously; the number of cores of the server is the limiting factor here.
100-
- A big constrain might be the RAM memory available since it will upload to the memory the database, once per sample. So, if you process 10 samples with the 1 GB RDP_V3-V4 databse, you´ll need about 10 GB of RAM.
104+
- A big constraint might be the available RAM, since the database is loaded into memory once per sample. So, if you process 10 samples with the 1 GB RDP_V3-V4 database, you'll need about 10 GB of RAM. This limitation is reduced if you use the udb formatted databases.
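
The RAM rule of thumb in the last point (one copy of the fasta database per sample) can be turned into a quick estimate. This is a rough sketch with hypothetical helper names; it ignores memory used by anything other than the database copies and does not apply to udb databases.

```shell
#!/bin/sh
# Estimated RAM in MB: database size in MB ($1) times number of samples ($2).
estimate_ram_mb() {
    awk -v db="$1" -v n="$2" 'BEGIN { printf "%.1f\n", db * n }'
}

# Largest number of samples whose database copies fit in the given RAM.
# $1 = available RAM in MB, $2 = database size in MB.
max_samples() {
    awk -v ram="$1" -v db="$2" 'BEGIN { print int(ram / db) }'
}
```

For example, `estimate_ram_mb 1066.5 10` gives 10665.0 MB, which matches the figure of about 10 GB quoted above, and `max_samples 131072 1066.5` suggests that a 128 GB server could hold at most 122 copies of that database.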
101105

102106
### Acknowledgments
103107
This work was partially supported by the CONACYT project CB-2014-01 238458 and by CIAD, A.C.
