110 changes: 110 additions & 0 deletions docs/docs/tool-reference/taxonomic-classifiers/centrifuge.md
@@ -0,0 +1,110 @@
**Centrifuge** is a rapid and memory-efficient classifier of DNA sequences from microbial samples.
Centrifuge requires a relatively small genome index (e.g., 4.3 GB for ~4,100 bacterial genomes) and can process a
typical DNA sequencing run within an hour. For more information,
see the tool's [website](https://ccb.jhu.edu/software/centrifuge/) and
[GitHub repo](https://github.com/DaehwanKimLab/centrifuge).

Function Call
=============

```python
tc.centrifuge(
output_path=None,
tool_args="",
database_name="centrifuge_refseq_bacteria_archaea_viral_human",
database_version="1",
read_one=None,
read_two=None,
unpaired=None,
is_async=False,
)
```

Function Arguments
------------------

| Argument | Use in place of: | Description |
|:-------------------|:------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `read_one`         | `-1`                                | (optional) Path(s) to R1 of paired-end read input files. The files can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `read_two`         | `-2`                                | (optional) Path(s) to R2 of paired-end read input files. The files can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `unpaired`         | `-U`                                | (optional) Path(s) to unpaired input files. The files can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `output_path`      | output arguments (`-S`, `--report`) | (optional) Path (directory) to which the output files will be downloaded. If omitted, the download is skipped. The path can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `tool_args` | all other arguments | (optional) Additional arguments to be passed to Centrifuge. This should be a string of arguments like the command line. See [Supported Additional Arguments](#supported-additional-arguments) for more details. |
| `database_name` | `-x`\* | (optional) Name of database to use for Centrifuge classification. Defaults to `"centrifuge_refseq_bacteria_archaea_viral_human"` (Refseq bacteria / archaea / viral / human). |
| `database_version` | `-x`\* | (optional) Version of database to use for Centrifuge classification. Defaults to `"1"`. |
| `is_async` | | Whether to run a job asynchronously. See [Async Runs](../../feature-reference/async-runs.md) for more. |

*See the [Databases](#databases) section for more details.
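
Putting the arguments together, a typical paired-end call might look like the following sketch (the S3 paths and the output directory are placeholders, not real files):

```python
import toolchest_client as tc

tc.set_key("YOUR_TOOLCHEST_KEY")
tc.centrifuge(
    read_one="s3://example-bucket/sample_R1.fastq",
    read_two="s3://example-bucket/sample_R2.fastq",
    output_path="./centrifuge_output/",
)
```

The database arguments are omitted here, so the default RefSeq index is used.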

Output Files
------------

A Centrifuge run will output these files into `output_path`:

- `centrifuge_output.txt`: Centrifuge output (captured from `stdout`), from the `-S` argument.
- `centrifuge_report.tsv`: Centrifuge report file, from the `--report` argument.
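
Once downloaded, the report can be inspected with standard TSV tooling. A minimal sketch, assuming the column names of Centrifuge's standard `--report` format (they are not documented on this page):

```python
import csv
import io

# Inline stand-in for a downloaded centrifuge_report.tsv; the header row
# follows Centrifuge's usual report columns (an assumption, not verified here).
sample_report = (
    "name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n"
    "Escherichia coli\t562\tspecies\t4641652\t120\t95\t0.42\n"
)

# Parse the tab-separated report into one dict per taxon.
rows = list(csv.DictReader(io.StringIO(sample_report), delimiter="\t"))
for row in rows:
    print(row["name"], row["numReads"])  # → Escherichia coli 120
```

In practice you would pass `open(f"{output_path}/centrifuge_report.tsv")` instead of the inline string.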

Notes
-----

### Paired-end reads

For paired-end inputs, each mate must occupy the same position in its respective list. For example, two
pairs of paired-end files – `one_R1.fastq`, `one_R2.fastq`, `two_R1.fastq`, `two_R2.fastq` – should be passed to
Toolchest as:

```python
tc.centrifuge(
read_one=["one_R1.fastq", "two_R1.fastq"],
read_two=["one_R2.fastq", "two_R2.fastq"],
...
)
```

Tool Versions
=============

Toolchest currently supports version **1.0.4** of Centrifuge.

Databases
=========

Toolchest currently supports the following databases for Centrifuge:

| `database_name` | `database_version` | Description |
|:-------------------------------------------------------| :----------------- |:-------------------------------------------------------------------|
| `centrifuge_refseq_bacteria_archaea_viral_human` | `1` | RefSeq, bacteria / archaea / viral / human, JHU source<sup>1</sup> |

<sup>1</sup>These database indexes were generated by [the Langmead Lab at Johns Hopkins](https://langmead-lab.org/) and can be found on [the lab's database index page](https://benlangmead.github.io/aws-indexes/centrifuge).

Supported Additional Arguments
==============================

Most additional arguments not related to input, output, or multithreading are supported:
- `-q`
- `--qseq`
- `-f`
- `-r`
- `-c`
- `-s`, `--skip`
- `-u`, `--upto`
- `-5`, `--trim5`
- `-3`, `--trim3`
- `--phred33`
- `--phred64`
- `--int-quals`
- `--ignore-quals`
- `--nofw`
- `--norc`
- `--min-hitlen`
- `-k`
- `--host-taxids`
- `--exclude-taxids`
- `--out-fmt`
- `--tab-fmt-cols`
- `-t`, `--time`
- `--qc-filter`
- `--seed`
- `--non-deterministic`

Additional arguments can be specified under the `tool_args` argument.
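
For example, to require a minimum hit length and report up to 5 assignments per read (the flag values here are illustrative):

```python
tc.centrifuge(
    tool_args="--min-hitlen 50 -k 5",
    ...
)
```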
1 change: 1 addition & 0 deletions docs/mkdocs.yaml
@@ -59,6 +59,7 @@ nav:
- AlphaFold: "tool-reference/structure-prediction/alphafold.md"
- Taxonomic Classifiers:
- About Taxanomic Classifiers: "tool-reference/taxonomic-classifiers.md"
- Centrifuge: "tool-reference/taxonomic-classifiers/centrifuge.md"
- Kraken 2: "tool-reference/taxonomic-classifiers/kraken-2.md"
- MetaPhlAn: "tool-reference/taxonomic-classifiers/metaphlan.md"
- Workflows / Meta-Tools:
60 changes: 60 additions & 0 deletions tests/test_centrifuge.py
@@ -0,0 +1,60 @@
import os
import pytest

from tests.util import hash
import toolchest_client as toolchest

toolchest_api_key = os.environ.get("TOOLCHEST_API_KEY")
if toolchest_api_key:
toolchest.set_key(toolchest_api_key)


@pytest.mark.integration
def test_centrifuge_many_types():
"""
Tests Centrifuge with one pair of paired-end inputs and two single-end inputs.
"""
test_dir = "temp_test_centrifuge/many_types"
os.makedirs(f"./{test_dir}", exist_ok=True)
output_dir_path = f"./{test_dir}"
output_file_path = f"{output_dir_path}/centrifuge_output.txt"
output_report_path = f"{output_dir_path}/centrifuge_report.tsv"

toolchest.centrifuge(
read_one="s3://toolchest-integration-tests/megahit/r3_1.fa",
read_two="s3://toolchest-integration-tests/megahit/r3_2.fa",
unpaired="s3://toolchest-integration-tests/megahit/r4.fa",
tool_args="-f",
output_path=output_dir_path,
)

assert hash.unordered(output_file_path) == 1779279198
assert hash.unordered(output_report_path) == 1100843098


@pytest.mark.integration
def test_centrifuge_multiple_pairs():
"""
Tests Centrifuge with two pairs of paired-end inputs.
"""
test_dir = "temp_test_centrifuge/multiple_pairs"
os.makedirs(f"./{test_dir}", exist_ok=True)
output_dir_path = f"./{test_dir}"
output_file_path = f"{output_dir_path}/centrifuge_output.txt"
output_report_path = f"{output_dir_path}/centrifuge_report.tsv"

toolchest.centrifuge(
read_one=[
"s3://toolchest-integration-tests/sample_r1.fastq.gz",
"s3://toolchest-integration-tests/r1.fastq.gz",
],
read_two=[
"s3://toolchest-integration-tests/sample_r2.fastq.gz",
"s3://toolchest-integration-tests/r2.fastq.gz",
],
output_path=output_dir_path,
volume_size=32,
)

assert hash.unordered(output_report_path) == 1895979303
assert hash.unordered(output_file_path) == 1059786093
9 changes: 4 additions & 5 deletions toolchest_client/__init__.py
@@ -3,7 +3,6 @@
import builtins
from dotenv import load_dotenv, find_dotenv
import functools
-import os

# set __version__ module
try:
@@ -29,7 +28,7 @@
from toolchest_client.api.query import Query
from toolchest_client.api.status import Status, get_status
from toolchest_client.api.urls import get_api_url, set_api_url
-from .tools.api import add_database, alphafold, blastn, bowtie2, bracken, cellranger_count, clustalo, demucs,\
-    diamond_blastp, diamond_blastx, fastqc, humann3, jupyter, kallisto, kraken2, lastal5, lug, megahit, metaphlan, \
-    python3, rapsearch, rapsearch2, salmon, shi7, shogun_align, shogun_filter, STAR, test, transfer, unicycler, \
-    update_database
+from .tools.api import add_database, alphafold, blastn, bowtie2, bracken, cellranger_count, centrifuge, clustalo, \
+    demucs, diamond_blastp, diamond_blastx, fastqc, humann3, jupyter, kallisto, kraken2, lastal5, lug, megahit, \
+    metaphlan, python3, rapsearch, rapsearch2, salmon, shi7, shogun_align, shogun_filter, STAR, test, transfer, \
+    unicycler, update_database
3 changes: 2 additions & 1 deletion toolchest_client/files/__init__.py
@@ -1,4 +1,5 @@
-from .general import assert_exists, check_file_size, files_in_path, sanity_check, compress_files_in_path
+from .general import assert_exists, check_file_size, files_in_path, sanity_check, compress_files_in_path, \
+    convert_input_params_to_prefix_mapping
from .merge import concatenate_files, merge_sam_files
from .s3 import assert_accessible_s3, get_s3_file_size, get_params_from_s3_uri, path_is_s3_uri
from .split import open_new_output_file, split_file_by_lines, split_paired_files_by_lines
51 changes: 51 additions & 0 deletions toolchest_client/files/general.py
@@ -136,3 +136,54 @@ def sanity_check(file_path):
assert_exists(file_path, must_be_file=True)
if os.stat(file_path).st_size <= 5:
raise ValueError(f"File at {file_path} is suspiciously small")


def convert_input_params_to_prefix_mapping(tag_to_param_map):
"""
Parses input parameters in a Toolchest call into:
- a list of all input paths (for uploading)
- a mapping of inputs to their respective prefixes

Example input params map:
{
"-1": ["example_R1.fastq"],
"-2": ["example_R2.fastq"],
"-U": ["example_U.fastq"],
}

Example output list:
    ["example_R1.fastq", "example_R2.fastq", "example_U.fastq"]

Example output prefix mapping:
{
"example_R1.fastq": {
"prefix": "-1",
"order": 0,
},
"example_R2.fastq": {
"prefix": "-2",
"order": 0,
},
"example_U.fastq": {
"prefix": "-U",
"order": 0,
},
}
"""
input_list = [] # list of all inputs
input_prefix_mapping = {} # map of each input to its respective tag
for tag, param in tag_to_param_map.items():
if isinstance(param, list):
for index, input_file in enumerate(param):
input_list.append(input_file)
input_prefix_mapping[input_file] = {
"prefix": tag,
"order": index,
}
elif isinstance(param, str):
input_list.append(param)
input_prefix_mapping[param] = {
"prefix": tag,
"order": 0,
}
return input_list, input_prefix_mapping
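
The mapping logic above can be exercised in isolation. A standalone sketch (the function name `build_prefix_mapping` is illustrative, not part of the library):

```python
def build_prefix_mapping(tag_to_param_map):
    """Flatten {tag: path(s)} into an upload list and a per-file prefix map."""
    input_list = []
    input_prefix_mapping = {}
    for tag, param in tag_to_param_map.items():
        # Normalize a single string into a one-element list.
        paths = [param] if isinstance(param, str) else param
        if not isinstance(paths, list):
            continue  # ignore None / unsupported values
        for index, input_file in enumerate(paths):
            input_list.append(input_file)
            input_prefix_mapping[input_file] = {"prefix": tag, "order": index}
    return input_list, input_prefix_mapping


inputs, mapping = build_prefix_mapping({
    "-1": ["one_R1.fastq", "two_R1.fastq"],
    "-2": ["one_R2.fastq", "two_R2.fastq"],
    "-U": "unpaired.fastq",
})
print(inputs)
# The second R2 file keeps its tag and its position within the "-2" list:
print(mapping["two_R2.fastq"])  # → {'prefix': '-2', 'order': 1}
```

The `order` field is what lets paired files be re-associated by position after upload.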
45 changes: 44 additions & 1 deletion toolchest_client/files/tests/test_general.py
@@ -3,7 +3,7 @@

import pytest

-from .. import assert_exists, check_file_size, files_in_path, sanity_check
+from .. import assert_exists, check_file_size, files_in_path, sanity_check, convert_input_params_to_prefix_mapping

THIS_FILE_PATH = os.path.normpath(pathlib.Path(__file__).parent.resolve())

@@ -48,3 +48,46 @@ def test_exists_but_not_file():
dir_file_path = f"{THIS_FILE_PATH}/data"
with pytest.raises(ValueError):
assert_exists(dir_file_path, must_be_file=True)


def test_generate_prefix_mapping():
tag_to_param_map = {
"-1": ["example1_R1.fastq", "example2_R1.fastq"],
"-2": ["example1_R2.fastq", "example2_R2.fastq"],
"-U": ["example1_U.fastq", "example2_U.fastq"],
}
input_list, prefix_mapping = convert_input_params_to_prefix_mapping(tag_to_param_map)
assert sorted(input_list) == sorted([
"example1_R1.fastq",
"example2_R1.fastq",
"example1_R2.fastq",
"example2_R2.fastq",
"example1_U.fastq",
"example2_U.fastq",
])
assert prefix_mapping == {
"example1_R1.fastq": {
"prefix": "-1",
"order": 0,
},
"example1_R2.fastq": {
"prefix": "-2",
"order": 0,
},
"example1_U.fastq": {
"prefix": "-U",
"order": 0,
},
"example2_R1.fastq": {
"prefix": "-1",
"order": 1,
},
"example2_R2.fastq": {
"prefix": "-2",
"order": 1,
},
"example2_U.fastq": {
"prefix": "-U",
"order": 1,
},
}
1 change: 1 addition & 0 deletions toolchest_client/tools/__init__.py
@@ -4,6 +4,7 @@
from .bowtie2 import Bowtie2
from .bracken import Bracken
from .cellranger import CellRangerCount
from .centrifuge import Centrifuge
from .clustalo import ClustalO
from .demucs import Demucs
from .diamond import DiamondBlastp, DiamondBlastx