110 changes: 110 additions & 0 deletions docs/docs/tool-reference/taxonomic-classifiers/centrifuge.md
@@ -0,0 +1,110 @@
**Centrifuge** is a rapid and memory-efficient classifier of DNA sequences from microbial samples.
Centrifuge requires a relatively small genome index (e.g., 4.3 GB for ~4,100 bacterial genomes) and can process a
typical DNA sequencing run within an hour. For more information,
see the tool's [website](https://ccb.jhu.edu/software/centrifuge/) and
[GitHub repo](https://github.com/DaehwanKimLab/centrifuge).

Function Call
=============

```python
tc.centrifuge(
output_path=None,
tool_args="",
database_name="centrifuge_refseq_bacteria_archaea_viral_human",
database_version="1",
read_one=None,
read_two=None,
unpaired=None,
is_async=False,
)
```

Function Arguments
------------------

| Argument | Use in place of: | Description |
|:-------------------|:------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `read_one`         | `-1`                                | (optional) Path(s) to R1 of paired-end read input files. The files can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `read_two`         | `-2`                                | (optional) Path(s) to R2 of paired-end read input files. The files can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `unpaired`         | `-U`                                | (optional) Path(s) to unpaired input files. The files can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `output_path`      | output arguments (`-S`, `--report`) | (optional) Path (directory) to which the output files will be downloaded. If omitted, the download is skipped. The path can be local or remote; see [Using Files](../../getting-started/using-files.md). |
| `tool_args` | all other arguments | (optional) Additional arguments to be passed to Centrifuge. This should be a string of arguments like the command line. See [Supported Additional Arguments](#supported-additional-arguments) for more details. |
| `database_name` | `-x`\* | (optional) Name of database to use for Centrifuge classification. Defaults to `"centrifuge_refseq_bacteria_archaea_viral_human"` (Refseq bacteria / archaea / viral / human). |
| `database_version` | `-x`\* | (optional) Version of database to use for Centrifuge classification. Defaults to `"1"`. |
| `is_async` | | Whether to run a job asynchronously. See [Async Runs](../../feature-reference/async-runs.md) for more. |

*See the [Databases](#databases) section for more details.
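
Putting the arguments together, a typical paired-end call might look like the following sketch (the S3 paths and the output directory are placeholders, not real files):

```python
import toolchest_client as tc

tc.set_key("YOUR_TOOLCHEST_KEY")
tc.centrifuge(
    read_one="s3://example-bucket/sample_R1.fastq",
    read_two="s3://example-bucket/sample_R2.fastq",
    output_path="./centrifuge_output/",
)
```

The database arguments are omitted here, so the default RefSeq index is used.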

Output Files
------------

A Centrifuge run will output these files into `output_path`:

- `centrifuge_output.txt`: Centrifuge output (captured from `stdout`), from the `-S` argument.
- `centrifuge_report.tsv`: Centrifuge report file, from the `--report` argument.
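
Once downloaded, the report can be inspected with standard TSV tooling. A minimal sketch, assuming the column names of Centrifuge's standard `--report` format (they are not documented on this page):

```python
import csv
import io

# Inline stand-in for a downloaded centrifuge_report.tsv; the header row
# follows Centrifuge's usual report columns (an assumption, not verified here).
sample_report = (
    "name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n"
    "Escherichia coli\t562\tspecies\t4641652\t120\t95\t0.42\n"
)

# Parse the tab-separated report into one dict per taxon.
rows = list(csv.DictReader(io.StringIO(sample_report), delimiter="\t"))
for row in rows:
    print(row["name"], row["numReads"])  # → Escherichia coli 120
```

In practice you would pass `open(f"{output_path}/centrifuge_report.tsv")` instead of the inline string.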

Notes
-----

### Paired-end reads

For paired-end inputs, each mate must occupy the same position in its respective list. For example, two
pairs of paired-end files – `one_R1.fastq`, `one_R2.fastq`, `two_R1.fastq`, `two_R2.fastq` – should be passed to
Toolchest as:

```python
tc.centrifuge(
read_one=["one_R1.fastq", "two_R1.fastq"],
read_two=["one_R2.fastq", "two_R2.fastq"],
...
)
```

Tool Versions
=============

Toolchest currently supports version **1.0.4** of Centrifuge.

Databases
=========

Toolchest currently supports the following databases for Centrifuge:

| `database_name` | `database_version` | Description |
|:-------------------------------------------------------| :----------------- |:-------------------------------------------------------------------|
| `centrifuge_refseq_bacteria_archaea_viral_human` | `1` | RefSeq, bacteria / archaea / viral / human, JHU source<sup>1</sup> |

<sup>1</sup>These database indexes were generated by [the Langmead Lab at Johns Hopkins](https://langmead-lab.org/) and can be found on [the lab's database index page](https://benlangmead.github.io/aws-indexes/centrifuge).

Supported Additional Arguments
==============================

Most additional arguments not related to input, output, or multithreading are supported:
- `-q`
- `--qseq`
- `-f`
- `-r`
- `-c`
- `-s`, `--skip`
- `-u`, `--upto`
- `-5`, `--trim5`
- `-3`, `--trim3`
- `--phred33`
- `--phred64`
- `--int-quals`
- `--ignore-quals`
- `--nofw`
- `--norc`
- `--min-hitlen`
- `-k`
- `--host-taxids`
- `--exclude-taxids`
- `--out-fmt`
- `--tab-fmt-cols`
- `-t`, `--time`
- `--qc-filter`
- `--seed`
- `--non-deterministic`

Additional arguments can be specified under the `tool_args` argument.
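
For example, to require a minimum hit length and report up to 5 assignments per read (the flag values here are illustrative):

```python
tc.centrifuge(
    tool_args="--min-hitlen 50 -k 5",
    ...
)
```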
1 change: 1 addition & 0 deletions docs/mkdocs.yaml
@@ -59,6 +59,7 @@ nav:
- AlphaFold: "tool-reference/structure-prediction/alphafold.md"
- Taxonomic Classifiers:
- About Taxanomic Classifiers: "tool-reference/taxonomic-classifiers.md"
- Centrifuge: "tool-reference/taxonomic-classifiers/centrifuge.md"
- Kraken 2: "tool-reference/taxonomic-classifiers/kraken-2.md"
- MetaPhlAn: "tool-reference/taxonomic-classifiers/metaphlan.md"
- Workflows / Meta-Tools:
60 changes: 60 additions & 0 deletions tests/test_centrifuge.py
@@ -0,0 +1,60 @@
import os
import pytest

from tests.util import hash
import toolchest_client as toolchest

toolchest_api_key = os.environ.get("TOOLCHEST_API_KEY")
if toolchest_api_key:
toolchest.set_key(toolchest_api_key)


@pytest.mark.integration
def test_centrifuge_many_types():
"""
Tests Centrifuge with one pair of paired-end inputs and two single-end inputs.
"""
test_dir = "temp_test_centrifuge/many_types"
os.makedirs(f"./{test_dir}", exist_ok=True)
output_dir_path = f"./{test_dir}"
output_file_path = f"{output_dir_path}/centrifuge_output.txt"
output_report_path = f"{output_dir_path}/centrifuge_report.tsv"

toolchest.centrifuge(
read_one="s3://toolchest-integration-tests/megahit/r3_1.fa",
read_two="s3://toolchest-integration-tests/megahit/r3_2.fa",
unpaired="s3://toolchest-integration-tests/megahit/r4.fa",
tool_args="-f",
output_path=output_dir_path,
)

assert hash.unordered(output_file_path) == 1779279198
assert hash.unordered(output_report_path) == 1100843098


@pytest.mark.integration
def test_centrifuge_multiple_pairs():
"""
Tests Centrifuge with two pairs of paired-end inputs.
"""
test_dir = "temp_test_centrifuge/multiple_pairs"
os.makedirs(f"./{test_dir}", exist_ok=True)
output_dir_path = f"./{test_dir}"
output_file_path = f"{output_dir_path}/centrifuge_output.txt"
output_report_path = f"{output_dir_path}/centrifuge_report.tsv"

toolchest.centrifuge(
read_one=[
"s3://toolchest-integration-tests/sample_r1.fastq.gz",
"s3://toolchest-integration-tests/r1.fastq.gz",
],
read_two=[
"s3://toolchest-integration-tests/sample_r2.fastq.gz",
"s3://toolchest-integration-tests/r2.fastq.gz",
],
output_path=output_dir_path,
volume_size=32,
)

assert hash.unordered(output_report_path) == 1895979303
assert hash.unordered(output_file_path) == 1059786093
9 changes: 4 additions & 5 deletions toolchest_client/__init__.py
@@ -3,7 +3,6 @@
import builtins
from dotenv import load_dotenv, find_dotenv
import functools
-import os

# set __version__ module
try:
@@ -29,7 +28,7 @@
from toolchest_client.api.query import Query
from toolchest_client.api.status import Status, get_status
from toolchest_client.api.urls import get_api_url, set_api_url
-from .tools.api import add_database, alphafold, blastn, bowtie2, bracken, cellranger_count, clustalo, demucs,\
-    diamond_blastp, diamond_blastx, fastqc, humann3, jupyter, kallisto, kraken2, lastal5, lug, megahit, metaphlan, \
-    python3, rapsearch, rapsearch2, salmon, shi7, shogun_align, shogun_filter, STAR, test, transfer, unicycler, \
-    update_database
+from .tools.api import add_database, alphafold, blastn, bowtie2, bracken, cellranger_count, centrifuge, clustalo, \
+    demucs, diamond_blastp, diamond_blastx, fastqc, humann3, jupyter, kallisto, kraken2, lastal5, lug, megahit, \
+    metaphlan, python3, rapsearch, rapsearch2, salmon, shi7, shogun_align, shogun_filter, STAR, test, transfer, \
+    unicycler, update_database
3 changes: 2 additions & 1 deletion toolchest_client/files/__init__.py
@@ -1,4 +1,5 @@
-from .general import assert_exists, check_file_size, files_in_path, sanity_check, compress_files_in_path
+from .general import assert_exists, check_file_size, files_in_path, sanity_check, compress_files_in_path, \
+    convert_input_params_to_prefix_mapping
from .merge import concatenate_files, merge_sam_files
from .s3 import assert_accessible_s3, get_s3_file_size, get_params_from_s3_uri, path_is_s3_uri
from .split import open_new_output_file, split_file_by_lines, split_paired_files_by_lines
51 changes: 51 additions & 0 deletions toolchest_client/files/general.py
@@ -136,3 +136,54 @@ def sanity_check(file_path):
assert_exists(file_path, must_be_file=True)
if os.stat(file_path).st_size <= 5:
raise ValueError(f"File at {file_path} is suspiciously small")


def convert_input_params_to_prefix_mapping(tag_to_param_map):
"""
Parses input parameters in a Toolchest call into:
- a list of all input paths (for uploading)
- a mapping of inputs to their respective prefixes

Example input params map:
{
"-1": ["example_R1.fastq"],
"-2": ["example_R2.fastq"],
"-U": ["example_U.fastq"],
}

Example output list:
    ["example_R1.fastq", "example_R2.fastq", "example_U.fastq"]

Example output prefix mapping:
{
"example_R1.fastq": {
"prefix": "-1",
"order": 0,
},
"example_R2.fastq": {
"prefix": "-2",
"order": 0,
},
"example_U.fastq": {
"prefix": "-U",
"order": 0,
},
}
"""
input_list = [] # list of all inputs
input_prefix_mapping = {} # map of each input to its respective tag
for tag, param in tag_to_param_map.items():
if isinstance(param, list):
for index, input_file in enumerate(param):
input_list.append(input_file)
input_prefix_mapping[input_file] = {
"prefix": tag,
"order": index,
}
elif isinstance(param, str):
input_list.append(param)
input_prefix_mapping[param] = {
"prefix": tag,
"order": 0,
}
return input_list, input_prefix_mapping
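
The mapping logic above can be exercised in isolation. A standalone sketch (the function name `build_prefix_mapping` is illustrative, not part of the library):

```python
def build_prefix_mapping(tag_to_param_map):
    """Flatten {tag: path(s)} into an upload list and a per-file prefix map."""
    input_list = []
    input_prefix_mapping = {}
    for tag, param in tag_to_param_map.items():
        # Normalize a single string into a one-element list.
        paths = [param] if isinstance(param, str) else param
        if not isinstance(paths, list):
            continue  # ignore None / unsupported values
        for index, input_file in enumerate(paths):
            input_list.append(input_file)
            input_prefix_mapping[input_file] = {"prefix": tag, "order": index}
    return input_list, input_prefix_mapping


inputs, mapping = build_prefix_mapping({
    "-1": ["one_R1.fastq", "two_R1.fastq"],
    "-2": ["one_R2.fastq", "two_R2.fastq"],
    "-U": "unpaired.fastq",
})
print(inputs)
# The second R2 file keeps its tag and its position within the "-2" list:
print(mapping["two_R2.fastq"])  # → {'prefix': '-2', 'order': 1}
```

The `order` field is what lets paired files be re-associated by position after upload.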
45 changes: 44 additions & 1 deletion toolchest_client/files/tests/test_general.py
@@ -3,7 +3,7 @@

import pytest

-from .. import assert_exists, check_file_size, files_in_path, sanity_check
+from .. import assert_exists, check_file_size, files_in_path, sanity_check, convert_input_params_to_prefix_mapping

THIS_FILE_PATH = os.path.normpath(pathlib.Path(__file__).parent.resolve())

@@ -48,3 +48,46 @@ def test_exists_but_not_file():
dir_file_path = f"{THIS_FILE_PATH}/data"
with pytest.raises(ValueError):
assert_exists(dir_file_path, must_be_file=True)


def test_generate_prefix_mapping():
tag_to_param_map = {
"-1": ["example1_R1.fastq", "example2_R1.fastq"],
"-2": ["example1_R2.fastq", "example2_R2.fastq"],
"-U": ["example1_U.fastq", "example2_U.fastq"],
}
input_list, prefix_mapping = convert_input_params_to_prefix_mapping(tag_to_param_map)
assert sorted(input_list) == sorted([
"example1_R1.fastq",
"example2_R1.fastq",
"example1_R2.fastq",
"example2_R2.fastq",
"example1_U.fastq",
"example2_U.fastq",
])
assert prefix_mapping == {
"example1_R1.fastq": {
"prefix": "-1",
"order": 0,
},
"example1_R2.fastq": {
"prefix": "-2",
"order": 0,
},
"example1_U.fastq": {
"prefix": "-U",
"order": 0,
},
"example2_R1.fastq": {
"prefix": "-1",
"order": 1,
},
"example2_R2.fastq": {
"prefix": "-2",
"order": 1,
},
"example2_U.fastq": {
"prefix": "-U",
"order": 1,
},
}
1 change: 1 addition & 0 deletions toolchest_client/tools/__init__.py
@@ -4,6 +4,7 @@
from .bowtie2 import Bowtie2
from .bracken import Bracken
from .cellranger import CellRangerCount
from .centrifuge import Centrifuge
from .clustalo import ClustalO
from .demucs import Demucs
from .diamond import DiamondBlastp, DiamondBlastx