@lebovic lebovic commented Dec 24, 2021

Adds:

  • megahit

Note that the extra commits below are from a squash merge on the last staging -> main commit.

bcai2 and others added 20 commits December 8, 2021 12:48
Adds support for inputs that are provided as AWS S3 URIs.

Parallelization is disabled for any job that uses an S3 URI input, since parallelizing would require redownloading the file from S3, splitting it locally, and reuploading the split inputs to S3.

Note: Uploading is skipped since the input file will already be in S3. However, the status is still updated to TRANSFERRING_FROM_CLIENT and TRANSFERRING_TO_CLIENT, like the status path for a locally uploaded file.
Adds a check on the size of an input provided as an S3 URI, as a safeguard. The limit is the same as the size limit for a user-uploaded file (4.5 GB).
Modifies:
- Refactors propagation of S3 tags for inputs in tools/tool.py and api/query.py.
- Enforces uniform naming scheme of variables/methods (inputs_are_in_s3(), input_is_in_s3).
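
A rough sketch of how the S3 checks described above could fit together. The names `inputs_are_in_s3` and `input_is_in_s3` come from the naming scheme mentioned above, but their signatures here are guesses. The helpers `check_input_size` and `should_parallelize`, and the decimal reading of 4.5 GB, are assumptions for illustration rather than the actual implementation.

```python
from urllib.parse import urlparse

# Assumption: 4.5 GB read as decimal gigabytes.
MAX_INPUT_BYTES = int(4.5 * 10**9)


def input_is_in_s3(path: str) -> bool:
    """Return True if the path looks like an S3 URI (s3://bucket/key)."""
    parsed = urlparse(path)
    return parsed.scheme == "s3" and bool(parsed.netloc)


def inputs_are_in_s3(paths) -> bool:
    """Return True if any input is an S3 URI."""
    return any(input_is_in_s3(p) for p in paths)


def check_input_size(size_bytes: int, limit_bytes: int = MAX_INPUT_BYTES) -> None:
    """Hypothetical safeguard: reject inputs larger than the per-file limit."""
    if size_bytes > limit_bytes:
        raise ValueError(
            f"Input is {size_bytes} bytes, over the {limit_bytes}-byte limit"
        )


def should_parallelize(paths, tool_supports_parallelism: bool) -> bool:
    """Hypothetical rule: never parallelize a job that has an S3 URI input."""
    return tool_supports_parallelism and not inputs_are_in_s3(paths)
```

In a real client the object size would come from S3 metadata before `check_input_size` is called; that lookup is left out here to keep the sketch free of network calls.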
Adds:
- More STAR args
- Multiple levels of tool_arg handling (whitelist, dangerlist, blacklist)
- An error on unknown or blacklisted args
- Reduced processing (validation and parallelization are skipped, for now) if a dangerous argument is passed
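
As an illustration of the three levels listed above, here is a minimal sketch. The flag names, the `ArgLevel` enum, and `screen_tool_args` are all hypothetical and do not reflect which STAR arguments sit in which list.

```python
from enum import Enum


class ArgLevel(Enum):
    WHITELIST = "whitelist"    # accepted with full processing
    DANGERLIST = "dangerlist"  # accepted, but processing is reduced
    BLACKLIST = "blacklist"    # always rejected


# Hypothetical placeholder flags, not real STAR arguments.
EXAMPLE_LEVELS = {
    "--safe-flag": ArgLevel.WHITELIST,
    "--risky-flag": ArgLevel.DANGERLIST,
    "--forbidden-flag": ArgLevel.BLACKLIST,
}


def screen_tool_args(args, levels=EXAMPLE_LEVELS) -> bool:
    """Return True if full processing (validation, parallelization) can stay on.

    Raises ValueError on unknown or blacklisted flags. Tokens that do not
    start with "-" are treated as values of the preceding flag, so this
    simplification would misread negative numeric values as flags.
    """
    full_processing = True
    for token in args:
        if not token.startswith("-"):
            continue
        level = levels.get(token)
        if level is None:
            raise ValueError(f"Unknown argument: {token}")
        if level is ArgLevel.BLACKLIST:
            raise ValueError(f"Blacklisted argument: {token}")
        if level is ArgLevel.DANGERLIST:
            full_processing = False
    return full_processing
```

Returning a single boolean keeps the caller simple: when it is `False`, the job would run without validation or parallelization.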
Modifies:
- The test tool's per-file limit, from 4.5 GB (default) to 256 GB
… sizes (#63)

Modifies:
- Which value is used to limit per-tool upload sizes
Modifies:
- Actually download files
Modifies:
- The STAR per-file limit
…across the board (#68)

Adds:
- Output file compression across the board, except if parallelism is actively being used.
- Explicit `parallelize=True` flag.
- 32 GB of pure EBS disk for shi7 and an increased limit of 5 GB per file. This can be trivially increased.

Modifies:
- Output file format for many tools.

Removes:
- cutadapt (no longer in use)
- shi7 parallelization (no longer needed)
- shogun parallelization (no longer needed)

Associated design doc: https://docs.google.com/document/d/1vn26gPgLHvSqDREpLXoFp8LV_5wDhqWRGXkCBuKTknM/edit#
Sets the default database for bowtie2 to GRCh38_noalt_as (the no-alt GRCh38 analysis set), found at https://benlangmead.github.io/aws-indexes/bowtie.

Co-authored-by: Noah Lebovic <noah@lebovic.com>
Adds:
- support for `megahit`, via `toolchest.megahit()`.

Of note:
- `megahit` processes inputs via four different tags (`-1`, `-2`, `--12`, `-r`) that can each represent an arbitrarily large number of input files (comma-delimited in a single string).
- Input files can be passed in as an individual path (`'/path/to/file'`) or as a list of paths (`['/path/to/file1', '/path/to/file2']`). When inputs are given in a list, the lists are flattened into one large `inputs` list, and each input file is registered in the API with its associated input tag (`-1`, `-2`, `--12`, `-r`). (If a tag has multiple inputs, the tag argument is subsequently processed by the worker node.)
- Though `megahit` can take in an arbitrarily large number of input files, the input file limit for megahit is set at 10.
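
The flattening described above could look roughly like the sketch below. The helper names `flatten_tagged_inputs` and `build_tag_args`, and the dict-based input shape, are assumptions for illustration; they are not the actual `toolchest.megahit()` signature or the worker-node code.

```python
MEGAHIT_TAGS = ("-1", "-2", "--12", "-r")
MAX_INPUT_FILES = 10  # the input file limit noted above


def flatten_tagged_inputs(tagged_inputs):
    """Flatten {tag: path or list of paths} into a list of (tag, path) pairs."""
    unknown = set(tagged_inputs) - set(MEGAHIT_TAGS)
    if unknown:
        raise ValueError(f"Unknown input tags: {sorted(unknown)}")
    flat = []
    for tag in MEGAHIT_TAGS:
        value = tagged_inputs.get(tag)
        if value is None:
            continue
        # A single path is treated as a one-element list.
        paths = [value] if isinstance(value, str) else list(value)
        flat.extend((tag, path) for path in paths)
    if len(flat) > MAX_INPUT_FILES:
        raise ValueError(f"{len(flat)} inputs given; the limit is {MAX_INPUT_FILES}")
    return flat


def build_tag_args(flat_inputs):
    """Regroup (tag, path) pairs into comma-delimited tag arguments."""
    grouped = {}
    for tag, path in flat_inputs:
        grouped.setdefault(tag, []).append(path)
    args = []
    for tag, paths in grouped.items():
        args += [tag, ",".join(paths)]
    return args


flat = flatten_tagged_inputs({
    "-1": ["a_1.fq", "b_1.fq"],
    "-2": ["a_2.fq", "b_2.fq"],
    "-r": "single.fq",
})
print(build_tag_args(flat))
# ['-1', 'a_1.fq,b_1.fq', '-2', 'a_2.fq,b_2.fq', '-r', 'single.fq']
```

Keeping each file paired with its tag in the flat list is what lets the comma-delimited string per tag be rebuilt later.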

Co-authored-by: Noah Lebovic <noah@lebovic.com>
@lebovic lebovic merged commit 58955c6 into main Dec 24, 2021