[GH-1172] Bulk Ingest Workflow Outputs #331

Merged 14 commits on Feb 25, 2021

Member

@ehigham ehigham commented Feb 24, 2021

RR: https://broadinstitute.atlassian.net/browse/GH-1172
The sarscov2_illumina_full pipeline has many outputs, many of which are files. Ingesting them one at a time is really slow, so Jade recommends using the "bulk" insert endpoints.
In this change:

  • Add bulk insert endpoint wrapper
  • Gather all (unique) files in a workflow and bulk ingest them
    • The same file can appear in more than one location in pipeline outputs.
    • It's an error to try to ingest the same file twice.
    • We're limited to 1000 files per bulk ingest. Muscles says it is "probably fine" to pin ingest sizes to this limit, so the implementation uses batches of 1000. We should probably test whether there's a sweet spot, but it's good enough for now.
  • Add mime-type support for new file extensions
    • Some mime-types cannot be determined from the file extension, so in the test I've added a way to exclude those.
  • Refactor the type dispatch logic into a function traverse for re-use.
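
For readers skimming the approach, here is a minimal sketch of the gather, de-duplicate, and batch steps described above. It is written in Python for illustration only and is not the Clojure code in this PR. The names `traverse_files`, `unique_files` and `batches` are hypothetical, and it assumes that outputs are nested maps and lists whose file values are `gs://` strings.

```python
from typing import Any, Iterator, List

# Limit quoted in the PR description; batch size is pinned to it.
MAX_FILES_PER_BULK_INGEST = 1000


def traverse_files(outputs: Any) -> Iterator[str]:
    """Yield every gs:// string found in a nested outputs structure.

    Assumption: files are the string leaves that start with "gs://".
    """
    if isinstance(outputs, str):
        if outputs.startswith("gs://"):
            yield outputs
    elif isinstance(outputs, dict):
        for value in outputs.values():
            yield from traverse_files(value)
    elif isinstance(outputs, (list, tuple)):
        for item in outputs:
            yield from traverse_files(item)
    # Any other leaf (None, numbers, booleans) is not a file and is skipped.


def unique_files(outputs: Any) -> List[str]:
    """Drop repeated files, keeping first-seen order."""
    return list(dict.fromkeys(traverse_files(outputs)))


def batches(items: List[str],
            size: int = MAX_FILES_PER_BULK_INGEST) -> Iterator[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

De-duplicating before batching matters because the same file can appear at several places in the outputs, and ingesting it twice is an error. Each chunk from `batches(unique_files(outputs))` would then go to the bulk insert wrapper in a single request.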

Note:

  • In this change, I'm suggesting that we give TDR reader access to the whole bucket instead of reader access to each object in the outputs. This may be controversial.

@ehigham ehigham changed the title from "sarscov2_illumina_full" to "[GH-1172] Bulk Ingest Workflow Outputs" Feb 25, 2021
@ehigham ehigham marked this pull request as ready for review February 25, 2021 01:33
@rexwangcc rexwangcc self-requested a review February 25, 2021 16:47
Contributor

@rexwangcc rexwangcc left a comment

Did one pass, looks good to me!

api/src/wfl/mime_type.clj (resolved)
api/src/wfl/service/datarepo.clj (resolved)
  - prefix with "workflow-launcher" to be more easily queryable
  - Use a UTC suffix to differentiate between loads
  - Simplify timestamp generation
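
As a rough illustration of the naming scheme in these commit messages, the sketch below builds an identifier from a "workflow-launcher" prefix and a UTC timestamp suffix. It is Python rather than the project's Clojure. The function name `load_identifier`, the separator, and the exact timestamp format are assumptions, not what the commits implement.

```python
from datetime import datetime, timezone
from typing import Optional

PREFIX = "workflow-launcher"  # prefix named in the commit message


def load_identifier(now: Optional[datetime] = None) -> str:
    """Return PREFIX plus a UTC timestamp, e.g. for telling loads apart.

    The ISO-8601-style layout is an assumed format, chosen so that
    identifiers sort chronologically and are easy to query by prefix.
    """
    moment = now if now is not None else datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{PREFIX}-{stamp}"
```

Passing `now` explicitly keeps the function deterministic in tests. With no argument it uses the current UTC time.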
@ehigham ehigham merged commit 09a324e into main Feb 25, 2021
@ehigham ehigham deleted the ehigham/GH-1172-sarscov2-illumina-full branch February 25, 2021 20:04
2 participants