Implement genomics pipeline #37

@dlebauer

Overview

  • The TERRA Reference team will coordinate implementation of a basic genome sequencing pipeline (e.g. Process Reads → Map & Assemble Reads → Call Variants → Annotate Variants), as described in "Proposed formats and databases for genomics data" (reference-data#19) and summarized in the figure below.
  • Most of our 400 lines will be resequenced, but ~40 lines will be sequenced for de novo assembly, so the pipeline(s) must accommodate both paths.
  • Mike Gore will lead the system architecture

[Figure: summary of the genome sequencing pipeline]
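As a starting point for discussion, the two paths mentioned above could be sketched roughly as follows. This is only an illustration, not a proposed design: the stage names are taken from the overview, but splitting "Map & Assemble Reads" into one stage per path, ending the de novo path at assembly, and all identifiers (`SequencingPath`, `plan_stages`, the line IDs) are assumptions made for the example.

```python
from enum import Enum


class SequencingPath(Enum):
    RESEQUENCING = "resequencing"
    DE_NOVO = "de_novo"


# Stage names mirror the overview. Splitting "Map & Assemble Reads" into one
# stage per path, and ending the de novo path at assembly, are assumptions.
STAGES = {
    SequencingPath.RESEQUENCING: [
        "process_reads",
        "map_reads",
        "call_variants",
        "annotate_variants",
    ],
    SequencingPath.DE_NOVO: [
        "process_reads",
        "assemble_reads",
    ],
}


def plan_stages(line_id: str, de_novo_lines: set[str]) -> list[str]:
    """Return the ordered stage names for one line."""
    if line_id in de_novo_lines:
        path = SequencingPath.DE_NOVO
    else:
        path = SequencingPath.RESEQUENCING
    return list(STAGES[path])  # copy so callers cannot mutate the table


if __name__ == "__main__":
    de_novo = {"line_007"}  # hypothetical line identifiers
    for line in ("line_001", "line_007"):
        print(line, "->", " > ".join(plan_stages(line, de_novo)))
```

Keeping the stage lists in one table would make it easy to see where the two paths share work (read processing) and where they diverge.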

Questions to address:

Resources?

  • How much computing capacity is required?
  • What are the data sizes?
    • sequencing coverage
    • number of samples, libraries, lanes
    • expected rate of data production over time
  • Do we have sample datasets so that we can set up the pipeline before receiving data? (perhaps by re-creating the Maize pipeline)
  • What software needs to be installed?
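Once coverage and sample counts are known, a first-pass estimate of raw data volume could be computed along these lines. This is a rough sketch only: the genome size and coverage values below are placeholders, not project numbers, and the default of 2 bytes per base is an approximation for uncompressed FASTQ (one base character plus one quality character, ignoring read headers). The function names are hypothetical.

```python
def estimate_raw_bases(genome_size_bp: int, coverage: int, n_samples: int) -> int:
    """Total sequenced bases: genome size x coverage x number of samples."""
    return genome_size_bp * coverage * n_samples


def estimate_fastq_bytes(total_bases: int, bytes_per_base: int = 2) -> int:
    """Approximate uncompressed FASTQ size.

    Assumes one byte for each base call and one for its quality score;
    read headers and compression are ignored.
    """
    return total_bases * bytes_per_base


if __name__ == "__main__":
    GENOME_SIZE_BP = 1_000_000_000  # placeholder, not a project value
    RESEQ_COVERAGE = 10             # placeholder
    DE_NOVO_COVERAGE = 50           # placeholder

    # Line counts follow the overview: 400 lines in total, ~40 for de novo assembly.
    reseq = estimate_raw_bases(GENOME_SIZE_BP, RESEQ_COVERAGE, 400 - 40)
    de_novo = estimate_raw_bases(GENOME_SIZE_BP, DE_NOVO_COVERAGE, 40)

    total_bytes = estimate_fastq_bytes(reseq + de_novo)
    print(f"Estimated raw FASTQ volume: {total_bytes / 1e12:.1f} TB")
```

Intermediate files (alignments, variant calls) would add to this, so the result is best read as a lower bound on storage.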

Division of Labor

  • Who does what among HPCBio, NCSA, Danforth, Cornell, and other teams?
  • What will be done where?
  • How will the data move from one location to another, and at what stage of processing?
  • To what extent do the workflows need to be automated?
  • A pipeline has already been implemented on several systems at NCSA; the code is available on GitHub: https://github.com/HPCBio/BW_VariantCalling (Documentation)
    • Can TERRA use this workflow, and what modifications would be necessary? Or would it be better to start from scratch?
