Features/requirements for Be The Match collaboration #9

@heuermh

For lack of a better place for this, our collaboration with Be The Match will require:

  • Download BAM files from S3, transform to ADAM Avro+Parquet alignments, and upload to S3 (transform_alignments)
  • Download ADAM Avro+Parquet alignments for multiple samples from S3, update record groups to prevent collisions, merge into a single multi-sample ADAM Avro+Parquet alignments data set, and upload to S3 (merge_alignments)
  • Report BAM file sizes, single-sample ADAM Avro+Parquet alignments file sizes, and merged ADAM Avro+Parquet alignments file size
  • Download VCF files from S3, transform to ADAM Avro+Parquet variants and genotypes, and upload to S3 (transform_variants, transform_genotypes)
  • Download ADAM Avro+Parquet variants for multiple samples from S3, merge into a single sites-only ADAM Avro+Parquet variants data set, and upload to S3 (merge_variants)
  • Download ADAM Avro+Parquet genotypes for multiple samples from S3, merge into a single multi-sample ADAM Avro+Parquet genotypes data set, and upload to S3 (merge_genotypes)
  • Report VCF file sizes, single-sample ADAM Avro+Parquet variants and genotypes file sizes, and merged ADAM Avro+Parquet variants and genotypes file sizes
  • Notebook with queries comparing access performance of native files vs. transformed data sets, both read from S3
  • Documentation on how to run all of the above
  • Short manuscript on the transformation process, storage requirements, and access performance
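
For the "update record groups to prevent collision" step in merge_alignments, here is a minimal sketch of one possible approach, assuming collisions are avoided by prefixing each record group ID with its sample ID. `ReadGroup`, `qualify_read_groups`, and `merge_read_groups` are hypothetical names for illustration only, not ADAM APIs; a real job would also need to apply the returned ID mapping to every alignment record.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ReadGroup:
    """Hypothetical stand-in for a record (read) group; not ADAM's class."""
    id: str
    sample: str


def qualify_read_groups(
    sample_id: str, read_groups: List[ReadGroup]
) -> Tuple[List[ReadGroup], Dict[str, str]]:
    """Prefix each record group ID with the sample ID.

    Returns the renamed record groups and an old-ID -> new-ID mapping
    that could be used to rewrite the alignments themselves.
    """
    mapping = {rg.id: f"{sample_id}.{rg.id}" for rg in read_groups}
    renamed = [replace(rg, id=mapping[rg.id]) for rg in read_groups]
    return renamed, mapping


def merge_read_groups(per_sample: Dict[str, List[ReadGroup]]) -> List[ReadGroup]:
    """Merge record groups across samples, failing loudly on any collision."""
    merged: Dict[str, ReadGroup] = {}
    for sample_id, read_groups in per_sample.items():
        renamed, _ = qualify_read_groups(sample_id, read_groups)
        for rg in renamed:
            if rg.id in merged:
                raise ValueError(f"record group ID collision: {rg.id}")
            merged[rg.id] = rg
    return list(merged.values())


if __name__ == "__main__":
    # Two samples (hypothetical names) that both use record group ID "1".
    merged = merge_read_groups({
        "sampleA": [ReadGroup(id="1", sample="sampleA")],
        "sampleB": [ReadGroup(id="1", sample="sampleB")],
    })
    print([rg.id for rg in merged])  # ['sampleA.1', 'sampleB.1']
```

Prefixing with the sample ID keeps the original ID recoverable, but it only works if sample IDs are unique across the merged data set; the explicit collision check is there to catch the case where they are not.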

There hasn't been an ask for realigning reads, recalling variants, annotating variants with SnpEff, or joint genotyping yet, but there could be in the near future.
