all-of-us/long-reads-public-codebase

All of Us Long Read Phase 1 Workflows

This repository contains all of the reproducible WDL workflows used in Phase 1 of the All of Us Long Read (AoU-LR) project. These workflows cover various steps of long-read genomic analysis and are provided for transparency and reuse.

Please note: this code's organization is in flux.


Dockstore Workflows (Organized by Functionality)

1. Read and Assembly Processing

| Workflow | Location | Description |
| --- | --- | --- |
| HiFiBamToFastQ | Dockstore | Converts PacBio HiFi BAM files to FASTQ format for downstream analysis. |
| MergeFastqs | Dockstore | Merges multiple FASTQ files into a single set per sample. |
| PBAssembleWithHifiasm | Dockstore | Assembles PacBio HiFi reads using Hifiasm. |
| MapAssemblyContigs | Dockstore | Aligns assembly contigs to a reference to validate assemblies or identify SVs. |
| EvaluateAssemblyHap | Dockstore | Assesses haplotype-resolved assemblies against reference sequences. |

2. Small Variant Calling & Summaries

| Workflow | Location | Description |
| --- | --- | --- |
| 1074.T2T.SmallVariantsBasicMetrics | Dockstore | Computes metrics for small variants on the T2T reference. |
| PBCCSWholeGenome | Dockstore | Calls small variants genome-wide from PacBio CCS reads. |
| SummarizeDVPSmallVariants | Dockstore | Summarizes variants from the DeepVariant+Pepper pipeline. |
| SummarizePAVSmallVariants | Dockstore | Summarizes small variants discovered in PAV contexts. |

3. Structural Variant (SV) Discovery & Integration

| Workflow | Location | Description |
| --- | --- | --- |
| PAV | Dockstore | Detects presence/absence variants (large insertions/deletions). |
| PAV2SVs | Dockstore | Converts PAV results to standard SV calls. |
| LRMergeSVVCFs | Dockstore | Merges multiple SV callsets into one. |
| TruvariCollapse | Dockstore | Collapses duplicate/equivalent SVs into consensus calls. |
| TruvariIntersample | Dockstore | Compares SVs between samples. |
| TruvariIntrasample | Dockstore | Compares SVs within a single sample. |
| SummarizePAVSVs | Dockstore | Summarizes PAV structural variant calls. |
| SummarizeSnifflesSVs | Dockstore | Summarizes SVs called by Sniffles. |
| GraphEvaluation | Dockstore | Builds and evaluates SV overlap graphs. |

4. Joint Calling & Cohort Integration

| Workflow | Location | Description |
| --- | --- | --- |
| JointCalling | Dockstore | Jointly genotypes SVs across a cohort. |
| LRJointCallGVCFs | Dockstore | Jointly genotypes GVCFs into a cohort-wide VCF. |
| MergeVCFs | Dockstore | Merges multiple VCFs into one. |
| MergePhasedVCF | Dockstore | Merges phased VCFs into one. |
| MergeRegenotypedIntersampleVcf | Dockstore | Merges per-sample re-genotyped VCFs into a cohort VCF. |
| MergeSVsSNPs | Dockstore | Combines SVs with SNVs/indels into one file. |
| OverlapGraph | Dockstore | Builds an overlap graph across callsets. |
| OverlapStats | Dockstore | Computes overlap statistics between callsets. |

5. Long-Read Phasing and Imputation

| Workflow | Location | Description |
| --- | --- | --- |
| PhysicalPhasing | Dockstore | Physically phases SNVs/indels and SVs in a single sample with HiPhase. |
| ChromosomePhasedPanelCreationFromHiPhase | Dockstore | Per chromosome, performs statistical phasing and imputation of SNVs/indels and SVs in a cohort with SHAPEIT4, removes colliding variants, and creates a pangenome bubble-graph reference panel. |
| ConcatAndEvaluate | Dockstore | Concatenates per-chromosome pangenome bubble-graph reference panels and runs leave-out and Vcfdist evaluations. |

6. Short-Read Genotyping, Phasing, and Imputation

| Workflow | Location | Description |
| --- | --- | --- |
| KAGEPanelWithPreprocessing | Dockstore | Per chromosome, creates a kmer index and count model for KAGE genotyping from a reference panel. |
| KAGECasePerChromosomeFlexscattered | Dockstore | Genotypes a single sample against a reference panel with KAGE. |
| GLIMPSEBatchedCasePerChromosomeSingleBatch | Dockstore | Performs phasing and imputation of a batch of genotyped samples against a reference panel with GLIMPSE. |
| HierarchicallyMergeVcfs | Dockstore | Hierarchically merges cohort VCFs using either bcftools or ivcfmerge. |
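
To illustrate what "hierarchical" merging means in general, here is a minimal Python sketch that plans merge rounds: inputs are grouped into batches, each batch is merged into one intermediate file, and the process repeats until a single file remains. This is not the code used by HierarchicallyMergeVcfs. The function name `plan_hierarchical_merge`, the `batch_size` parameter, and the intermediate file names are all hypothetical, and the actual merging (with bcftools or ivcfmerge) is left out.

```python
from typing import List


def plan_hierarchical_merge(paths: List[str], batch_size: int = 2) -> List[List[List[str]]]:
    """Plan merge rounds; each round is a list of batches merged independently.

    Illustrative only: names and batching policy are assumptions, not the
    workflow's actual behavior.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")

    rounds: List[List[List[str]]] = []
    current = list(paths)
    round_index = 0

    while len(current) > 1:
        # Split the current set of files into consecutive batches.
        batches = [current[i:i + batch_size] for i in range(0, len(current), batch_size)]
        rounds.append(batches)

        # A single-file batch is carried forward unchanged; larger batches
        # produce one intermediate file (hypothetical naming scheme).
        current = [
            batch[0] if len(batch) == 1 else f"round{round_index}_batch{j}.vcf.gz"
            for j, batch in enumerate(batches)
        ]
        round_index += 1

    return rounds


if __name__ == "__main__":
    for i, batches in enumerate(plan_hierarchical_merge(["a", "b", "c", "d", "e"])):
        print(f"round {i}: {batches}")
```

With batches of size `k`, roughly log base `k` of the input count rounds are needed, and the batches within a round can run in parallel.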

7. Quality Control & Fingerprinting

| Workflow | Location | Description |
| --- | --- | --- |
| CollectSingleSampleSVvcfMetrics | Dockstore | Computes SV metrics per sample. |
| LongReadsContaminationEstimation | Dockstore | Estimates contamination in long-read data. |
| BuildTempLocalFpStore | Dockstore | Builds a temporary fingerprint store for identity checks. |
| VerifyFingerprintCCSSample | Dockstore | Verifies CCS sample identity by fingerprinting. |
| SexCheck | Dockstore | Checks reported versus genetic sex. |
| MainVcfQc | Dockstore | Runs quality control checks on final VCFs. |

Jupyter Notebooks

This repository also contains Jupyter notebooks for data analysis and visualization, organized by platform:

Terra Notebooks (notebooks/terra/)

These notebooks are designed to run in the Terra cloud platform and focus on data processing, analysis, and quality control:

Data Import and Processing

| Notebook | Link | Description |
| --- | --- | --- |
| main_init_subset_vds.ipynb | GitHub | Initialize and subset Variant Dataset (VDS) for analysis |

Assembly Analysis

| Notebook | Link | Description |
| --- | --- | --- |
| kvg_examine_assemblies.ipynb | GitHub | Examine and analyze genome assemblies |
| kvg_study_read_length_dists.ipynb | GitHub | Study read length distributions from sequencing data |

Variant Analysis

| Notebook | Link | Description |
| --- | --- | --- |
| kvg_examine_small_variants.ipynb | GitHub | Analyze small variants (SNPs, indels) |
| kvg_examine_structural_variants.ipynb | GitHub | Examine structural variants (SVs) |
| kvg_sv_callset_inventory.ipynb | GitHub | Inventory and catalog structural variant callsets |
| kvg_describe_hail_matrix_tables.ipynb | GitHub | Describe Hail matrix tables for genomic data |

Population Genetics and Statistics

| Notebook | Link | Description |
| --- | --- | --- |
| kvg_pca.ipynb | GitHub | Principal Component Analysis for population structure |
| kvg_pca_hgdp_tgp.ipynb | GitHub | PCA incorporating HGDP and TGP reference populations |
| kvg_recompute_relatedness.ipynb | GitHub | Recompute relatedness estimates between samples |
| kvg_compute_sfs_grch38.ipynb | GitHub | Compute the Site Frequency Spectrum on the GRCh38 reference |
| kvg_firth_logistic_regression.ipynb | GitHub | Firth logistic regression analysis |
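
For readers unfamiliar with the site frequency spectrum (SFS), the following plain-Python sketch shows the basic idea: for each possible alternate-allele count, tally how many variant sites have that count. It computes an unfolded SFS from a list of per-site allele counts and is only an illustration. It is not taken from `kvg_compute_sfs_grch38.ipynb`, and the function name `site_frequency_spectrum` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def site_frequency_spectrum(alt_allele_counts: Iterable[int], n_chromosomes: int) -> Dict[int, int]:
    """Return {alt allele count: number of sites} for counts 1..n_chromosomes-1.

    Sites where the alternate allele is absent (0) or fixed (n_chromosomes)
    are not segregating and are skipped.
    """
    tallies: Counter = Counter()
    for ac in alt_allele_counts:
        if ac < 0 or ac > n_chromosomes:
            raise ValueError(f"allele count {ac} outside 0..{n_chromosomes}")
        if 0 < ac < n_chromosomes:
            tallies[ac] += 1

    # Report every class, including empty ones, so spectra are comparable.
    return {k: tallies.get(k, 0) for k in range(1, n_chromosomes)}


if __name__ == "__main__":
    # Seven sites observed across 4 chromosomes (2 diploid samples).
    print(site_frequency_spectrum([1, 1, 2, 0, 4, 3, 1], n_chromosomes=4))
```

In this toy input, three sites are singletons, one site has two copies of the alternate allele, and one has three; the absent and fixed sites are dropped.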

Phasing and Panel Analysis

| Notebook | Link | Description |
| --- | --- | --- |
| hangsu_hiphase_results.ipynb | GitHub | Analysis of HiPhase phasing results |

Quality Control

| Notebook | Link | Description |
| --- | --- | --- |
| ym_callset_QC_py.ipynb | GitHub | Python-based callset quality control |
| ym_callset_QC_R.ipynb | GitHub | R-based callset quality control |

Manuscript Figures and Tables

| Notebook | Link | Description |
| --- | --- | --- |
| main_figure_01_pca.ipynb | GitHub | Generate PCA figure for main manuscript |
| main_table_02_sv_summary.ipynb | GitHub | Generate structural variant summary table |
| main_table_02_variant_inventory.ipynb | GitHub | Generate variant inventory table |

Researcher Workbench Notebooks (notebooks/rw/)

These notebooks are designed to run in the All of Us Researcher Workbench and focus on manuscript figures, tables, and specialized analyses:

Manuscript Figures

| Notebook | Link | Description |
| --- | --- | --- |
| main_figure_01_length_distributions.ipynb | GitHub | Generate read and contig length distribution figures |
| main_figure_01_map.ipynb | GitHub | Generate map figure for manuscript |
| main_figure_01_omop.ipynb | GitHub | Generate OMOP-related figure |
| main_figure_01_pca.ipynb | GitHub | Generate PCA figure for main manuscript |
| supp_figure_01_assembly.ipynb | GitHub | Generate supplementary assembly figure |

Manuscript Tables

| Notebook | Link | Description |
| --- | --- | --- |
| main_table_01_dataset_summary.ipynb | GitHub | Generate dataset summary table |
| main_table_02_short_read_svs.ipynb | GitHub | Generate short-read structural variant table |
| main_table_02_variant_inventory.ipynb | GitHub | Generate variant inventory table |

Specialized Analyses

| Notebook | Link | Description |
| --- | --- | --- |
| init_subset_vds.ipynb | GitHub | Initialize and subset Variant Dataset |
| JW_CYP2D6.ipynb | GitHub | CYP2D6 gene analysis |
| JW_repeat_expansion_figures.ipynb | GitHub | Repeat expansion analysis and figures |
| kvg_firth_logistic_regression.ipynb | GitHub | Firth logistic regression analysis |
| kvg_pmi_skip_participants.ipynb | GitHub | PMI participant filtering analysis |
| LR_SV_disease_associations.ipynb | GitHub | SV-disease association analysis in 1,027 All of Us Phase 1 samples |

About

This repository shares code for the long-read analysis initiative.
