This repository contains all of the reproducible WDL workflows used in Phase 1 of the All of Us Long Read (AoU-LR) project. These workflows cover various steps of long-read genomic analysis and are provided for transparency and reuse.
Please note: this code's organization is in flux.
| Workflow | Location | Description |
|---|---|---|
| HiFiBamToFastQ | Dockstore | Converts PacBio HiFi BAM files to FASTQ format for downstream analysis. |
| MergeFastqs | Dockstore | Merges multiple FASTQ files into a single set per sample. |
| PBAssembleWithHifiasm | Dockstore | Assembles PacBio HiFi reads using Hifiasm. |
| MapAssemblyContigs | Dockstore | Aligns assembly contigs to a reference to validate assemblies or identify SVs. |
| EvaluateAssemblyHap | Dockstore | Assesses haplotype-resolved assemblies against reference sequences. |
| Workflow | Location | Description |
|---|---|---|
| 1074.T2T.SmallVariantsBasicMetrics | Dockstore | Computes basic metrics for small variants on the T2T reference. |
| PBCCSWholeGenome | Dockstore | Calls small variants genome-wide from PacBio CCS reads. |
| SummarizeDVPSmallVariants | Dockstore | Summarizes small variants from the DeepVariant+Pepper pipeline. |
| SummarizePAVSmallVariants | Dockstore | Summarizes small variants discovered in PAV contexts. |
| Workflow | Location | Description |
|---|---|---|
| PAV | Dockstore | Detects presence/absence variants (large insertions/deletions). |
| PAV2SVs | Dockstore | Converts PAV results to standard SV calls. |
| LRMergeSVVCFs | Dockstore | Merges multiple SV callsets into one. |
| TruvariCollapse | Dockstore | Collapses duplicate/equivalent SVs into consensus calls. |
| TruvariIntersample | Dockstore | Compares SVs between samples. |
| TruvariIntrasample | Dockstore | Compares SVs within a single sample. |
| SummarizePAVSVs | Dockstore | Summarizes PAV structural variant calls. |
| SummarizeSnifflesSVs | Dockstore | Summarizes SVs called by Sniffles. |
| GraphEvaluation | Dockstore | Builds and evaluates SV overlap graphs. |
| Workflow | Location | Description |
|---|---|---|
| JointCalling | Dockstore | Jointly genotypes SVs across a cohort. |
| LRJointCallGVCFs | Dockstore | Jointly genotypes GVCFs into a cohort-wide VCF. |
| MergeVCFs | Dockstore | Merges multiple VCFs into one. |
| MergePhasedVCF | Dockstore | Merges phased VCFs into one. |
| MergeRegenotypedIntersampleVcf | Dockstore | Merges per-sample re-genotyped VCFs into a cohort VCF. |
| MergeSVsSNPs | Dockstore | Combines SVs with SNVs/indels into one file. |
| OverlapGraph | Dockstore | Builds an overlap graph across callsets. |
| OverlapStats | Dockstore | Computes overlap statistics between callsets. |
| Workflow | Location | Description |
|---|---|---|
| PhysicalPhasing | Dockstore | Physically phases SNVs/indels and SVs in a single sample with HiPhase. |
| ChromosomePhasedPanelCreationFromHiPhase | Dockstore | Per chromosome, performs statistical phasing and imputation of SNVs/indels and SVs in a cohort with SHAPEIT4, removes colliding variants, and creates a pangenome bubble-graph reference panel. |
| ConcatAndEvaluate | Dockstore | Concatenates per-chromosome pangenome bubble-graph reference panels and runs leave-out and vcfdist evaluations. |
| Workflow | Location | Description |
|---|---|---|
| KAGEPanelWithPreprocessing | Dockstore | Per chromosome, creates a k-mer index and count model for KAGE genotyping from a reference panel. |
| KAGECasePerChromosomeFlexscattered | Dockstore | Genotypes a single sample against a reference panel with KAGE. |
| GLIMPSEBatchedCasePerChromosomeSingleBatch | Dockstore | Performs phasing and imputation of a batch of genotyped samples against a reference panel with GLIMPSE. |
| HierarchicallyMergeVcfs | Dockstore | Hierarchically merges cohort VCFs using either bcftools or ivcfmerge. |
| Workflow | Location | Description |
|---|---|---|
| CollectSingleSampleSVvcfMetrics | Dockstore | Computes SV metrics per sample. |
| LongReadsContaminationEstimation | Dockstore | Estimates contamination in long-read data. |
| BuildTempLocalFpStore | Dockstore | Builds a temporary fingerprint store for identity checks. |
| VerifyFingerprintCCSSample | Dockstore | Verifies CCS sample identity by fingerprinting. |
| SexCheck | Dockstore | Checks reported sex against genetically inferred sex. |
| MainVcfQc | Dockstore | Runs quality control checks on final VCFs. |
This repository also contains Jupyter notebooks for data analysis and visualization, organized by platform:
These notebooks are designed to run in the Terra cloud platform and focus on data processing, analysis, and quality control:
| Notebook | Link | Description |
|---|---|---|
| main_init_subset_vds.ipynb | GitHub | Initialize and subset Variant Dataset (VDS) for analysis |
| Notebook | Link | Description |
|---|---|---|
| kvg_examine_assemblies.ipynb | GitHub | Examine and analyze genome assemblies |
| kvg_study_read_length_dists.ipynb | GitHub | Study read length distributions from sequencing data |
| Notebook | Link | Description |
|---|---|---|
| kvg_examine_small_variants.ipynb | GitHub | Analyze small variants (SNPs, indels) |
| kvg_examine_structural_variants.ipynb | GitHub | Examine structural variants (SVs) |
| kvg_sv_callset_inventory.ipynb | GitHub | Inventory and catalog structural variant callsets |
| kvg_describe_hail_matrix_tables.ipynb | GitHub | Describe Hail matrix tables for genomic data |
| Notebook | Link | Description |
|---|---|---|
| kvg_pca.ipynb | GitHub | Principal Component Analysis for population structure |
| kvg_pca_hgdp_tgp.ipynb | GitHub | PCA incorporating HGDP and TGP reference populations |
| kvg_recompute_relatedness.ipynb | GitHub | Recompute relatedness estimates between samples |
| kvg_compute_sfs_grch38.ipynb | GitHub | Compute Site Frequency Spectrum on GRCh38 reference |
| kvg_firth_logistic_regression.ipynb | GitHub | Firth logistic regression analysis |
| Notebook | Link | Description |
|---|---|---|
| hangsu_hiphase_results.ipynb | GitHub | Analysis of HiPhase phasing results |
| Notebook | Link | Description |
|---|---|---|
| ym_callset_QC_py.ipynb | GitHub | Python-based callset quality control |
| ym_callset_QC_R.ipynb | GitHub | R-based callset quality control |
| Notebook | Link | Description |
|---|---|---|
| main_figure_01_pca.ipynb | GitHub | Generate PCA figure for main manuscript |
| main_table_02_sv_summary.ipynb | GitHub | Generate structural variant summary table |
| main_table_02_variant_inventory.ipynb | GitHub | Generate variant inventory table |
These notebooks are designed to run in the All of Us Researcher Workbench and focus on manuscript figures, tables, and specialized analyses:
| Notebook | Link | Description |
|---|---|---|
| main_figure_01_length_distributions.ipynb | GitHub | Generate read and contig length distribution figures |
| main_figure_01_map.ipynb | GitHub | Generate map figure for manuscript |
| main_figure_01_omop.ipynb | GitHub | Generate OMOP-related figure |
| main_figure_01_pca.ipynb | GitHub | Generate PCA figure for main manuscript |
| supp_figure_01_assembly.ipynb | GitHub | Generate supplementary assembly figure |
| Notebook | Link | Description |
|---|---|---|
| main_table_01_dataset_summary.ipynb | GitHub | Generate dataset summary table |
| main_table_02_short_read_svs.ipynb | GitHub | Generate short-read structural variant table |
| main_table_02_variant_inventory.ipynb | GitHub | Generate variant inventory table |
| Notebook | Link | Description |
|---|---|---|
| init_subset_vds.ipynb | GitHub | Initialize and subset Variant Dataset |
| JW_CYP2D6.ipynb | GitHub | CYP2D6 gene analysis |
| JW_repeat_expansion_figures.ipynb | GitHub | Repeat expansion analysis and figures |
| kvg_firth_logistic_regression.ipynb | GitHub | Firth logistic regression analysis |
| kvg_pmi_skip_participants.ipynb | GitHub | PMI participant filtering analysis |
| LR_SV_disease_associations.ipynb | GitHub | SV-disease association analysis in 1,027 All of Us Phase 1 samples |