This is a tutorial for (1) automatically functionally annotating the variants of whole-genome/whole-exome sequencing (WGS/WES) studies and integrating the functional annotations with the genotype data using FAVORannotator and (2) performing single-/multi-trait association analysis of WGS/WES studies, summarizing and visualization results using STAARpipeline and STAARpipelineSummary. The software prerequisites, dependencies and installation can be found in STAARpipeline and STAARpipelineSummary packages.
FAVORannotator, STAARpipeline and STAARpipelineSummary are implemented as a collection of apps for cloud-based platforms. Please see the following apps
favorannotator (BDC-Seven Bridges, RAP-DNAnexus)
staarpipeline (BDC-Seven Bridges, RAP-DNAnexus)
staarpipelinesummary_varset (BDC-Seven Bridges, RAP-DNAnexus)
staarpipelinesummary_indvar (BDC-Seven Bridges, RAP-DNAnexus)
that run on the NIH/NHLBI BioData Catalyst (BDC) ecosystem and the UK Biobank Research Analysis Platform (RAP) for more details (user manual and tutorial).
R/Bioconductor package SeqArray provides functions to convert the genotype data (in VCF/BCF/PLINK BED/SNPRelate format) to SeqArray GDS format. For more details on usage, please see the R/Bioconductor package SeqArray [manual]. A wrapper for the seqVCF2GDS
/seqBCF2GDS
function in the SeqArray package can be found here (Credit: Michael R. Brown and Jennifer A. Brody).
R package gds2bgen provides functions to convert the genotype data (in BGEN format) to SeqArray GDS format. For more details on usage, please see the R package gds2bgen. An example for the seqBGEN2GDS
function in the gds2bgen package can be found here (Credit: Xiuwen Zheng).
Note 1: As a file integrity check, it is expected that variant in the GDS file can be uniquely identified based on its CHR-POS-REF-ALT combination. That is, there shouldn't be two variants in the GDS file with identical CHR-POS-REF-ALT records. It is also expected that the physical positions of variants in the GDS file (of each chromosome) should be sorted in ascending order.
Note 2: After the GDS file is generated, there is supposed to be a channel in the GDS file (default is annotation/filter
) where all variants passing the quality control (QC) should be labeled as "PASS"
. If there is no such channel for a given post-QC GDS file (where all variants in the GDS file are pass variants), one can create a new channel in the GDS file by setting the value of all variants as "PASS"
. An example script can be found here. Then, in all scripts of STAARpipeline, QC_label <- "annotation/filter"
should be updated to QC_label <- "annotation/info/QC_label"
.
FAVORannotator (CSV version 1.0.0) depends on the xsv software and the FAVOR database in CSV format. Please install the xsv software and download the FAVOR essential database CSV files from FAVOR website (under the "FAVORannotator" tab's top panel, 31.2 GB for chr1 CSV) or Harvard Dataverse before using FAVORannotator (CSV version 1.0.0).
The following steps are for the widely used operating system (Ubuntu) on a virtual machine.
- Install Rust and Cargo:
$ curl https://sh.rustup.rs -sSf | sh
- Source the environment:
$ source $HOME/.cargo/env
- Install xsv using Cargo:
$ cargo install xsv
Script: Varinfo_gds.R
Input: GDS files of each chromosome and the FAVOR database information FAVORdatabase_chrsplit.csv. For more details, please see the R script.
Output: CSV files of the variants list. For each chromosome, the number of CSV files is listed in FAVORdatabase_chrsplit.csv.
Note: The physical positions of variants in the GDS file (of each chromosome) should be sorted in ascending order.
Script: Annotate.R
Input: CSV files of the variants list to be annotated, the FAVOR database information FAVORdatabase_chrsplit.csv,
the FAVOR database, and the directory xsv software. For more details, please see the R script.
Anno_chrXX.csv
: a CSV file containing annotated variants list of chromosome XX.Anno_chrXX_STAARpipeline.csv
: a CSV file containing the variants list with annotations required for STAARpipeline of chromosome XX. The annotations in this file is a subset ofAnno_chrXX.csv
.
Script: gds2agds.R
Input: GDS files and the CSV files of annotated variants list (Anno_chrXX.csv
or Anno_chrXX_STAARpipeline.csv
). For more details, please see the R script.
Note: FAVORannotator also supports the database in SQL format. Please see the FAVORannotator tutorial for detailed usage of FAVORannotator (SQL version).
R package FastSparseGRM provides functions and a pipeline to efficiently calculate genetic principal components (PCs) and the ancestry-adjusted sparse genetic relatedness matrix (GRM). It accounts for population heterogeneity using genetic PCs which are automatically calculated as part of the pipeline. The genetic PCs can be used as fixed effect covariates to account for the population stratification and the sparse GRM can be used to model the random effects to account for the sample relatedness in a mixed effects phenotype-genotype association testing model implemented in STAARpipeline. For more details on usage, please see the R package FastSparseGRM.
Script: Association_Analysis_PreStep.r
agds_dir.Rdata
: a vector containing directory of GDS/aGDS files of all chromosomes.Annotation_name_catalog.Rdata
: a data frame containing the annotation name and the corresponding channel name in the aGDS file. Alternatively, one can skip this part in the R script by providingAnnotation_name_catalog.csv
with the same information. An example ofAnnotation_name_catalog.csv
can be found here.jobs_num.Rdata
: a data frame containing the number of jobs for association analysis, including individual analysis, sliding window analysis and dynamic window analysis (SCANG-STAAR).
Script: STAARpipeline_Null_Model.r or STAARpipeline_Null_Model_GENESIS.r or STAARpipeline_Null_Model_Multi.r
STAARpipeline_Null_Model.r
fits the STAAR null model using the STAARpipeline package.STAARpipeline_Null_Model_GENESIS.r
fits the null model using the GENESIS package and convert it to the STAAR null model using the STAARpipeline package.STAARpipeline_Null_Model_Multi.r
fits the MultiSTAAR null model using the STAARpipeline package.
Input: Phenotype data and (sparse) genetic relatedness matrix. For more details, please see the R scripts.
Note: Once the STAAR or MultiSTAAR null model is fit, all the remaining steps of STAARpipeline and STAARpipelineSummary share the same scripts (the information of single-trait or multi-trait analysis being considered is automatically retrieved from the null model object).
Perform single-variant analysis for common and low-frequency variants across the genome using the STAARpipeline package.
Input: aGDS files and the STAAR or MultiSTAAR null model. For more details, please see the R script.
The number of output files is the summation of the column "individual_analysis_num" for the object in jobs_num.Rdata
.
Perform gene-centric analysis for coding rare variants using the STAARpipeline package. The gene-centric coding analysis provides five functional categories to aggregate coding rare variants of each protein-coding gene: (1) putative loss of function (stop gain, stop loss, and splice) RVs, (2) missense RVs, (3) disruptive missense RVs, (4) putative loss of function and disruptive missense RVs, and (5) synonymous RVs.
STAARpipeline_Gene_Centric_Coding.r
performs gene-centric coding analysis for all protein-coding genes across the genome. There are 379 jobs using this script.STAARpipeline_Gene_Centric_Coding_Long_Masks.r
performs gene-centric coding analysis for some specific long masks, and might require larger memory compared toSTAARpipeline_Gene_Centric_Coding.r
. There are 2 jobs using this script.
Input: aGDS files and the STAAR or MultiSTAAR null model. For more details, please see the R scripts.
Script: STAARpipeline_Gene_Centric_Noncoding.r, STAARpipeline_Gene_Centric_Noncoding_Long_Masks.r, STAARpipeline_Gene_Centric_ncRNA.r and STAARpipeline_Gene_Centric_ncRNA_Long_Masks.r
Perform gene-centric analysis for noncoding rare variants using the STAARpipeline package. The gene-centric noncoding analysis provides eight functional categories of regulatory regions to aggregate noncoding rare variants: (1) promoter RVs overlaid with CAGE sites, (2) promoter RVs overlaid with DHS sites, (3) enhancer RVs overlaid with CAGE sites, (4) enhancer RVs overlaid with DHS sites, (5) untranslated region (UTR) RVs, (6) upstream region RVs, (7) downstream region RVs, and (8) noncoding RNA (ncRNA) RVs.
STAARpipeline_Gene_Centric_Noncoding.r
performs gene-centric noncoding analysis for all protein-coding genes across the genome. There are 379 jobs using this script.STAARpipeline_Gene_Centric_Noncoding_Long_Masks.r
performs gene-centric noncoding analysis for some specific long masks, and might require larger memory compared toSTAARpipeline_Gene_Centric_Noncoding.r
. There are 8 jobs using this script.STAARpipeline_Gene_Centric_ncRNA.r
performs gene-centric noncoding analysis for ncRNA genes across the genome. There are 222 jobs using this script.STAARpipeline_Gene_Centric_ncRNA_Long_Masks.r
performs gene-centric noncoding analysis for some specific long masks, and might require larger memory compared toSTAARpipeline_Gene_Centric_ncRNA.r
. There is 1 job using this script.
Input: aGDS files and the STAAR or MultiSTAAR null model. For more details, please see the R scripts.
Output: 387 Rdata files with the user-defined names for protein-coding genes and 223 Rdata files with the user-defined names for ncRNA genes.
Script: STAARpipeline_Sliding_Window.r
Perform sliding window analysis using the STAARpipeline package.
Input: aGDS files and the STAAR or MultiSTAAR null model. For more details, please see the R script.
The number of output files is the summation of the column "sliding_window_num" for the object in jobs_num.Rdata
.
Script: STAARpipeline_STAAR2SCANG.r
Generate the SCANG-STAAR null model using the STAAR null model.
Script: STAARpipeline_Dynamic_Window.r
Perform dynamic window analysis using the STAARpipeline package.
The number of output files is the summation of the column "scang_num" for the object in jobs_num.Rdata
.
Step 0 (Optional): Select independent variants from a known variants list to be used in conditional analysis
Perform LD pruning (stepwise selection) to select the subset of independent variants from a known variants list to be used in conditional analysis.
Input: aGDS files, a list of known variants (4-column "CHR-POS-REF-ALT" format), and the STAAR or MultiSTAAR null model.
STAARpipelineSummary_Known_Loci_Info.r extracts the information of CHR, POS, REF, and ALT from #rs. For more details, please see the R script.
STAARpipelineSummary_Known_Loci_Pruning_Combination.r combines chromosome-wide results into genome-wide.
Summarize single-variant analysis results and perform conditional analysis of unconditionally significant variants by adjusting a list of known variants.
Input: aGDS files, individual analysis results generated by STAARpipeline, the STAAR or MultiSTAAR null model, and a list of known variants. For more details, please see the R script.
Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant variants.
Note: STAARpipelineSummary_Known_Loci_Individual_Analysis_Pruning.r and STAARpipelineSummary_Known_Loci_Individual_Analysis_Pruning_Combination.r show an example to select independent variants from both the known variants in literature and significant single variants detected in individual analysis, which can be used for variant-set conditional analysis.
Summarize gene-centric coding analysis results and perform conditional analysis of unconditionally significant coding masks by adjusting a list of known variants.
Input: aGDS files, gene-centric coding analysis results generated by STAARpipeline, the STAAR or MultiSTAAR null model, and a list of known variants. For more details, please see the R script.
Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant coding masks.
Summarize gene-centric noncoding analysis results and perform conditional analysis of unconditionally significant noncoding masks by adjusting a list of known variants.
Input: aGDS files, gene-centric noncoding analysis results generated by STAARpipeline, the STAAR or MultiSTAAR null model, and a list of known variants. For more details, please see the R script.
Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant noncoding masks.
Summarize sliding window analysis results and perform conditional analysis of unconditionally significant genetic regions by adjusting a list of known variants.
Input: aGDS files, sliding window analysis results generated by STAARpipeline, the STAAR or MultiSTAAR null model, and a list of known variants. For details, see the R scripts.
Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant sliding windows.
Summarize dynamic window analysis results and perform conditional analysis of unconditionally significant genetic regions by adjusting a list of known variants.
Input: aGDS files, dynamic window analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.
Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant dynamic windows.
Functionally annotate a list of variants.
The list of variants could be the individual analysis results generated by STAARpipelineSummary.
Output: a Rdata file containing the input variants together with the corresponding functional annotations.
Functionally annotate rare variants of each of the input coding masks.
Output: For each input coding mask, the script outputs a Rdata file containing the rare variants and the corresponding functional annotations.
Functionally annotate rare variants of each of the input noncoding masks.
Output: For each input noncoding mask, the script outputs a Rdata file containing the rare variants and the corresponding functional annotations.
Functionally annotate rare variants of each of the input genetic regions.