This repository contains data and scripts for Alibutud, Hansali, Cao, Zhou, Mahaganapathy, Azaro, Gwin, Wilson, Buyske, Bartlett, Flax, Brzustowicz, and Xing (2023). Structural variations contribute to the genetic etiology of autism spectrum disorder and language impairments. Int. J. Mol. Sci. 24(17), 13248.
The project is divided into two pipelines: CNV and gSV/MEI. The gSV/MEI pipeline additionally incorporates results from the AF and eQTL analyses.

         CNV                   gSV/MEI
    ┌────────────┐         ┌─────────────┐
    │            │         │AF Pipeline  │
    │CNV Pipeline│         │             │
    │            │         │eQTL Pipeline│
    └─────┬──────┘         └──────┬──────┘
          │                       │
          │   ┌───────────────┐   │
          └───┤Candidate genes├───┘
              └───────┬───────┘
                      │
                 ┌────┴───┐
                 │Analysis│
                 └────────┘

The following data must be supplied to the pipeline from external sources rather than generated by the pipeline itself.
Builder phase:
- starting_data_files: microarray files from Mahaganapathy
- NJLAGS_CNV.ped: pedigree file of NJLAGS cohort
- ASD_only.ped: pedigree file for ASD_only phenotype
- ASD_LI.ped: pedigree file for ASD_LI phenotype
- ASD_RI.ped: pedigree file for ASD_RI phenotype
- human_g1k_v37.fasta: used for conversion to VCF
Prioritization phase:
- NDD_genes.txt: neurodevelopmental disorder genes, from SFARI
- dispensable_genes_and_muc: genes to exclude, from Rausell et al.
- GTEx_2019_12_12.txt: gene expression file from GTEx
- tpm1.txt/tpm2.txt/tpm3.txt: gene expression files from other sources
- syndromes.bed: known ASD-related syndromes, from SFARI
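
Because these inputs come from outside the pipeline, it can help to confirm they are present before starting a run. The following is a minimal sketch, not part of the repository: the helper `missing_inputs` is hypothetical, and it assumes all of the files listed above sit directly under a single folder, which may not match the actual layout.

```python
from pathlib import Path

# File names are taken from the lists above. Keeping them all directly
# under one folder is an assumption made for this sketch.
BUILDER_INPUTS = [
    "starting_data_files",
    "NJLAGS_CNV.ped",
    "ASD_only.ped",
    "ASD_LI.ped",
    "ASD_RI.ped",
    "human_g1k_v37.fasta",
]
PRIORITIZER_INPUTS = [
    "NDD_genes.txt",
    "dispensable_genes_and_muc",
    "GTEx_2019_12_12.txt",
    "tpm1.txt",
    "tpm2.txt",
    "tpm3.txt",
    "syndromes.bed",
]


def missing_inputs(folder, names):
    """Return the entries of `names` that do not exist under `folder`."""
    root = Path(folder)
    return [name for name in names if not (root / name).exists()]
```

Calling `missing_inputs("PROJECT FOLDER FILEPATH", BUILDER_INPUTS)` returns an empty list when every Builder-phase input is in place, and otherwise lists the names that still need to be supplied.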
The CNV_Master.py script runs the entire pipeline from start to finish in three phases, calling the corresponding Python script for each:
- CNV_Builder.py
  - Merge CNV files across batches and callers
  - Convert to VCF format for use in Annotation
- CNV_Annotator.py
  - QC filtration to remove outliers
  - Running AnnotSV annotation on VCF
  - Running GEMINI-based Python script “Geminesque.py” on annotated VCF
  - Filtering on segregation analysis
- CNV_Prioritizer.py
  - Prioritizing candidate variants on dispensability, coding region overlap, gene expression, internal cohort frequency
  - Calling StrVCTVRE to predict pathogenicity
  - Calling SvAnna to predict pathogenicity
Additionally, the CNV_Summarizer.py script creates summary tables for analysis.
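
The three-phase structure described above can be sketched as a short driver that runs each phase script in order and stops at the first failure. This is an illustration only, not the contents of CNV_Master.py: the functions `build_commands` and `run_phases` are hypothetical, and passing the project folder and phenotype as two positional arguments is an assumption rather than the scripts' documented interface.

```python
import subprocess
import sys

# Script names come from the list above. The argument convention
# (project folder, then phenotype) is an assumption of this sketch.
PHASE_SCRIPTS = ["CNV_Builder.py", "CNV_Annotator.py", "CNV_Prioritizer.py"]


def build_commands(projects_folder, phenotype, scripts=PHASE_SCRIPTS):
    """Return one command (as an argument list) per phase, in run order."""
    return [[sys.executable, script, projects_folder, phenotype]
            for script in scripts]


def run_phases(commands, runner=subprocess.run):
    """Run each command in order; a failing phase aborts the remaining ones."""
    for command in commands:
        print("Running: " + " ".join(command))
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status.
        runner(command, check=True)
```

Running the phases strictly in sequence matters here because each phase consumes the output of the previous one: annotation needs the merged VCF, and prioritization needs the annotated, segregation-filtered variants.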
CNV_Master.py already contains all the commands for the constituent scripts. Only two variables need to be set: the project folder where the pipeline is run and the proband phenotype of interest.
# DECLARE GLOBAL VARIABLES
projects_folder = "PROJECT FOLDER FILEPATH"
phenotype = "PHENOTYPE" # set phenotype
print("Projects folder located at: " + projects_folder)

In order to run the gSV pipeline, the following scripts in the /doc/ folder must be run sequentially:
0_merging_callsets.sh
1_annotation.sh
2_inh_segregation.sh
3_adding_datasets.sh
4_svanna_psv_score.sh
5_universal_filtering.sh
6_results.sh
7_tables_and_figures.sh

This directory contains three folders with the scripts used to generate the figures and tables for this project. Two of them contain the scripts for the CNV and gSV pipeline figures/tables respectively, while the third contains the scripts for generating the protein-protein interaction network used in Figure 5.
- Starting_Data_Scripts.py - Used to build Table 1 and Table 2
- CNV_Summarizer.py - Used to build Table 5
- CNV_Plots.py - Used to plot Figure S1(A)
- CNV_Enrichment.py - Used to build table for use in Figure S1(B)
- BoxPlotEnrichment.R - Used to plot Figure S1(B)
- 7_tables_and_figures.sh - Used to generate the tables and figures
- GenerateGeneLevelSummaryTable_CNV.py - Used to build CNV input table for Table 4
- GenerateGeneLevelSummaryTable_MEISV.py - Used to build gSV input table for Table 4
- GenerateGeneLevelSummaryTable.py - Used to build Table 4
- GenerateSizeDistributionChartsByMEI.py - Used to plot Figure S2
This folder contains the files and scripts used to generate Figure 5. Pathway data was drawn from ConsensusPathDB, and input genes from the results of the combined CNV/gSV pipeline.
- Network_gene.py - Used to build gene table referenced by Network_nodes_edges.py
- Network_nodes_edges.py - Used to build table defining network structure
- Network_plot.R - Used to plot PPI Network