DosaCNV is a deep multiple instance learning model that jointly predicts the pathogenicity of coding deletions and the haploinsufficiency of genes. Unlike previous approaches, which treated these two levels of prediction independently, DosaCNV connects them in a biologically coherent manner, making it possible to evaluate how haploinsufficient (HI) genes contribute to deletion pathogenicity. In addition, combining DosaCNV-HI with Deep SHAP, a tool for interpreting deep learning models, allows the predicted haploinsufficiency of a gene to be attributed to its gene-level features. Please read the DosaCNV manuscript for more details.
DosaCNV is implemented in Python 3 with TensorFlow 2, NumPy, pandas, and SHAP. It has been extensively tested in the following environment:
- Python 3.10.12
- TensorFlow 2.12.0
- SHAP 0.41.0
The 'precomputed_gene_score.csv' file contains precomputed DosaCNV-HI scores for 20,268 protein-coding genes. These scores are the gene haploinsufficiency predictions described in the DosaCNV manuscript; each score indicates how likely the loss of one copy of the gene is to cause disease.
Model input is read from an index file. Each row of this file gives a variant ID, the Ensembl gene ID of one affected gene, and the variant type, so a deletion that affects several genes spans several rows. The index file can be generated by passing deletion coordinates (build GRCh37/hg19) in BED format to the 'map_gene.sh' script.
The BED file for deletion coordinates should be formatted with either:
- Four columns ('chromosome', 'start', 'end', 'variant_id') for prediction purposes.
- Five columns ('chromosome', 'start', 'end', 'variant_id', 'binary label for pathogenicity') for subsequent model training.
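The two accepted layouts can be checked before running 'map_gene.sh'. The following is a minimal sketch, not part of DosaCNV: the function name `read_deletion_bed` is hypothetical, the parser assumes whitespace-separated fields, and it assumes the binary label is written as 0 or 1.

```python
import io


def read_deletion_bed(handle):
    """Parse deletion records from a BED-like text stream.

    Accepts 4 columns (prediction) or 5 columns (training).
    Assumption: the fifth column, when present, is the label "0" or "1".
    """
    records = []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()  # assumption: tab- or space-separated
        if len(fields) not in (4, 5):
            raise ValueError(
                f"line {line_no}: expected 4 or 5 columns, got {len(fields)}"
            )
        chrom, start, end, variant_id = fields[:4]
        start, end = int(start), int(end)
        if start >= end:
            raise ValueError(f"line {line_no}: start must be smaller than end")
        label = None
        if len(fields) == 5:
            if fields[4] not in ("0", "1"):
                raise ValueError(f"line {line_no}: label must be 0 or 1")
            label = int(fields[4])
        records.append(
            {"chrom": chrom, "start": start, "end": end,
             "variant_id": variant_id, "label": label}
        )
    return records


# The two deletions from the example BED file in this README.
example = io.StringIO(
    "chr14 21692670 21923767 nssv15119821\n"
    "chr17 18934579 19051450 nssv15120400\n"
)
records = read_deletion_bed(example)
for record in records:
    print(record)
```

Records parsed from a four-column file carry `label=None`, which makes it easy to tell prediction input from training input downstream.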
The 'map_gene.sh' script associates a gene with a deletion if the deletion overlaps the gene's coding sequence (CDS) by at least 1 bp.
bash map_gene.sh -i <BED format input file> -m <maximum number of genes that can be considered>
-m
Specifies the maximum number of genes that can be considered per deletion. Deletions that affect more genes than this number will be discarded. The default is set to 100.
- The output index file will be saved to the same directory with the '_index' suffix.
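The mapping rule and the '-m' filter can be illustrated in a few lines of Python. This is a sketch of the logic only, not the implementation of 'map_gene.sh': the function names, the toy coordinates, and the gene names are hypothetical, and intervals are assumed to be half-open (BED-style), so two intervals that merely touch do not overlap.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def map_genes(deletions, cds_intervals, max_genes=100):
    """Map each deletion to the genes whose CDS it overlaps by >= 1 bp.

    deletions:     list of (chrom, start, end, variant_id)
    cds_intervals: list of (chrom, start, end, gene_id)
    Deletions affecting more than max_genes genes are discarded.
    Assumption: deletions overlapping no CDS are also left out.
    """
    mapping = {}
    for chrom, start, end, variant_id in deletions:
        genes = {
            gene_id
            for c, s, e, gene_id in cds_intervals
            if c == chrom and overlap_bp(start, end, s, e) >= 1
        }
        if genes and len(genes) <= max_genes:
            mapping[variant_id] = sorted(genes)
    return mapping


# Toy data (hypothetical coordinates and gene names).
deletions = [
    ("chr1", 100, 200, "del_a"),   # overlaps GENE1 by exactly 1 bp
    ("chr1", 500, 600, "del_b"),   # contains the CDS of GENE3
    ("chr1", 0, 1000, "del_c"),    # overlaps all three genes
]
cds_intervals = [
    ("chr1", 199, 250, "GENE1"),
    ("chr1", 200, 300, "GENE2"),   # only touches del_a, so no overlap
    ("chr1", 550, 560, "GENE3"),
]

mapping = map_genes(deletions, cds_intervals, max_genes=2)
print(mapping)
```

With `max_genes=2`, `del_c` is dropped because it affects three genes, which mirrors how deletions above the '-m' limit are discarded.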
cat example_deletion.bed
chr14 21692670 21923767 nssv15119821
chr17 18934579 19051450 nssv15120400
bash map_gene.sh -i example_deletion.bed
cat example_deletion_index.txt
variant_id,gene_id,type
nssv15119821,ENSG00000092199,DEL
nssv15119821,ENSG00000092200,DEL
nssv15119821,ENSG00000092201,DEL
nssv15119821,ENSG00000100888,DEL
nssv15120400,ENSG00000154016,DEL
nssv15120400,ENSG00000189152,DEL
nssv15120400,ENSG00000214844,DEL
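Because the index file holds one row per deletion-gene pair, each deletion can be viewed as a set of genes, which fits the multiple instance learning framing described at the top of this README. The sketch below groups the rows of the example index file by variant; the helper name `genes_per_variant` is hypothetical, and the snippet is illustrative rather than the loader DosaCNV itself uses.

```python
import csv
import io
from collections import defaultdict

# Contents of example_deletion_index.txt as shown above.
index_text = """variant_id,gene_id,type
nssv15119821,ENSG00000092199,DEL
nssv15119821,ENSG00000092200,DEL
nssv15119821,ENSG00000092201,DEL
nssv15119821,ENSG00000100888,DEL
nssv15120400,ENSG00000154016,DEL
nssv15120400,ENSG00000189152,DEL
nssv15120400,ENSG00000214844,DEL
"""


def genes_per_variant(handle):
    """Group gene IDs by variant ID, keeping the file's row order."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["variant_id"]].append(row["gene_id"])
    return dict(groups)


groups = genes_per_variant(io.StringIO(index_text))
for variant_id, genes in groups.items():
    print(variant_id, len(genes), genes)
```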
To execute DosaCNV, use the following command:
python DosaCNV.py [OPTIONS]
-input_data
Path to the deletion input index file for prediction. The index file should be generated by passing deletion coordinates in BED format to 'map_gene.sh' script (refer to section 'Preprocessing').
-annotation
Path to annotation file containing gene-level features. By default, the './annotation/anno_for_deletion.txt' is used.
-saved_model_name
Name of the pre-saved variant-level model for loading. By default, './saved_model/pretrained_deletion' is used.
-output_name
Name for the output file. The output file will be saved to the './variant_score/' directory with the '_score' suffix.
python DosaCNV.py predict-deletion -input_data example_deletion_index.txt \
-output_name example_deletion
cat ./variant_score/example_deletion_score.tsv
variant_id DosaCNV_score
nssv15119821 0.9763802
nssv15120400 0.0608029
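The score file is small enough to post-process with the standard library. The following sketch reads the two-column output shown above and lists deletions from highest to lowest score; the helper name `load_scores` is hypothetical, and the parser assumes whitespace-separated columns with the header shown in the example.

```python
import io

# Contents of example_deletion_score.tsv as shown above.
score_text = (
    "variant_id\tDosaCNV_score\n"
    "nssv15119821\t0.9763802\n"
    "nssv15120400\t0.0608029\n"
)


def load_scores(handle):
    """Return {variant_id: score} from a two-column score file."""
    header = handle.readline().split()
    if header != ["variant_id", "DosaCNV_score"]:
        raise ValueError(f"unexpected header: {header}")
    scores = {}
    for line in handle:
        if line.strip():
            variant_id, score = line.split()
            scores[variant_id] = float(score)
    return scores


scores = load_scores(io.StringIO(score_text))
# Sort deletions from highest to lowest predicted score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for variant_id, score in ranked:
    print(f"{variant_id}\t{score:.4f}")
```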
-input_data
Path to the gene input file for prediction. The file should mirror the format of, or be a subset of, the annotation file used to train the pre-saved gene-level model.
-saved_model_name
Name of the pre-saved gene-level model for loading. By default, './saved_model/pretrained_gene' is used.
-output_name
Name for the output file. The output file will be saved to the './gene_score/' directory with the '_score' suffix.
cat example_gene.txt
gene_id,feat1,feat2,...
ENSG00000000419,-0.68327674,-0.31548782,...
ENSG00000000457,0.08731493,0.15561921,...
ENSG00000000460,-0.68372146,-0.16481838,...
ENSG00000000938,0.64214013,1.55037751,...
ENSG00000000971,1.67342971,0.16527343,...
python DosaCNV.py predict-gene -input_data example_gene.txt \
-output_name example_gene
cat ./gene_score/example_gene_score.tsv
gene_id DosaCNV_HI_score
ENSG00000000419 0.09212542
ENSG00000000457 0.017763102
ENSG00000000460 0.030589996
ENSG00000000938 0.33497918
ENSG00000000971 0.1283754
-explain_set
Path to the gene input file for SHAP explanation. The file should mirror the format of, or be a subset of, the annotation file used to train the pre-saved gene-level model.
-background_set
Path to the input file for background genes. The file should mirror the format of, or be a subset of, the annotation file used to train the pre-saved gene-level model. Defaults to './data/1000_hs_gene.txt', a subset of 'anno_for_gene.txt' with 1000 randomly sampled haplosufficient genes.
-saved_model_name
Name of the pre-saved gene-level model for loading. By default, 'pretrained_gene' in './saved_model/' is used.
-output_name
Name for the output file. The output file will be saved to the './shap_value/' directory with the '_shap' suffix.
python DosaCNV.py explain-gene -explain_set example_gene.txt \
-output_name example_gene
cat ./shap_value/example_gene_shap.csv
gene_id,feat1_shap,feat2_shap,...
ENSG00000000419,-0.01675624,0.003948918,...
ENSG00000000457,-0.004276398,-0.001805387,...
ENSG00000000460,-0.009483201,0.002284174,...
ENSG00000000938,0.022038622,-0.014187155,...
ENSG00000000971,0.042097561,-0.004905279,...
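A common way to read such a table is to rank, for each gene, the features by the absolute value of their SHAP values, since the sign gives the direction of the contribution and the magnitude gives its size. The sketch below does this for a few of the rows shown above. It is illustrative only: the helper name `top_features` is hypothetical, and only the two columns visible in the example ('feat1_shap', 'feat2_shap') are used, whereas the real file contains one column per gene-level feature.

```python
import csv
import io

# Three rows from the example above, restricted to the two visible columns.
shap_text = """gene_id,feat1_shap,feat2_shap
ENSG00000000419,-0.01675624,0.003948918
ENSG00000000457,-0.004276398,-0.001805387
ENSG00000000938,0.022038622,-0.014187155
"""


def top_features(handle, k=1):
    """For each gene, return the k features with the largest |SHAP value|."""
    result = {}
    for row in csv.DictReader(handle):
        gene_id = row.pop("gene_id")
        ranked = sorted(
            row.items(), key=lambda item: abs(float(item[1])), reverse=True
        )
        result[gene_id] = [(name, float(value)) for name, value in ranked[:k]]
    return result


top = top_features(io.StringIO(shap_text), k=1)
for gene_id, features in top.items():
    print(gene_id, features)
```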
-train_data
Path to index file for training data. The index file should be generated by passing training deletion coordinates in BED format to 'map_gene.sh' script (refer to section 'Preprocessing').
-val_data
Path to index file for validation data. The index file should be generated by passing validation deletion coordinates in BED format to 'map_gene.sh' script (refer to section 'Preprocessing').
-annotation
Path to annotation file containing gene-level features.
-output_model_name
Name for the output model. The resulting variant-level model will be stored in the './saved_model/' directory with the '_deletion' suffix.
-save_gene_model
Optional. Include this flag to save the gene-level model (DosaCNV-HI). The resulting model will be stored in the './saved_model/' directory with the '_gene' suffix.
-seed
Random seed for model initialization.
-max_n_gene
The maximum number of genes that can be considered in a deletion. The default is set to 100.
-lr
Learning rate for Adam optimizer.
python DosaCNV.py train -train_data training_deletion_index.txt \
-val_data validation_deletion_index.txt \
-annotation anno_for_deletion.txt/anno_for_gene.txt \
-output_model_name pretrained_deletion/pretrained_gene \
-save_gene_model