DNA methylation (DNAm) is linked to many diseases, such as cancer and autism. However, DNAm marks can change with environmental stimuli, cell type, sex, and other factors. In recent years, several DNAm studies have suggested that a large portion of DNAm variability is heritable and associated with genetic ancestry, making genetic ancestry a potential confounding factor that is not given enough consideration in DNA methylation analysis. Differentially methylated CpG sites associated with pathology can be confounded by CpGs associated with genetic ancestry, causing spurious results. Therefore, to investigate how DNA methylation affects prenatal health, it is important to identify ancestry-associated CpGs so that true positives can be distinguished from ancestry effects. DNAm variability in the placenta due to genetic ancestry needs to be accounted for in large-scale DNAm studies; otherwise, results cannot be meaningfully interpreted to assess prenatal health. In this project, we investigate whether DNA in placental tissue is differentially methylated across populations of different ancestry.
Hypothesis: DNA in placental tissue is differentially methylated across populations of different ancestries.
We will first identify methylation profiles in subjects from dataset 1, for which the genetic ancestry (Asian or Caucasian) of each subject is known. These profiles will then serve as a basis for clustering the methylation data in dataset 2, for which genetic ancestry is not known. For both datasets, DNAm was measured with the Illumina 450K microarray.
For details of the project ideas, datasets, and methods used in this project, please see our project proposal.
Name | Department/Program | GitHub ID |
---|---|---|
Victor Yuan | Genome Science and Technology | @wvictor14 |
Michael Yuen | Medical Genetics | @myuen89 |
Nivretta Thatra | Bioinformatics | @nivretta |
Ming Wan | Statistics | @MingWan10 |
Anni Zhang | Genome Science and Technology | @annizubc |
This figure summarizes the workflow of our project:
For all of the processing steps below, please see the Results folder for a more detailed write-up of the findings from our analyses. If you're interested in the code, see the markdown files in the Scripts folder.
We first used this script to process the raw data of dataset 1 (via quality control, filtering, and normalization) into our processed data. For detailed information on dataset 1, please see Metadata.
We explored our data by generating sample-sample correlation heatmaps, plotting a few random CpGs and plotting the first few principal components.
We used the R package limma to identify differentially methylated probes between Asian and Caucasian samples. Please see our differential DNA methylation analysis script in the scripts folder for the code and details. At a p-value cutoff of 0.01, limma prioritized 13 CpG sites that are differentially methylated between Caucasian and Asian genetic ancestry.
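As a rough illustration of the idea of testing each CpG for a group difference and keeping those below a p-value cutoff, here is a minimal Python sketch. It is not the project's method: the project used limma's moderated statistics in R, whereas this sketch swaps in a simple permutation test on group means. The conversion of beta values to M-values, the function names, the CpG IDs, and the toy data are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def beta_to_m(beta):
    """Convert a methylation beta value (between 0 and 1) to an M-value (log2 odds)."""
    return math.log2(beta / (1.0 - beta))


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def differentially_methylated(betas, labels, group_a, group_b, cutoff=0.01):
    """Return {cpg_id: p_value} for CpGs whose p-value is below the cutoff."""
    selected = {}
    for cpg, values in betas.items():
        m_values = [beta_to_m(b) for b in values]
        a = [v for v, lab in zip(m_values, labels) if lab == group_a]
        b = [v for v, lab in zip(m_values, labels) if lab == group_b]
        p = permutation_p_value(a, b)
        if p < cutoff:
            selected[cpg] = p
    return selected


# Toy data (hypothetical): 8 samples per group, two made-up CpG IDs.
labels = ["Asian"] * 8 + ["Caucasian"] * 8
betas = {
    "cpg_diff": [0.20, 0.22, 0.18, 0.21, 0.19, 0.23, 0.17, 0.20,
                 0.80, 0.78, 0.82, 0.79, 0.81, 0.77, 0.83, 0.80],
    "cpg_null": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47] * 2,
}
selected = differentially_methylated(betas, labels, "Asian", "Caucasian")
print(sorted(selected))
```

Unlike this sketch, limma borrows information across probes when estimating variances, which matters with small sample sizes, so the two approaches will not generally select the same CpGs.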
To build the DNA methylation ancestry classifier, we compared SVM and elastic net logistic regression (glmnet) models. We chose glmnet for the final model and used a nested cross-validation strategy both to tune the penalization parameters and to estimate the test error. After generating the final model, we analyzed the predictors and examined the predictions on the secondary, unlabelled dataset. Please see the subdirectory predictive modeling for the markdown files and details.
We examined the 13 CpG sites prioritized by limma and the 11 CpG sites prioritized by glmnet in this script. Using the COHCAP (City of Hope CpG Island Analysis Pipeline) package, we mapped the CpGs to chromosome, location, gene name, and CpG island information. Each gene was annotated with its GO terms using the mygene package.
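The structure of nested cross-validation can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's R code: `k_fold_indices`, `nested_cv`, and the `score_fn` callback are hypothetical names, and the toy scoring function is rigged to prefer one parameter setting rather than fitting a real model. The inner loop tunes the parameters using only the outer-training samples, and the outer loop scores the tuned parameters on samples that played no part in tuning.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv(n_samples, param_grid, score_fn, outer_k=5, inner_k=3):
    """Return one (best_params, outer_test_score) pair per outer fold.

    score_fn(train_idx, test_idx, params) is a hypothetical callback that
    fits a model on train_idx and returns a score on test_idx (higher is better).
    """
    results = []
    outer_folds = k_fold_indices(n_samples, outer_k)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [j for f, fold in enumerate(outer_folds)
                       if f != i for j in fold]
        # Inner folds hold positions into outer_train, not raw sample indices.
        inner_folds = k_fold_indices(len(outer_train), inner_k, seed=i + 1)

        def inner_score(params):
            scores = []
            for v, val_pos in enumerate(inner_folds):
                val = [outer_train[p] for p in val_pos]
                trn = [outer_train[p] for f, fold in enumerate(inner_folds)
                       if f != v for p in fold]
                scores.append(score_fn(trn, val, params))
            return sum(scores) / len(scores)

        # Tune on the outer-training samples only.
        best = max(param_grid, key=inner_score)
        # Estimate test performance on samples never used for tuning.
        results.append((best, score_fn(outer_train, outer_test, best)))
    return results


def toy_score(train_idx, test_idx, params):
    """Stand-in scorer that ignores the data and prefers one grid point."""
    return -(abs(params["alpha"] - 0.75) + abs(params["lambda"] - 0.25))


grid = [{"alpha": a, "lambda": l}
        for a in (0.25, 0.5, 0.75) for l in (0.1, 0.25)]
results = nested_cv(45, grid, toy_score)
for best, score in results:
    print(best, score)
```

In a real analysis, `score_fn` would fit the classifier on the training indices and return a metric such as AUC on the held-out indices; averaging the outer scores then gives the test-performance estimate.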
- Please see our poster! 😄
- SVM performed slightly better than glmnet (for both training and testing error).
- The final model used 11 CpG predictors and was built with glmnet, with training and testing AUCs of 0.981 and 0.977 ± 0.024, respectively (α = 0.75, λ = 0.25).
- The classifier assigned every sample in the unlabeled test set to Caucasian ancestry, which we doubt reflects the true ancestry.
- We suspect the test set is too ‘different’ from the training set for the classifier to perform accurately on it.
- Normalizing and QCing the test and training datasets together may be necessary for DNA methylation classifiers to perform well.
- Using MDS ancestry coordinates from population stratification meta-analyses may provide ‘labels’ to assess classifier performance or improve model building (self-reported ancestry can be unreliable).
- Project proposal: introduces the ideas, datasets, and methods used in this project.
- Progress report: describes the progress of our project.
- Data folder: contains metadata, raw data, and processed data.
  - Metadata for dataset 1:
    - human placental tissue from 45 subjects with self-reported ancestry
    - columns correspond to subject ancestry, name, sex, gestational age, and pregnancy complications (none, intrauterine growth restriction (IUGR), or late-onset preeclampsia (LOPET), neither of which affects DNAm)
    - columns for Sentrix ID and position correspond to the sample’s batch ID and position on the Illumina microarray
    - each row is one subject
  - Raw data for dataset 1.
  - Processed data: this folder contains the data processed from the raw data.
- Scripts folder: contains the scripts for:
  - Preprocessing: processes the raw data.
  - Exploratory Analysis: explores our processed training data to see if there is any obvious underlying structure.
  - Differential Methylation Analysis: done using limma on the processed data.
  - Building the classifier: this script builds the ancestry classifier and analyzes the resulting predictor CpGs. This folder also contains the script that runs the classifier on the second dataset, whose genetic ancestry is unknown, and analyzes those results.
  - Comparing SVM vs glmnet: this script was used to compare glmnet and SVM.
  - Functional Analysis: functional analysis of the CpG sites prioritized by glmnet and limma.
- Results: contains a summary of our main findings.