Alzheimer’s disease is a widespread, irreversible, progressive neurodegenerative disease with a complex genetic architecture. The key goal of this project is to identify disease risk genes and classify them as associated or unassociated with Alzheimer's disease.
Various machine learning algorithms have been used to predict candidate genes. Previous prediction methods can be roughly divided into five types:
- Methods studying protein-protein interaction networks
- Gene functional annotations
- Sequence-based feature patterns
- Machine learning and network topological features
- Information about tissue-specific networks
These methods predict associated genes or biomarkers. However, there are few reports on brain gene expression data. Accordingly, the research paper by Huang et al., "Revealing Alzheimer’s disease genes spectrum in the whole-genome by machine learning", was used as a reference for this project.
The aim is to divide the genes into four classes, namely C1-AD: probable pathogenic genes, C2-AD: high confidence genes, C3-AD: related genes, and C4-AD: possibly associated genes.
- Numpy
- Scipy
- Sklearn
- Pandas
- Pylab
- Matplotlib
- Itertools
Environment: Python 3.6, Windows 10
The dataset used in the above-mentioned research paper was taken from the AlzGene archive. The training features include the number of positive and negative Alzheimer's cases in control studies and family-based studies for 335 genes.
The lack of sufficient data samples makes it difficult to train the model. Accordingly, regularization has been used to prevent overfitting.
The data was split so that 33% was held out for testing and the remainder was used for training.
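A 33% hold-out of this kind can be sketched with scikit-learn's `train_test_split`. This is only an illustration: the feature matrix and labels below are random placeholders standing in for the 335-gene table, the four columns are a hypothetical layout, and the fixed seed and stratification are assumptions rather than details taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the gene table: 335 genes, 4 count-style features.
# Column meanings and labels are hypothetical placeholders, not real data.
rng = np.random.default_rng(0)
n_genes = 335
X = rng.integers(0, 50, size=(n_genes, 4))
y = rng.integers(0, 4, size=n_genes)  # 4 classes, encoded 0..3

# Hold out 33% of the genes for testing; stratify to keep class ratios similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

With 335 samples, a 0.33 test fraction leaves roughly two thirds of the genes for fitting, which is why regularization matters on a dataset this small.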
The following algorithms were trained on the given dataset:
- Support Vector Machine using Radial kernel
- Support Vector Machine using Linear kernel
- Support Vector Machine using Polynomial kernel
- Support Vector Machine using Sigmoid kernel
- Decision Trees
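The five models listed above could be set up in scikit-learn roughly as follows. This is a hedged sketch, not the project's actual code: the data comes from `make_classification` as a synthetic stand-in, scikit-learn's `SVC` is used in place of the R library mentioned in the paper, and the values of `C` and `max_depth` (the regularization knobs here) are illustrative guesses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem with 335 samples (placeholder for the gene table).
X, y = make_classification(
    n_samples=335, n_features=4, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

def svm(kernel):
    # Scale features first; C controls regularization strength (assumed value).
    return make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))

models = {
    "svm_rbf": svm("rbf"),
    "svm_linear": svm("linear"),
    "svm_poly": svm("poly"),
    "svm_sigmoid": svm("sigmoid"),
    # Limiting depth is one simple way to regularize a tree (assumed value).
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

The accuracies printed here describe the synthetic data only and say nothing about the results reported for the real dataset.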
The algorithms were evaluated using the micro, macro, and weighted averages of their accuracy, precision, F1 score, and support across the four predicted classes.
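To make the three averaging modes concrete, here is a small sketch using scikit-learn's `precision_recall_fscore_support`. The true and predicted labels are invented toy values, chosen only so the averages are easy to follow. The function also returns recall, which is ignored here.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

labels = ["C1-AD", "C2-AD", "C3-AD", "C4-AD"]

# Toy labels (hypothetical): two genes per class, two misclassified.
y_true = ["C1-AD", "C1-AD", "C2-AD", "C2-AD", "C3-AD", "C3-AD", "C4-AD", "C4-AD"]
y_pred = ["C1-AD", "C2-AD", "C2-AD", "C2-AD", "C3-AD", "C4-AD", "C4-AD", "C4-AD"]

accuracy = accuracy_score(y_true, y_pred)

# micro: pool all decisions; macro: unweighted mean over classes;
# weighted: mean over classes weighted by each class's support.
averages = {}
for avg in ("micro", "macro", "weighted"):
    precision, _, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average=avg
    )
    averages[avg] = {"precision": precision, "f1": f1}

# Support is the number of true samples per class (average=None keeps classes apart).
_, _, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None
)

print("accuracy:", accuracy)
print("averages:", averages)
print("support:", dict(zip(labels, support)))
```

For single-label multiclass problems like this one, micro-averaged precision equals plain accuracy, so the macro and weighted averages are the ones that reveal per-class imbalance.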
Of these, decision trees gave the best accuracy (88.29%).
However, the highest Receiver Operating Characteristic (ROC) curve area of 0.78 was obtained using Support Vector Machine with Radial kernel.
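One common way to obtain an ROC area for a four-class problem is to binarize the labels one-vs-rest and compute per-class and micro-averaged curves. The sketch below does this with scikit-learn on synthetic data. How the 0.78 figure was actually averaged is not stated in the source, so the micro-average here is an assumption, as are the `SVC` settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the gene table (not real data).
X, y = make_classification(
    n_samples=335, n_features=4, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)
classes = np.unique(y)

# RBF-kernel SVM; decision_function returns one score column per class.
clf = SVC(kernel="rbf", gamma="scale", decision_function_shape="ovr")
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)

# One indicator column per class, aligned with the score columns.
y_test_bin = label_binarize(y_test, classes=classes)

# Per-class ROC area: each class against all the others.
per_class_auc = {}
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], scores[:, i])
    per_class_auc[int(cls)] = auc(fpr, tpr)

# Micro-average: pool every (sample, class) decision into one curve.
fpr_micro, tpr_micro, _ = roc_curve(y_test_bin.ravel(), scores.ravel())
micro_auc = auc(fpr_micro, tpr_micro)

print("per-class AUC:", per_class_auc)
print("micro-average AUC:", micro_auc)
```

Accuracy and ROC area can rank models differently, as the results above show, because ROC area depends on how well the scores order the samples rather than on the final hard predictions.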
Note: The Support Vector Machine results obtained with the R library were provided in the paper and were not reproduced by us.