Skip to content

samanta-anupam/pheno-predictor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pheno-predictor

Comp Bio fall 2017 project

Phenotypic Prediction from Transcriptomic Features Project Description: You are provided with output from Salmon, an RNA-seq mapping and quantification tool, on a number of datasets. The various samples come from different phenotypes; types of cancers in this case. Hence, each dataset is given a label based on the originating cancer type. Your task in this project is to build a model that takes the Salmon output and predicts the original label. You can train your model using all or some subset of the input data but make sure to not overfit the model since we will be testing it on a different dataset. You can begin by selecting basic features from the Salmon output, such as TPM (transcripts per million), transcript lengths and counts. These are provided in the “quant.sf” file. However, we expect you will come up with a more complex feature selection method that will likely provide better results. We will provide a baseline that you must outperform. The goal is to achieve the maximum accuracy in classifying the samples using multi-class classification models. Ideally, you should test multiple feature selection methods and models. Your report should include all the results.

The special properties of this dataset make it difficult to immediately obtain good prediction accuracies by building a well-known classification model like SVM, Random Forest or NaiveBayes on it. Each phenotypic category has data points assigned to it (around 20 or fewer samples) while each data point has a huge number of raw features (At least equal to total number of known transcripts for human being in this case which is ~100k, but we have additional features as well). So the main challenge in this problem is selecting and generating few number of descriptive and discriminative features. For that you can first use some visualization- and statistical-based feature analysis (e.g. use a Decision Tree to find the most discriminative features that come at the top levels of the tree, compute the correlation matrix to find and remove highly correlated features) use different feature selection methodologies (e.g. those based on entropy), use kernel models or feature segmentation methods to create advanced features, and dimensionality reduction models to change the feature domain and extract principal components . After that you can apply a classification model to the reduced datasets and run a grid search to find the best fit parameters for your model. For evaluation of different models you can use ROC plot and F-score and Accuracy metrics. Input: output of Salmon Transcripts tpm, count, length Equivalence Classes for each data set Bias Models Output: A phenotypic Prediction Model for Cancer Detection Midway Result: 1) feature selection + 1 dimensionality reduction algorithm + 1 Multi-Classification Model that beats the baseline. Keywords: Transcript Quantification, TPM, Prediction, Salmon

Datasets : https://drive.google.com/drive/folders/0B_OfHrdJFRfdb3h3eEwyZ0FsY3M

Salmon docs : http://salmon.readthedocs.io/en/latest/file_formats.html#fileformats

Glossary:

  1. TPM = (transcripts per million) TPM = Total # of reads / (effective length * 1000000)

Running the model

  1. Following packages need to be installed:
pip install pandas
pip install pickle
pip install os
pip install itertools
pip install scikit-learn
pip install scipy
pip install matplotlib.pyplot
pip install numpy
pip install xgboost
  1. The model evaluation code can be run from the Jupyter Notebook- Model_Evaluation.ipynb
  2. Make the following changes in the Model_Evaluation.ipynb-
    • Change TEST_PATH to folders containing test files
    • Change CSV_PATH to point to the CSV_FILES
  3. Run the cells of the notebook
    • 2nd Last Cell signifies XGBOOST on TPM Count dataset
    • Last Cell signifies XGBOOST on Equivalence Classes as Features