This repository clusters a time-series Gene Expression Matrix (GEM) based on gene trajectory patterns.
Because most biological processes are dynamic, time-series transcriptome experiments play a pivotal role in understanding and modeling them. K-Means is a common approach among bioinformaticians for profiling the time-course transcriptional response. TSGEM_Clustering improves K-Means performance on sequential time-series data by adopting the Dynamic Time Warping (DTW) distance metric (https://arxiv.org/abs/1703.01541). It also selects the optimal number of clusters based on distortion (the elbow method), the silhouette coefficient, and the Calinski-Harabasz index.
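To illustrate why DTW can suit expression trajectories better than a point-by-point distance, here is a minimal pure-Python sketch of the classic DTW recurrence. It is not the code used in this repository (which relies on tslearn); the function names and the two toy trajectories are made up for illustration, and the squared-difference local cost with a final square root is an assumed convention.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two 1-D sequences (illustrative sketch).

    Assumption: local cost is the squared difference, and the result is
    the square root of the accumulated cost along the best warping path.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: step in a, step in b, step in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance for equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical genes with the same response shape, one time point apart
early_peak = [0, 1, 2, 1, 0, 0]
late_peak = [0, 0, 1, 2, 1, 0]

dtw_value = dtw_distance(early_peak, late_peak)
euclid_value = euclidean_distance(early_peak, late_peak)
print("DTW:", dtw_value)
print("Euclidean:", euclid_value)
```

In this toy case the two trajectories differ only by a one-step delay, so DTW aligns them perfectly while the Euclidean distance treats them as different profiles.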
All of TSGEM_Clustering's dependencies can be installed through Anaconda3. To create an Anaconda environment:
#Specific to Clemson's Palmetto Cluster
module load anaconda3/5.1.0-gcc/8.3.1
conda create -n TSGEM_Clustering python=3.6 matplotlib numpy pandas scikit-learn
Once the Anaconda environment has been created, the tslearn package must be installed separately:
source activate TSGEM_Clustering
conda install -c conda-forge tslearn
python DTW-perf.py -i Test/test.txt -kmin 2 -kmax 30 -step 2 -o Test_Results/Step1Test
python DTWKMeans.py -i Test/test.txt -k 6 -o Test_Results/Step2Clustering -p Test-K6
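The -kmin, -kmax, and -step flags in the first command suggest a scan over candidate cluster counts. The following sketch shows how the three criteria named earlier could be computed for each candidate k. It is an illustration under stated assumptions, not the contents of DTW-perf.py: it uses scikit-learn's plain Euclidean KMeans rather than DTW-based K-Means, the data are synthetic, and all variable names are hypothetical. It also assumes a scikit-learn version that provides calinski_harabasz_score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for a GEM: 60 "genes" x 6 time points, drawn from
# three well-separated trajectory shapes plus small noise (assumption).
rng = np.random.default_rng(0)
patterns = np.array([
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],   # steadily increasing
    [5.0, 4.0, 3.0, 2.0, 1.0, 0.0],   # steadily decreasing
    [0.0, 3.0, 5.0, 5.0, 3.0, 0.0],   # transient peak
])
X = np.vstack([p + rng.normal(0.0, 0.1, size=(20, 6)) for p in patterns])

scores = {}
for k in range(2, 7):
    # Euclidean KMeans is swapped in here to keep the sketch short
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "distortion": model.inertia_,  # within-cluster sum of squares (elbow)
        "silhouette": silhouette_score(X, model.labels_),
        "calinski_harabasz": calinski_harabasz_score(X, model.labels_),
    }

# Pick the k with the highest silhouette coefficient
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print("best k by silhouette:", best_k)
```

Distortion always decreases as k grows, so it is read by looking for the elbow, whereas the silhouette coefficient and the Calinski-Harabasz index are both higher for better-separated clusterings.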
Example outputs can be found in Auxiliary.