-
Notifications
You must be signed in to change notification settings - Fork 5
Example Tutorial
This example will introduce the MERLoT package using a dataset that represents the embryonic development from the zygote state until the 64 cells state embryo, presented by Guo et al, 2010 (Here is the original article).
library(merlot)
Read the expression matrix of the Guo dataset that is distributed with the package:
DataFile= paste(find.package("merlot"),"/example/Guo2010.txt", sep="")
Dataset=ReadDataset(DataFile)
The expression matrix is an NxG
dimensional matrix with cells at the rows and gene expression values in the columns.
MERLoT works on manifold coordinates, produced by any dimensionality reduction algorithm. In this example we are going to use the destiny package in order to calculate a diffusion map from which we are going to use the first 3 coordinates. However other dimensionality reduction algorithms such as Monocle2 or SLICER can also be used to calculate manifold coordinates and use them as input for MERLoT.
library(destiny)
DatasetDM <- DiffusionMap(Dataset$ExpressionMatrix)
We keep the first 3 coordinates from the calculated diffusion map.Note that while 3 dimensions are enough for this example, they may not be enough for more complex cases. The more complex a topology is, the more dimensions are usually needed to capture it. Technically, a topology with n
bifurcations can be represented in at most n-1
dimensions.
CellCoordinates=DatasetDM@eigenvectors[,1:3]
Given the coordinates from the manifold, MERLoT will first calculate a scaffold tree, whose skeleton will go through different cells in the dataset. This scaffold tree is of fundamental importance to define the tree topology (i.e. location of endpoints, branchpoints and how they are connected) and subsequently initialize the next step where a principal elastic tree will be calculated.
ScaffoldTree=CalculateScaffoldTree(CellCoordinates = CellCoordinates)
The scaffold tree can be visualized using the following function:
plot_scaffold_tree(ScaffoldTree = ScaffoldTree)
Note1: Automatic visualization of the scaffold tree is only possible for 2 or 3 dimensions.
Note2: The scaffold tree function internally uses a python script. In this stage it is critical that the python installation that is used by the active R instance is the one that has all the dependencies installed. If not (e.g. you are using a conda environment), you need to provide the location of the correct python binary to the python_location argument in the CalculateScaffoldTree() function as in the following example:
ScaffoldTree=CalculateScaffoldTree(CellCoordinates = CellCoordinates, python_location = "/usr/bin/python3")
Having obtained the scaffold tree, it can be used to initialize and calculate a principal elastic tree. Given the tree topology calculated in the previous step, the principal elastic tree will approximate the distribution of cells in the given manifold by minimizing the approximation error between the data and the nodes in the tree and minimizing an energy function. More information about this tool is readily available in its github repository.
First, the user needs to set the number of nodes that will be use to create the elastic principal tree
NumberOfNodes=100
We calculate the elastic principal tree using the scaffold tree for its initialization
ElasticTree= CalculateElasticTree(ScaffoldTree = ScaffoldTree, N_yk = NumberOfNodes)
The method will try to accommodate all N_yk nodes on the tree. Using more nodes increases the resolution of gene expression but it also decreases the number of real cells assigned to each node. Using too many nodes presents two main difficulties: it will either force the tree to assume an unnatural topology in order to accommodate the additional nodes, or it will keep a correct overall topology but assign very few cells to each node. Too many nodes will result in a noisier reconstructed gene expression profile as a function of pseudotime (see next section). On the other hand, using too few nodes will mean that the subtle differences that drive differentiation will be drowned by averaging over very coarse groups. The number of nodes in the elastic tree can be viewed as the resolution of a grid in the (pseudo)temporal coordinate that explains the evolution of the process.
In order to produce the resulting tree, the algorithm minimizes an elastic energy function that mainly depends on two parameters, mu and lambda (see this (see this link for a detailed explanation). The values of lambda and mu that will produce balanced trees change according to how many nodes the tree consists of. We propose default values for mu (mu_0 = 0.00625) and lambda (lambda_0 = 2.03e-09) that produced balanced trees with 100 nodes on real datasets. These values, while robust, are not guaranteed to work for every dataset. Users can adjust their values in the CalculateElasticTree
function call:
CalculateElasticTree(ScaffoldTree, N_yk)
In case lamba_0 and mu_0 are aimed to be changed:
CalculateElasticTree(ScaffoldTree, N_yk, lambda_0 = new_lambda_0, mu_0 = new_mu_0)
Plot the principal elastic tree:
plot_elastic_tree(ElasticTree, colorcells = NULL)
a vector containing colors for the cells can be given to the function. If no vector is provided an internal palette of colors will be used for the different branches.
In case very complex trees that required the use of many dimensions, a 2-d plot function can be used to visualize the data. It is also helpful in order to identify the internal numbering of endpoints, branchpoints and branches in the ElasticTree data object.
plot_flattened_tree(ElasticTree, legend_position="bottomright")
Ideally the lineage of a differentiation process would be studied without dimensionality reduction, in order to preserve the information inherent in gene expression. This is unfeasible because of the noise levels, but the low-dimensional principal tree lets us circumvent this limitation. The low-dimensional tree can be transformed to one in the full gene expression space by mapping nodes as intermediate points of the cells that were assigned to them in manifold space.
The gene expression profiles for the nodes in the elastic tree will be reconstructed by embedding the tree structure in the ExpressionMatrix space.
EmbeddedTree= GenesSpaceEmbedding(ExpressionMatrix = Dataset$ExpressionMatrix, ElasticTree = ElasticTree)
Pseudotime is an abstract concept quantifying the progress of a cell through differentiation. Contrary to wall time, the pseudotime of a cell is measured in relative terms, as its distance from an arbitrary starting point in the manifold. In the context of a tree this progress can be measured as distance from any node - be it an end point (T0=
) or a particular cell (C0=
). Expert knowledge about the process can be incorporated easily in the calculations by choosing the correct starting point for pseudotime calculation.
Calculate pseudotimes for the nodes in the Tree in the low dimensional space or in the full gene expression space:
Pseudotimes=CalculatePseudotimes(ElasticTree, T0 = 1)
Pseudotimes=CalculatePseudotimes(EmbeddedTree, T0 = 1)
T0 tells the function which endpoint in the elastic tree will be considered as the initial time point. C0 can be used instead of T0 to indicate which cell in the dataset will be considered as the initial time point.
The pseudotimes for the cells can be visualized using:
plot_pseudotimes(CellCoordinates, Pseudotimes)
Plot gene expression profile as a function of pseudotime:
plot_pseudotime_expression_gene(GeneName = "Gata4" , EmbeddedTree = EmbeddedTree, Pseudotimes = Pseudotimes, addlegend = T)
Plots the averaged expression and the interpolated expression matrices. The function also returns both matrices in a list object.
OrderedMatrix=plot_heatmaps_embedding(Pseudotimes, EmbeddedTree, log_tranform=F)
Finding genes that are differentially expressed between two sub-populations in the tree is easy. First, define the sub-populations (e.g. two branches of the tree):
Group1=EmbeddedTree$Branches[[1]]
Group2=EmbeddedTree$Branches[[2]]
DifferentiallyExpressedGenes=subpopulations_differential_expression(SubPopulation1 = Group1, SubPopulation2 = Group2, EmbeddedTree = EmbeddedTree, mode = "cells")
This function applies the Kruskal-Wallis test on the two defined sub-populations and returns all genes, sorted according to their p-values. The returned object contains the GenesNames, p-values, e-values and original order index in the Gene List (taken from the colnames
in the provided ExpressionMatrix
).
The same test can be applied to find genes that are differentially expressed in a specific branch:
Branch1Genes=branch_differential_expression(Branch=1, EmbeddedTree)
Branch2Genes=branch_differential_expression(Branch=2, EmbeddedTree)
This function is based on subpopulations_differential_expression
. It defines one branch as one sub-population and the rest of the tree as the other, and performs the Kruskal-Wallis test on them. The return value is similar to subpopulations_differential_expression
.