Skip to content

AtteAalto/BINGO

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BINGO

This is the code for BINGO (Bayesian Inference of Networks with Gaussian prOcess dynamical models), a method for gene regulatory network inference from time series data. It is introduced in the article

A. Aalto, L. Viitasaari, P. Ilmonen, L. Mombaerts and J. Goncalves. "Gene regulatory network inference from sparsely sampled noisy data", Nature Communications 11: 3493 (2020).

Please cite the article if you use the method in your project. Method is tested with Matlab 2017a.

General information

BINGO assumes that the discrete time series data is sampled from a continuous time trajectory x, satisfying dx/dt = f(x) + u. The function f is modelled as a Gaussian process (GP) and the network structure is encoded in the hyperparameters of the covariance of the GP. The mean of the i th component of f is mi(x) = bi - aixi. The method is based on MCMC sampling, meaning that it collects a large number of network samples. If a particular link appears in 60% of the samples, that is the posterior probability for the existence of that link, given the data and the underlying assumptions. It is recommended that at least 5000 samples is collected. Nine out of ten samples are discarded (thinning), so that 5000 samples requires 50000 iterations. It is possible to run a smaller amount of iterations in the beginning to check that the sampler works, and then run more samples. Sampling is fairly slow even for small systems, but the computation time scaling is rather good as the dimension grows.

Basic use

The code consists of two main functions, BINGO_init, that initialises the MCMC sampler and does some data processing. The sampling is done in the function BINGO. The inputs to this function are the data, given as a Matlab structure as described below, the state of the sampler (initialised by BINGO_init), and the sampling parameters (initialised by the BINGO_init, more information below). The output of the function consists of a matrix Plink, which is an n x n matrix where n is the number of genes. An element (i,j) of the matrix Plink tells the number of the collected network samples that contain the link j —> i. The function also returns the number of the collected samples, which is called chain. The posterior probabilities of links are obtained by dividing Plink/chain. The output variable xstore contains the posterior mean of the continuous trajectory. The output variable state contains the current state of the sampler (needed if the sampling is continued, for example). The output variable stats contains statistics of the MCMC sampling, which are also displayed in the command window after the sampling is finished.

The main file (BINGO_main) used for running the simulations contains first a burn-in, that is, running the sampler for a while and discarding all the samples. Then it contains a sampling part, and finally a part continuing the sampling process. This makes it possible to run some iterations first to get some idea, and then continue the sampling to improve the results. Notice that the method is also easily parallelisable. Just run the method on different processors, and in the end, sum up the Plink matrices. A simple example dataset is provided together with an example file (BINGO_example) where the various functionalities of the method (described below in this document) are presented.

WARNING!

In the data pre-processing (in the file BINGO_init), the time series data is scaled and moved so that the minimum value of each variable is 0 and the maximum is 1. This scaling is equivalent to giving scale-dependent priors to the hyperparameters beta. However, if the data contains genes that are not really expressed and contain only noise, this scaling leads to noise amplification. Therefore, the data should be checked and selected beforehand, and genes that are not at all expressed should be discarded. If, however, a gene is expressed in one experiment, and not in another, then it should be kept. Reducing the number of genes will also speed up the algorithm.

Data structure

The data is given in a Matlab structure called data. The time series data is included as a cell field data.ts containing different experiments. Each cell corresponds to one experiment, and should consist of a matrix whose number of rows is the number of genes, and the number of columns is the number of time points in that experiment.

Example:

Number of genes is n. Two experiments have been conducted with m1 time points in the first experiment, and m2 time points in the second experiment. The data is in matrices y1 (n x m1) and y2 (n x m2). Then the data.ts field is generated as follows:

data.ts = {y1 , y2};

If your data is uniformly sampled, with no external input, no missing measurements, no prior knowledge of the network, and none of the time series correspond to knockout/knockdown experiments, you are good to go. If this is not the case, read further. BINGO can be applied on pseudotime series as well, but the data structure is slightly different.

Sampling times

If the sampling times are not explicitly given by the user, uniform sampling (with sampling time 1) is assumed. If the sampling is not uniform (or if different sampling frequency is used in the different experiments), this can be specified in the cell field data.Tsam.

Example:

Possible inputs for the cell field Tsam are:

  • Constant sampling time intervals (dt) for all experiments: data.Tsam = {dt}
  • Different constant sampling time intervals (dt1 and dt2) for the two experiments: data.Tsam = {dt1 , dt2};
  • Completely non-uniform sampling times: data.Tsam = {[t1_1, t1_2 , … , t1_m1] , [t2_1, t2_2 , … , t2_m2]}. In this case, the number of given time points must match the size of the corresponding measurement matrix.
  • A combination of the two above: data.Tsam = {[t1 , t2 , … , tm1] , [dt2]};

Additional information:

The main function BINGO wants the field Tsam in the format of a cell field where each cell has all the measurement times (corresponding to the case of completely non-uniform sampling). If the format is something else, then the initialisation file BINGO_init changes the format. However, if the user gives the sampling times in some other form, an error message will be displayed.

Explicit input (perturbations etc.)

Explicit inputs can be given in the field data.input. Its use is best explained through examples.

Example:

Say m1 = m2 = 8. Consider the following example cases:

  1. The second experiment is a perturbation experiment, where some unknown perturbation is applied half way through the experiment. This kind of input is fed as follows:
    data.input = {zeros(1,8) , [0 0 0 0 1 1 1 1]};
  2. Both experiments are perturbation experiments with two different perturbations. This is given as follows:
    data.input = {[0 0 0 0 1 1 1 1; zeros(1,8)] , [zeros(1,8); 0 0 0 0 1 1 1 1]};
  3. Both experiments have the same perturbation. This is given as follows:
    data.input = {[0 0 0 0 1 1 1 1] , [0 0 0 0 1 1 1 1]};
  4. In the first experiment, a perturbation is applied at time zero, and in the other experiment, there is no perturbation:
    data.input = {ones(1,8) , zeros(1,8)};

However, if there is only one experiment with a perturbation applied at time zero, then an input of the form {ones(1,8)} does not have any effect, and it should not be included. In such case, assuming the system was in steady state before the perturbation, it is possible to duplicate the first measurement writing

data.ts = {[y(:,1) , y]};

and then giving input

data.input = {[0 1 1 . . . 1]};

  1. The input can also be a continuous variable, such as temperature:

    data.input = {[T1_1 , T1_2, …, T1_8] , [T2_1 , T2_2, …, T2_8]};

and so on.

When external input is given, the size of the output matrix Plink is n x (n + n_in) where n_in is the dimension of the input. The method also tells which genes does the perturbation affect directly. If the perturbation target is known, read the section “Prior knowledge of the network” below.

Additional information:

The above warning concerns also external inputs. For example, if the temperature variation, and its effect on the process is negligible, then it should not be included as a possible input.

Missing measurements

If the measurement of some genes fail at some time points, this can be included in the information given to the method in a field data.missing. Each cell in this field should be a matrix with size N_miss x 2, where each row corresponds to a missing measurement of a gene. The first element on the row identifies the gene in question, and the second element tells the index of the time point where the measurement is missing. The values will be ignored in the method, so it does not matter what is the actual value of the element in the original data matrix.

Example:

Say the measurement of gene 2 failed at the 2nd and 3rd measurement times, and in the second experiment, the measurement of gene 3 failed in the 7th time point and the measurement of gene 5 failed in the first and fourth time points. This is given as follows:

data.missing = {[2 2; 2 3] , [3 7; 5 1; 5 4]};

Additional information:

Missing measurements are treated so that the Crank—Nicolson sampler draws Brownian bridge samples between the previous and next non-missing sample (corresponding precisely to the measurement model where the missing measurements are omitted). This is achieved by first replacing missing measurements by their linear interpolates (in the file BINGO_init), and then increasing the variance of the corresponding measurement noise in the BINGO file.

Prior knowledge of the network

If some links are certainly known to exist or not to exist, this information can be encoded in the field data.sure. This should be an n x n matrix (or n x (n + n_in) where n_in is the input dimension if one also has prior knowledge on the targets of potential inputs) with ones in the positions indicating links that are known to exist, minus ones on the places known not to exist, and zeros elsewhere. Such prior information improves the results and speeds up the sampling as the number of candidate networks decreases.

Example:

In a 3 x 3 network, it is known for sure that gene 3 is a regulator of gene 1. In addition, there is one perturbation (external input) which is known to affect only gene 2. This is written as

data.sure = [0 0 1 -1; 0 0 0 1; 0 0 0 -1];

Links can also be given different Bayesian prior probabilities of existence. The prior proabilities are given in a field parameters.link_pr. This should be a matrix of size n x n matrix (or n x (n + n_in) where n_in is the input dimension. If a certain link exists with (prior) probability p, then the corresponding element in link_pr matrix is p / (1-p). If a link is known to exist or to not exist for sure, it is advised to use the data.sure field, since that will speed up the MCMC sampling.

Knockout time series

If some time series correspond to gene knockout/knockdown experiments, then that time series should not be taken into account in the inference of links pointing to the knocked-out/down gene. Such information is included in the field data.ko. For example, if there are four time series, and the third corresponds to an experiment where the second gene is knocked out and the fourth time series corresponds to an experiment where the fourth and fifth genes are knocked out, it is given as follows:

data.ko = {[ ] , [ ] , [2] , [4 5]};

Pseudotime series

Note: The use of BINGO on pseudotime series was not discussed in the original publication, and no benchmarking has been done by the authors in this setup.

BINGO's trajectory sampling is designed to take into account the low sampling frequency of typical bulk gene expression measurements. However, the trajectory sampling makes sense also in the setup of single cell data after applying a pseudotime algorithm. Typically such data forms a fairly scattered point cloud. The trajectory sampling makes sense also in this case. BINGO will sample trajectories that pass through this point cloud. From computational point of view, the standard implementation of BINGO becomes too heavy (since it tries to interpolate the trajectory between each two pseudotime points), and therefore a modification is needed.

Method parameters

Method parameters are given to the BINGO-function as a structure called parameters. The values are set in the file BINGO_init, and it is not necessary to change these parameters. The parameter fields are:

  • link_pr: The prior for networks is that the existence of each link is independent of others. If a link exists with (prior) probability p, then link_pr = p / (1-p). This parameter controls the level of sparsity in the network samples. This can be specified by the user, but it is not necessary. For example, if there are 50 nodes, and it is expected that each node has about 2-3 (incoming/outgoing) links, then one should set, for example, p = 2.5 / 50. However, if the user does not provide the parameter, the default value is link_pr = 1/n. Note that this parameter can also be a matrix, if different priors are given for different links.
  • its: number of iterations.
  • nstep: number of steps in the “continuous” trajectory between the measurement times. Default value is 4. It can be reduced to 3 or even 2 if the sampling takes a very long time. If it is one, the method reduces to a kind of a discrete-time GPDM.
  • Theur: a temperature variable used in a heuristic scheme to speed up the topology sampling. If Theur is one, it corresponds to exact sampling. Higher Theur improves the acceptance probability of the topology sampling.

A number of different step size parameters can be set by the user. Shorter step sizes increase the acceptance rates of the MCMC sampling, but it has a negative effect on the efficiency of the exploration of the parameter space. Note that hyperparameters gamma, beta, a, and b are sampled together. If their acceptance probability drops, either one or more of these can have too long step sizes. Similarly, the trajectory is sampled together with the noise variance q.

  • etraj: step size for the Crank—Nicolson sampler for the continuous trajectory. Default value is 0.06, and it is recommended that this is at least 0.05. It can be adjusted if the acceptation probability for the trajectory becomes low, which may happen if there are several experiments in the data.
  • ea: step size for sampling the a-parameters in the mean function.
  • eb: step size for sampling the b-parameters in the mean function.
  • egamma: step size for sampling the gamma-parameter in the GP covariance.
  • ebeta: step size for sampling the beta-parameter in the GP covariance.
  • er: step size for sampling the measurement error variance r.
  • eq: step size for sampling the process noise covariance q.

Not part of the parameters-structure is nr_pi, which is the number of pseudoinputs used in the GP regression in the code. Higher number improves accuracy, but slows down sampling.

Troubleshooting

Problem: The acceptance probabilities for the trajectory and the topology are (almost) zero.

Possible solutions: The step size of the trajectory (parameters.etraj) can be reduced, but it should not be below 0.05. It is possible that the initial value for the process noise covariance is too small. Try to increase it by changing it inside the file BINGO_init (state.q). There is also a heuristic temperature variable (parameters.Theur) which is one by default. It can be increased to about 1.5 to speed-up topology sampling. However, this method is heuristic, and then the sampling is not exactly what is claimed.

Problem: The method is very slow.

Possible solutions:

  1. Consider parallelisation (see below).
  2. The gene expression trajectory is sampled on a finer grid where each measurement interval is divided into four parts (by default). This can be made coarser by changing nstep = 3 inside the function BINGO_init.
  3. Number of pseudoinputs can be reduced inside BINGO_init by setting nr_pi = 40 (for example, default is 50).
  4. Reduce dimension by throwing out genes whose expression data contains merely noise (which actually should be done anyway).
  5. If some genes are only interesting as potential regulators, they can be given to the method as inputs rather than as part of the time series data.

Parallelisation

  1. Generate the data-structure;

parfor-loop

  1. Run BINGO_init;

  2. The burn-in can be shortened by setting parameters.its = 1000 (for example, default is 3000);

  3. Run BINGO with a suitable number of iterations;

  4. Store results;
    end

  5. Sum up the different Plink matrices and divide by the total number of samples.

Updates on the method since publication of the article

March 22, 2021: Hyperparameter sampling is changed from random walk to random walk in log-domain to make it scale-independent (reverted 21 July, 2021).

April 21, 2021: Prior probabilities of links can be different.

July 21, 2021: Hyperparameter sampling in log-domain reverted. A bug in dealing with knockout time series fixed.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published