Implementation of the Most Probable Path (MPP) algorithm for the coarse-graining of state trajectories, e.g. as obtained from Molecular Dynamics simulations.

The Most Probable Path Algorithm for Reducing the Number of States in a State Trajectory

This package implements the most probable path (MPP) algorithm, which is used to coarse-grain the number of discrete states of a Markov process. Based on a microstate trajectory, a Markov state model is estimated utilising msmhelper. The transition probabilities and, optionally, other descriptors are used to combine states such that the final macrostates exhibit a given minimum population and metastability.

In the most basic example, a lumping tree is generated by iteratively selecting the least stable state and merging it with the state to which its transition probability is highest. In a second step, the tree is traversed in reverse order (starting at the root) to identify the macrostates. For details, see publication tbd.
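The greedy merging step above can be sketched with plain numpy. This is a minimal illustration of the idea under simplifying assumptions, not the package's actual implementation; the function names `transition_matrix` and `mpp_lumping_tree` are made up for this example and are not part of the MPP API:

```python
import numpy as np

def transition_matrix(traj, n_states, lag):
    """Row-normalized transition matrix estimated at the given lag time."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(traj[:-lag], traj[lag:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def mpp_lumping_tree(traj, n_states, lag):
    """Greedy sketch: repeatedly merge the least metastable state into the
    state it most probably transitions to, recording each merge."""
    labels = list(traj)
    merges = []
    for _ in range(n_states - 1):
        T = transition_matrix(labels, n_states, lag)
        alive = sorted(set(labels))
        if len(alive) < 2:
            break
        # least stable state = smallest self-transition probability
        src = min(alive, key=lambda s: T[s, s])
        # merge it into its most probable destination among the other states
        dst = max((s for s in alive if s != src), key=lambda s: T[src, s])
        merges.append((src, dst, T[src, src]))
        labels = [dst if s == src else s for s in labels]
    return merges
```

The recorded `(source, destination, metastability)` triples play the role of the lumping tree; the real algorithm additionally enforces the population and metastability thresholds described above.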

Features

  • Perform the most probable path (MPP) algorithm on a given microstate trajectory
  • Variety of analysis plots
  • Multi-trajectory support
  • Extensions to the basic algorithm
    • Similarity by Kullback-Leibler divergence of transition probabilities
    • Incorporation of the Jensen-Shannon divergence of a feature (e.g. contact distances)
    • Stochastic lumping
  • Three levels of user interface
    • as a library in your Python code (use the MPP.Lumping or MPP.run.Data objects)
    • as a Python module (python -m MPP.run --help)
    • in your Snakemake workflow (see `workflow`)

Installation

The package is available in the Python Package Index (PyPI) and can be installed via pip:

python -m pip install mpp-lumping

Caveat:

At the time of development, the bezier package did not support Python 3.13, which is why this package was only tested with Python 3.12.11.

Usage

Depending on your needs, the module can be used at three different entry points:

  • Integrate the central MPP.Lumping or MPP.run.Data object in your Python pipeline.
  • Use the high-level Python interface (python -m MPP.run ...) via the command line.
  • Use it in a Snakemake workflow, where you only need to provide the configuration of your system and you're ready to go.

Config File

Config files (YAML files) specify where the input files are located as well as some lumping parameters. Below you see a reference config file with all possible parameters. Note that only the following fields are mandatory: source, microstate trajectory, multi feature trajectory, frame length, lagtime, pop_thr, q_min. Please refer to the wiki for a detailed description of the parameters (tbd).

source: data/HP35/input/ # root directory of the other files

microstate trajectory: microstate_trajectory # the microstate trajectory
multi feature trajectory: contact_distances_trajectory # the feature trajectory, each line contains the feature values of the respective frame
cluster file: clusters # Defines the contact clusters, each row contains the contact indices of a contact cluster
contact index file: contacts.ndx # residue indices for contacts.
limits: null # contains the lengths of each trajectory when multiple trajectories are concatenated

topology file: structure.pdb # topology file used with the xtc trajectory
xtc file: trajectory.xtc # the xtc trajectory file
helices: helices # definition of secondary structure elements

contact threshold: null # Threshold in the feature space below which e.g. a contact is considered to be formed.
frame length: 0.2 # in ns / frame
lagtime: 50 # in frames
pop_thr: 0.005 # population threshold for macrostates
q_min: 0.5 # minimum metastability of macrostates

n timescales: 3 # number of timescales to plot in the implied timescales plot. 3 is the default.

# For stochastic lumping
stochastic:
  method: n
  param: 2
  n: 10

# PyMol rendering
view: view # contains the view information for PyMol
width: 500 # width of the image in px
height: 500 # height of the image in px
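As a sketch, the mandatory fields listed above can be validated after parsing the YAML file (e.g. with yaml.safe_load). The `check_config` helper below is hypothetical and not part of the package:

```python
# Hypothetical validation helper (not part of MPP): checks that a parsed
# config dict (e.g. the result of yaml.safe_load) contains all mandatory
# fields named in the documentation above.
MANDATORY_FIELDS = [
    "source", "microstate trajectory", "multi feature trajectory",
    "frame length", "lagtime", "pop_thr", "q_min",
]

def check_config(cfg: dict) -> dict:
    missing = [key for key in MANDATORY_FIELDS if key not in cfg]
    if missing:
        raise ValueError(f"missing mandatory config fields: {missing}")
    return cfg
```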

Python Module

Running the package requires a configuration file as described above as the first parameter. The following two positional arguments define the similarity between two states: first the dynamic similarity (T, KL, none), then the geometric similarity (JS, none). For reference, G-PCCA can be performed by passing gpcca first, followed by the number of macrostates to create. Pass ref to take the number of macrostates from the T none lumping (in which the similarity between states corresponds to the transition probability between them).

Provide the target file (where to store the plot) with the -o option. The lumping tree (the result of the first, potentially expensive step) is defined by a Z matrix, which can be stored and loaded with the -Z option. If the provided path exists, the file is loaded as the Z matrix. The same holds for the --rmsd option, as the RMSD is expensive to compute. With the --rmsd-feature option, you can choose whether the RMSD is calculated for the C-alpha atoms in Cartesian coordinate space or in the space of your feature, e.g. contact distances. -r draws N random frames from each macrostate, and -p produces the desired plots. --get-least-moving-residues saves, per macrostate, the state-trajectory indices of the least moving residues to a text file. This makes it possible to find the residues that participate in the most stable contact distances of each macrostate.

~$ python -m MPP.run --help
usage: Perform MPP on MD simulation data [-h] [-o OUT] [-Z Z] [--rmsd RMSD]
                                         [--rmsd-feature RMSD_FEATURE] [--xtc-stride XTC_STRIDE] [-r N]
                                         [-p PLOT]
                                         [--get-least-moving-residues GET_LEAST_MOVING_RESIDUES]
                                         data_specification d g

This program allows for the analysis of MD data utilizing the most probable path algorithm. It allows
for easy plotting of different quality measures.

positional arguments:
  data_specification    yaml file containing specification of files and parameters of the simulation
  d                     dij to be used.
  g                     gij to be used.

options:
  -h, --help            show this help message and exit
  -o OUT, --out OUT     Where to store the plot
  -Z Z                  Perform MPP and write the Z matrix
  --rmsd RMSD           Generate and write RMSD to file
  --rmsd-feature RMSD_FEATURE
                        'CA' for C-alpha RMSD or 'feature' for feature RMSD (default: CA)
  -r N, --draw-random N
                        Draw N random frames for each macrostate
  -p PLOT, --plot PLOT  Generate listed plots. Possible arguments include dendrogram, timescales,
                        sankey, contacts, macrotraj, ck_test, rmsd, delta_rmsd, state_network,
                        macro_feature, stochastic_state_similarity, relative_implied_timescales,
                        transition_matrix, transition_time and macrostate_trajectory. The latter writes
                        the macrostate trajectory to a txt file.
  --get-least-moving-residues GET_LEAST_MOVING_RESIDUES
                        Write least moving residues for each macrostate to a file

You can try the example in the GitHub repository by downloading the example directory, navigating into it, and running a command like

python -m MPP.run sample_system/input/config.yml T none -Z sample_system/results/t/Z.npy -p dendrogram -o sample_system/results/t/dendrogram.pdf

Make sure that source is set properly according to your working directory (you need to remove example from the path when you are inside the directory). Please note that not all functions of the package work here, because the sample system is only a mock-up.

The Snakemake Workflow

Snakemake is a workflow management tool and is used here to provide a high-level user interface. To use it, prepare a conda environment with Snakemake installed:

~$ conda create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake

Make sure that GROMACS is callable (gmx command) and you're ready to go (no need to install MPP manually).

Then copy the workflow directory to the same location as your data directory:

├── data/
│   ├── system1/
│   │   ├── input/
│   │   └── results/
│   ├── system2/
│   │   ├── input/
│   │   └── results/
│   └── ...
└── workflow/
    ├── mpp.yml
    ├── Snakefile
    └── ...

The Snakemake workflow requires this directory structure, but the name of the data directory can be freely chosen. Just make sure to set it correctly in the Snakefile (data_root = "your_data_directory") and in the config file of each system (source is the full path to the input directory). To create files with Snakemake, you only need to specify which file(s) you would like to create, e.g.

snakemake --cores 'all' --sdm conda -p example/sample_system/results/{t,t_js,kl,kl_js}/dendrogram.p{df,ng} --cache

creates dendrograms for four different lumpings as .pdf and .png files.

Explanation of some flags:

  • --cores Number of cores to utilize. 'all' for all cores.
  • --software-deployment-method, --sdm Use conda to deploy software environment.
  • --dry-run, --dryrun, -n Do not execute anything, just print out the jobs that would be run.
  • --cache Enable caching for rules that are eligible for it.
  • --force, -f Force recreation of the given file(s).
  • --printshellcmds, -p Print out the shell commands that are executed.

More information can be found in the Snakemake documentation.

Bash brace expansion (the use of { and }) can be used to create, e.g., several diagrams at once for multiple systems and/or setups.
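For illustration, the shell expands the braces into the full cross product of targets before Snakemake is even invoked; the paths below are made up for this example:

```shell
# The shell turns one pattern into eight concrete target paths
# (2 systems x 2 lumpings x 2 file formats):
echo data/{system1,system2}/results/{t,kl}/dendrogram.{pdf,png}
```

Because this happens in the shell, it works with any Snakemake invocation and requires no support from the workflow itself.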
