Skip to content

alncat/opusDSD

Repository files navigation

New updates!

We have developed a structural heterogeneity resolving method for cryo-ET, OPUS-TOMO. Welcome to check out https://github.com/alncat/opusTomo!

Table of contents

  1. Opus-DSD2
    1. Spliceosome
    2. Covid Spike
    3. 80S ribosome
  2. setup environment
  3. prepare data
  4. training
    1. train_cv
    2. train_multi
  5. analyze result
    1. sample latent spaces
    2. reconstruct volumes
    3. select particles

Opus-DSD2

This repository contains the implementation of opus-deep structural disentanglement2 (DSD2), which is developed by the research group of Prof. Jianpeng Ma at Fudan University. The preprint of OPUS-DSD2 is available at https://www.biorxiv.org/content/10.1101/2024.09.24.614854v1 and https://drive.google.com/drive/folders/1tEVu9PjCR-4pvkUK17fAHHpyw6y3rZcK?usp=sharing, while the publication of OPUS-DSD is available at https://www.nature.com/articles/s41592-023-02031-6. An exemplar movie of the OPUS-DSD2 is shown below:

mov3.mp4

OPUS-DSD2 also greatly improves the quality of its reconstructions! image

The major new functionality of OPUS-DSD2 is reconstructing multi-body dynamics from cryo-EM data end-to-end during structural disentanglement! OPUS-DSD2 can not only disentangle 3D structural information by reconstructing different conformations, but also reconstruct physically meaningful dynamics for the macromolecules. This new function is very easy to use if you have already been familiar with Relion's multi-body refinement (https://elifesciences.org/articles/36861). OPUS-DSD2 takes the input files of Relion's multi-body refinement, then performs structural disentanglement and multi-body dynamics fitting simultaneously and end-to-end!

This program is built upon a set of great works:

This project seeks to unravel how a latent space, encoding 3D structural information, can be learned by utilizing only 2D image supervisions which are aligned against a consensus reference model.

An informative latent space is pivotal as it simplifies data analysis by providing a structured and reduced-dimensional representation of the data.

Our approach strategically leverages the inevitable pose assignment errors introduced during consensus refinement, while concurrently mitigating their impact on the quality of 3D reconstructions. Although it might seem paradoxical to exploit errors while minimizing their effects, our method has proven effective in achieving this delicate balance.

The workflow of OPUS-DSD is demonstrated as follows: image

Note that all input and output of this method are in real space! (Fourier space is good, but how about real space!) The architecture of encoder is (Encoder class in cryodrgn/models.py): Alt text

The architecture of composition decoder is (ConvTemplate class in cryodrgn/models.py. In this version, the default size of output volume is set to 192^3, I downsampled the intermediate activations to save some gpu memories. You can tune it as you wish, happy training!):

Alt text

The architecture of dynamics decoder is:

image

Spliceosome complex

OPUS-DSD2 has superior structural disentanglement ability to encode distinct compositional changes into different PCs in composition latent space.

mov1.mp4

Covid Spike Protein

The open and close of S1 region resolved by OPUS-DSD2 multi-body dynamics, PC3 shows the dynamics to open S1 region, while PC1 shows the dynamics to close S1 region. Red arrows indicate the directions of movements of S1 subunits. Note that these two modes are orthogonal, and have similar patterns of the vibrational dynamics of water molecule H_2O https://strawberryfields.ai/photonics/apps/run_tutorial_dynamics.html

covdefpc3.mp4

The conformational changes inside the NTD of S1 region and S2 region resolved by OPUS-DSD2

covpc3.mp4

80S ribosome

The following results are from the legacy OPUS-DSD. An exmaple UMAP of latent space for 80S ribosome learned by the legacy OPUS-DSD:

Alt text

Comparison between some states: Alt text

A more colorful umap is shown above. The particles are colored according to their projection classes, note that the clusters often show certain dominant colors, which is due to structural variations in images are accounted in the consensus refinement by distorting the pose paramters of each particle, like fitting a longer rod into a small gap by tilting the rod!

Alt text

Data source: EMPIAR-10002. The particles are colored according to their pose parameters in this image.

set up environment

After cloning the repository, to run this program, you need to have an environment with pytorch and a machine with GPUs. The recommended configuration is a machine with 4 V100 GPUs. You can create the conda environment for DSD using one of the environment files in the folder by executing

conda env create --name dsd -f environmentcu11torch11.yml

This environment primarily contains cuda 11.3 and pytorch 1.11.0. To create an environment with cuda 11.3 and pytorch 1.10.1, you can choose environmentcu11.yml. Lastly, environment.yml contains cuda 10.2 and pytorch 1.11.0. On V100 GPU, OPUS-DSD with cuda 11.3 is 20% faster than OPUS-DSD with cuda 10.2. However, it's worth noting that OPUS-DSD has not been tested on Pytorch version higher than 1.11.0. We recommend using pytorch version 1.10.1 or 1.11.0. After the environment is sucessfully created, you can then activate it and execute our program within this environement.

conda activate dsd

You can then install OPUS-DSD by changing to the directory with cloned repository, and execute

pip install -e .

OPUS-DSD can be kept up to date by

git pull

The inference pipeline of our program can run on any GPU which supports cuda 10.2 or 11.3 and is fast to generate a 3D volume. However, the training of our program takes larger amount memory, we recommend using V100 GPUs at least.

prepare data

This program is developed based on cryoDRGN and adheres to a similar data preparation process.

Data Preparation Guidelines:

  1. Cryo-EM Dataset: Ensure that the cryo-EM dataset is stored in the MRCS stack file format. Suitable datasets for tutorial are the covid spike protein which is available at: https://pan.baidu.com/s/1PAs7uaUIIeyqegq3sdciYg?pwd=v24j (It contains all required files for training and trained weights), source (https://empiar.pdbj.org/entry/10492)

and the spliceosome which is available at https://empiar.pdbj.org/entry/10180/ (It contains the consensus refinement result.)

  1. Consensus Refinement Result: The program requires a consensus refinement result, which should not apply any symmetry and must be stored as a Relion STAR file. Other 3D reconstruction results such as 3D classification, as long as they determine the pose parameters of images, can also be supplied as input.

Usage Example:

In overall, the commands for training in OPUS-DSD can be invoked by calling

dsd commandx ...

while the commands for result analysis can be accessed by calling

dsdsh commandx ...

More information about each argument of the command can be displayed using

dsd commandx -h 

or

dsdsh commandx -h

Data Preparation for OPUS-DSD Using dsdsh prepare:

There is a command dsdsh prepare for data preparation. Under the hood, dsdsh prepare points to the prepare.sh inside analysis_scripts. Suppose the version of star file is 3.1, the above process can be simplified as,

dsdsh prepare /work/consensus_data.star 320 1.699 --relion31
                $1                      $2    $3    $4
  • $1 specifies the path of the starfile,
  • $2 specifies the dimension of image
  • $3 specifies the angstrom per pixel of image
  • $4 indicates the version of starfile, only include --relion31 if the file version is higher than 3.0

The pose and ctf pkls can be found in the same directory of the starfile, in this case, the pose pkl is /work/consensus_data_pose_euler.pkl, and the ctf pkl is /work/consensus_data_ctf.pkl

Next, you need to prepare a image stack. Suppose you have downloaded the spliceosome dataset. You can prepare a particle stack named all.mrcs using

relion_stack_create --i consensus_data.star --o all --one_by_one

Sometimes after running some protocols in Relion using all.star, Relion might sort the order of images in the corresponding output starfile. You should make sure that the output starfile and the input all.star have the same order of images, thus the output starfile have the correct parameters for the images in all.mrcs!

Finally, you should create a mask using the consensus model and RELION through postprocess. The detailed procedure for mask creation can be found in https://relion.readthedocs.io/en/release-3.1/SPA_tutorial/Mask.html. The spliceosome dataset on empiar comes with a global_mask.mrc file. Suppose the filename of mask is mask.mrc, move it to the program directory for simplicity.

Data Preparation for OPUS-DSD2 with Composition and Dynamics Disentanglement

OPUS-DSD2 features a new capacity to reconstruct multi-body dynamics and resolving compositional heterogeniety. Reconstructing multibody dynamics in OPUS-DSD2 is very similar to Relion's multibody refinement protocol though the underlying dynamics model is different (You can see the details in the preprint). First of all, you shall create a set of masks following Relion's multibody refinement protocol. You can find examples about masks and input starfile for the multibody refinement of spliceosome in https://empiar.pdbj.org/entry/10180/. There is also a tutorial with detailed process for creating masks in https://www.cryst.bbk.ac.uk/embo2019/pracs/RELION%20practical%20EMBO%202019_post%20practice.pdf . The segment map tool in ChimeraX is also perfect for creating segmentations.

After the masks and the starfile for multibody refinement are created, you can prepare the pkls for multibody dynamics estimation by executing

dsdsh prepare_multi starfile D apix masks numb --volumes VOLUMES

The details about each argument can be checked using dsdsh prepare_multi -h The prepare_multi commands will create a pkl file that contains the parameters of defined bodies, which will be stored in the same directory as the starfile. The translation of each body is defined using the rotation around its reference body. The magnitude of its translation then is the magnitude of rotation of the center of this body in relative to its reference body. OPUS-DSD2 will read the reference bodies from the starfile, which are specified in _rlnBodyRotateRelativeTo. But you should note that the index of body starts from 1. The body that occurs most often as the reference body for others will be selected as the translation-free center. The translation of center will always be set to zero. The direction of rotational axis for the translation of multi-bodies can be determined using the volume series found by PCA analysis for the OPUS-DSD's result (You can specify the volume series by giving the directory of volume series using --volumes), since they represents a possible mode of movement. Otherwise, the direction of rotational axis will be aligned to the displacement between the center of a body and the center of its center body.

After executing these steps, you have all pkls and files required for running opus-DSD2.

We provided a mask pkl file for covid spike protein in : https://pan.baidu.com/s/1PAs7uaUIIeyqegq3sdciYg?pwd=v24j , which is named as mask_test.pkl. The segmentation of covid spike protein is shown below:

seg

We provided a mask pkl file for spliceosome using the EMPIAR deposited definiton of multi-bodies in https://drive.google.com/file/d/19fKECY3BDNboXmRUp0fPo9YJf3QarR_u/view?usp=drive_link . In the folder from link https://drive.google.com/drive/folders/1tEVu9PjCR-4pvkUK17fAHHpyw6y3rZcK?usp=sharing, you can also find the pose and ctf pkl and the global mask, which are named as consensus_data_pose_euler.pkl, consensus_data_ctf.pkl and mask.mrc, respectively. We also deposited a trained weight and latent encoding file for visualzation in the same folder. The segmentation of spliceosome given by deposited result is shown below.

image16

Data preparation under the hood

OPUS-DSD follows cryoDRGN's input formats. The pose and ctf parameters for image stack are stored as the python pickle files, aka pkl. Suppose the refinement result is stored as consensus_data.star and the format of the Relion STAR file is below version 3.0, and the consensus_data.star is located at /work/ directory, you can convert STAR to the pose pkl file by executing the command below:

dsd parse_pose_star /work/consensus_data.star -D 320 --Apix 1.699 -o sp-pose-euler.pkl

where

argument explanation
-D the dimension of the particle image in your dataset
--Apix is the angstrom per pixel of you dataset
-o followed by the filename of pose parameter used by our program
--relion31 include this argument if you are using star file from relion with version higher than 3.0

Next, you can convert STAR to the ctf pkl file by executing:

dsd parse_ctf_star /work/consensus_data.star -D 320 --Apix 1.699 -o sp-ctf.pkl

For the RELION STAR file with version hgiher than 3.0, you should add --relion31 to the command!

training

train_cv for OPUS-DSD

When the inputs are available, you can train the vae for structural disentanglement proposed in OPUS-DSD's paper using

dsd train_cv /work/all.mrcs --ctf ./sp-ctf.pkl --poses ./sp-pose-euler.pkl --lazy-single --pe-type vanilla --encode-mode grad --template-type conv -n 20 -b 12 --zdim 12 --lr 1.e-4 --num-gpus 4 --multigpu --beta-control 2. --beta cos -o /work/sp -r ./mask.mrc --downfrac 0.75 --valfrac 0.25 --lamb 1. --split sp-split.pkl --bfactor 4. --templateres 224

The argument following train_cv specifies the image stack. The three arguments --pe-type vanilla --encode-mode grad --template-type conv ensure OPUS-DSD is selected! Our program set the default values of those arguments to the values shown in above command.

The functionality of each argument is explained in the table:

argument explanation
--ctf ctf parameters of the image stack
--poses pose parameters of the image stack
-n the number of training epoches, each training epoch loops through all images in the training set
-b the number of images for each batch on each gpu, depends on the size of available gpu memory
--zdim the dimension of latent encodings, increase the zdim will improve the fitting capacity of neural network, but may risk of overfitting
--lr the initial learning rate for adam optimizer, 1.e-4 should work fine.
--num-gpus the number of gpus used for training, note that the total number of images in the total batch will be n*num-gpus
--multigpu toggle on the data parallel version
--beta-control the restraint strength of the beta-vae prior, the larger the argument, the stronger the restraint. The scale of beta-control should be propotional to the SNR of dataset. Suitable beta-control might help disentanglement by increasing the magnitude of latent encodings and the sparsity of latent encodings, for more details, check out beta vae paper. In our implementation, we adjust the scale of beta-control automatically based on SNR estimation, possible ranges of this argument are [0.5-4.]. You can use larger beta-control for dataset with higher SNR
--beta the schedule for restraint stengths, cos implements the cyclic annealing schedule and is the default option
-o the directory name for storing results, such as model weights, latent encodings
-r the solvent mask created from consensus model, our program will focus on fitting the contents inside the mask (more specifically, the 2D projection of a 3D mask). Since the majority part of image doesn't contain electron density, using the original image size is wasteful, by specifying a mask, our program will automatically determine a suitable crop rate to keep only the region with densities.
--downfrac the downsampling fraction of input image, the input to network will be downsampled to size of D*downfrac, where D is the original size of image. You can set it according to resolution of consensus model and the templateres you set.
--lamb the restraint strength of structural disentanglement prior proposed in DSD, set it according to the SNR of your dataset, for dataset with high SNR such as ribosome, spliceosome, you can set it to 1. or higher, for dataset with lower SNR, consider lowering it. Possible ranges are [0.1, 3.]. If you find the UMAP of embeedings is exaggeratedly stretched into a ribbon, then the lamb you used during training is too high!
--split the filename for storing the train-validation split of image stack
--valfrac the fraction of images in the validation set, default is 0.1
--bfactor will apply exp(-bfactor/4 * s^2 * 4*pi^2) decaying to the FT of reconstruction, s is the magnitude of frequency, increase it leads to sharper reconstruction, but takes longer to reveal the part of model with weak density since it actually dampens learning rate, possible ranges are [3, 6]. Consider using higher values for more dynamic structures. We will decay the bfactor slightly in every epoch. This is equivalent to learning rate warming up.
--templateres the size of output volume of our convolutional network, it will be further resampled by spatial transformer before projecting to 2D images. The default value is 192. You may keep it around D*downfrac/0.75, which is larger than the input size. This corresponds to downsampling from the output volume of our network. You can tweak it to other resolutions, larger resolutions can generate sharper density maps, choices are Nx16, where N is integer between 8 and 16
--plot you can also specify this argument if you want to monitor how the reconstruction progress, our program will display the 2D reconstructions and experimental images after 8 times logging intervals. Namely, you switch to interative mode by including this. The interative mode should be run using command python -m cryodrgn.commands.train_cv
--tmp-prefix the prefix of intermediate reconstructions, default value is tmp. OPUS-DSD will output temporary reconstructions to the root directory of this program when training, whose names are $tmp-prefix.mrc
--notinmem include this arguement to let OPUS-DSD reading image stacks from hard disk during training, this is helpful when a huge dataset cannot fit in the memory

The plot mode will ouput the following images in the directory where you issued the training command:

image

Each row shows a selected image and its reconstruction from a batch. In the first row, the first image is the experimental image supplemented to encoder, the second image is a 2D reconstruction blurred by the corresponding CTF, the third image is the correpsonding experimental image after 2D masking.

You can use nohup to let the above command execute in background and use redirections like 1>log 2>err to redirect ouput and error messages to the corresponding files. Happy Training! Open an issue when running into any troubles.

To restart execution from a checkpoint, you can use

dsd train_cv /work/all.mrcs --ctf ./sp-ctf.pkl --poses ./sp-pose-euler.pkl --lazy-single -n 20 --pe-type vanilla --encode-mode grad --template-type conv -b 12 --zdim 12 --lr 1.e-4  --num-gpus 4 --multigpu --beta-control 2. --beta cos -o /work/sp -r ./mask.mrc --downfrac 0.75 --lamb 1. --valfrac 0.25 --load /work/sp/weights.0.pkl --latents /work/sp/z.0.pkl --split sp-split.pkl --bfactor 4. --templateres 224
argument explanation
--load the weight checkpoint from the restarting epoch
--latents the latent encodings from the restarting epoch

both are in the output directory

During training, opus-DSD will output temporary volumes called tmp*.mrc (or the prefix you specified), you can check the intermediate results by viewing them in Chimera. Opus-DSD uses 3D volume as intermediate representation, so it requires larger amount of memory, v100 gpus will be sufficient for its training. Its training speed is slower, which requires 2 hours on 4 v100 gpus to finish one epoch on dataset with 20k images. By default, opus-DSD reads all images into memory before training, so it may require some more host memories To disable this behavior, you can include --notinmem into the training command.

train_multi for OPUS-DSD2

To reconstruct the multi-body dynamics, you should use the command dsd train_multi, using dsd train_multi -h to check more details. Tho enbale dynamics reconstruction, you shall specify --masks to load the mask pkl with the parameters for each body. An example command is as below:

dsd train_multi /work/all.mrcs --ctf /work/all_ctf.pkl --poses /work/all_pose_euler.pkl -n 20 -b 12 --zdim 12 --lr 1.e-4 --num-gpus 4 --multigpu --beta-control 2. -o ./ -r /work/MaskCreate/job001/mask.mrc --split /work/pkls/sa-split.pkl --lamb 1. --bfactor 3.75 --downfrac 0.75 --valfrac 0.25 --templateres 224 --masks /work/mask_params.pkl --zaffdim 6 --plot
argument explanation
--zaffdim controls the dimension of dynamics latent space
--masks the pkl file contains the parameters for bodies of the macromolecule

If you omit --masks in the train_multi command, OPUS-DSD2 will estimate a global pose correction instead (train_cv also does this).

analyze result

You can use the analysis scripts in dsdsh to visualize the learned latent space! The analysis procedure is detailed as following.

The analysis scripts can be invoked by calling command like

dsdsh commandx ...

To access detailed usage information for each command, execute the following:

dsdsh commandx -h

sample latent spaces

The first step is to sample the latent space using kmeans and PCA algorithms. Suppose the training results are in /work/sp,

dsdsh analyze /work/sp 16 4 16
                $1    $2 $3 $4
  • $1 is the output directory used in training, which stores weights.*.pkl, z.*.pkl, config.pkl
  • $2 is the epoch number you would like to analyze,
  • $3 is the number of PCs you would like to sample for traversal
  • $4 is the number of clusters for kmeans clustering.

The analysis result will be stored in /work/sp/analyze.16, i.e., the output directory plus the epoch number you analyzed, using the above command. You can find the UMAP with the labeled kmeans centers in /work/sp/analyze.16/kmeans16/umap.png and the umap with particles colored by their projection parameter in /work/sp/analyze.16/umap.png .

reconstruct volumes

After executing the above command once, you may skip the lengthy umap embedding laterly by appending --skip-umap to the command in analyze.sh. Our analysis script will read the pickled umap embeddings directly. The eval_vol command has following options,

If the model is trained by fitting multi-body dynamics, we have a mode, eval_vol, to reconstruct the multi-body dynamics, using the command

dsdsh eval_vol resdir N dpc num apix --masks MASKS --kmeans KMEANS --dfk DFK

This command selects the DFK class from the kmeans{KMEANS} folder as the template volume, which will be deformed according to the dynamics defined by the PC{dpc} of the dynamics latent space. The generated volumes show the dynamics on the selected class along the selected PC, and can be found in defanalyze.{N}/pc{dpc}. Details about each argument can be checked using dsdsh eval_vol -h. As an example, if you execute:

dsdsh eval_vol . 19 dpc 2 2.2 --masks ../mask_test.pkl --kmeans 20 --dfk 4

This will generate volumes along pc 2 in dynamics latent space using the 4th cluster in kmeans20 as template volume.

You can either generate the volume which corresponds to KMeans cluster centroid or traverses the principal component using, (you can check the content of script first, there are two commands, one is used to evaluate volume at kmeans center, another one is for PC traversal, just choose one according to your use case)

dsdsh eval_vol /work/sp 16 kmeans 16 2.2
                 $1     $2   $3   $4  $5
  • $1 is the output directory used in training, which stores weights.*.pkl, z.*.pkl, config.pkl and the clustering result
  • $2 is the epoch number you just analyzed
  • $3 specifies the kind of analysis result where volumes are generated, i.e., kmeans clusters or principal components, use kmeans for kmeans clustering, or pc for principal components
  • $4 is the number of kmeans clusters (or principal component) you used in analysis
  • $5 is the apix of the generated volumes, you can specify a target value

change to directory /work/sp/analyze.16/kmeans16 to checkout the reference*.mrc, which are the reconstructions correspond to the cluster centroids.

You can use

dsdsh eval_vol /work/sp 16 pc 1 2.2
                $1      $2 $3 $4 $5

to generate volumes along pc1. You can check volumes in /work/sp/analyze.16/pc1. You can make a movie using chimerax's vseries feature. An example script for visualizing movie is in analysis_scripts/movie.py. You can show the movie of the volumes by ChimeraX --script "./analysis_scripts/movie.py reference 0.985. PCs are great for visualizing the main motions and compositional changes of marcomolecules, while KMeans reveals representative conformations in higher qualities. If you want to invert the handness of reconstruction, you can include --flip to the above commands.

select particles

Finally, you can also retrieve the star files for images in each kmeans cluster using

dsdsh parse_pose /work/consensus_data.star 320 1.699 /work/sp 16 16 --relion31
                                $1           $2  $3    $4     $5 $6  $7
  • $1 is the star file of all images
  • $2 is the dimension of image
  • $3 is apix value of image
  • $4 is the output directory used in training
  • $5 is the epoch number you just analyzed
  • $6 is the number of kmeans clusters you used in analysis
  • $7 indicates the version of starfile, only include this when the version of starfile is higher than 3.0

change to directory /work/sp/analyze.16/kmeans16 to checkout the starfile for images in each cluster.