The second version of Cryo2Struct for reconstructing protein structures from cryo-EM density maps

Cryo2Struct2

Cryo2Struct2 is a fully automated method for modeling 3D atomic structures from cryo-EM density maps, building on its predecessor, Cryo2Struct. It employs a multi-task deep learning model that integrates sequence-based features from a Protein Language Model (ESM) with cryo-EM density maps, merging feature representation across modalities. The predicted voxels are then used to construct a Hidden Markov Model (HMM), followed by a customized Viterbi algorithm to align sequences and generate initial protein backbone structures. These backbone models are used as templates for AlphaFold3, which further refines the structures for improved accuracy. By integrating cryo-EM data with AlphaFold3 predictions, Cryo2Struct2 improves structure refinement and helps AlphaFold3 to predict accurate structures.

Setup Environment (Locally)

To set up Cryo2Struct2 locally, follow the steps below. Setting up the environment takes about 3-7 minutes.

Clone this repository and cd into it

git clone https://github.com/BioinfoMachineLearning/Cryo2Strut2.git
cd ./Cryo2Strut2

We set up the environment using Anaconda. Use the following commands to create and activate the conda environment from the cryo2struct2.yml file.

conda env create -f cryo2struct2.yml
conda activate cryo2struct2

Atomic structure modeling using Cryo2Struct2

  1. Input: cryo-EM density map and sequence: First, prepare your own data or use the provided example data. The directory should be organized as follows:
cryo2struct
|── input
    │── 34610
        |-- emd_34610.map
        |-- 8hb0.fasta
        |-- 8hb0.pdb

The emd_34610.map file is the density map with EMD ID 34610, downloaded from the EMDB website. The 8hb0.fasta file is the corresponding sequence file.

The 8hb0.pdb file is a PDB structure file used in this test example to generate embeddings using ESM. Alternatively, users can use the 8hb0.fasta file to generate embeddings from ESM.
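
Before running anything, it can help to confirm that a map directory matches the layout above. The following is a minimal sketch, not part of Cryo2Struct2: the function name check_input_dir is hypothetical, and it assumes the map is named emd_<EMD ID>.map and that any *.fasta file in the directory is the sequence file.

```python
from pathlib import Path


def check_input_dir(input_root, emd_id):
    """Return the names of expected files that are missing for one map."""
    map_dir = Path(input_root) / str(emd_id)
    missing = []
    # The density map name follows the example layout: emd_<EMD ID>.map
    map_name = f"emd_{emd_id}.map"
    if not (map_dir / map_name).is_file():
        missing.append(map_name)
    # The FASTA file is named after the PDB entry, so accept any *.fasta file.
    if not any(map_dir.glob("*.fasta")):
        missing.append("*.fasta")
    return missing
```

For the example data, check_input_dir("input", 34610) should return an empty list once both files are in place.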

The first step is to prepare the input cryo-EM map for Cryo2Struct2. We run UCSF ChimeraX in non-GUI mode to resample the density map to 1 Angstrom, so please install ChimeraX before preprocessing. We used ChimeraX 1.4-1 on a CentOS 8 system. Once ChimeraX is installed, run the following:

bash preprocess/run_data_preparation.bash input/

In the above example, input/ stands for the absolute path of the input directory where the maps are located.
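
To get a feel for what resampling to 1 Angstrom does to the grid size, the arithmetic can be sketched as below. This is only an illustration: the function name resampled_shape is hypothetical, and it assumes the resampled grid covers the same physical box as the original map. ChimeraX may choose the exact grid dimensions differently.

```python
import math


def resampled_shape(shape, voxel_size, target=1.0):
    """Grid dimensions covering the same physical box at the target voxel size."""
    # Physical extent along an axis is (number of voxels) * (voxel size in Angstrom).
    # round(..., 6) removes floating-point noise before taking the ceiling.
    return tuple(math.ceil(round(n * voxel_size / target, 6)) for n in shape)


# Hypothetical map: 256^3 voxels at 1.06 Angstrom per voxel.
print(resampled_shape((256, 256, 256), 1.06))
```

A map that is already sampled at 1 Angstrom keeps its dimensions under this calculation.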

Note: For this example, the normalized map is provided, so there is no need to run the above bash command to prepare the map. Hence, the directory structure for this example looks like this:

cryo2struct
|── input
    │── 34610
        |-- emd_34610.map
        |-- emd_normalized_map.mrc
        |-- 8hb0.fasta
        |-- 8hb0.pdb

  2. Set Up ESM: Set up ESM on your system following the instructions provided at https://github.com/facebookresearch/esm . The esm.pretrained model we used is esm2_t36_3B_UR50D(). Change the path of the saved ESM model in utils/grid_division.py.
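
For orientation, generating per-residue embeddings from a FASTA file with this ESM model can be sketched as follows. This is a minimal sketch based on the usage shown in the ESM repository, not the code in utils/grid_division.py. The helper names read_fasta and embed_sequences are hypothetical, and calling embed_sequences requires torch and fair-esm to be installed and downloads the (large) 3B-parameter weights on first use.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def embed_sequences(records):
    """Return one per-residue embedding tensor per sequence."""
    import torch  # imported lazily so read_fasta works without torch
    import esm    # installed from the ESM repository linked above

    model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()
    _, _, tokens = batch_converter(records)
    with torch.no_grad():
        out = model(tokens, repr_layers=[36], return_contacts=False)
    reps = out["representations"][36]
    # Position 0 is the start token; keep only the residues of each sequence.
    return [reps[i, 1:len(seq) + 1] for i, (_, seq) in enumerate(records)]
```

A typical call would be embed_sequences(read_fasta("input/34610/8hb0.fasta")), where each returned tensor has one row per residue.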

  3. Running Cryo2Struct2: Inference requires the trained atom and amino acid type models. The trained models are available in the Cryo2Struct2 Harvard Dataverse. Use the following commands to download them:

cd models
wget -O amino_acid_type.ckpt https://dataverse.harvard.edu/api/access/datafile/10888677
wget -O atom_type.ckpt https://dataverse.harvard.edu/api/access/datafile/10888678
cd ..

The organization of the downloaded models should look like:

cryo2struct
|── input
    │── 34610
        |-- emd_34610.map
        |-- emd_normalized_map.mrc
        |-- 8hb0.fasta
        |-- 8hb0.pdb
|── models
    |-- amino_acid_type.ckpt
    |-- atom_type.ckpt
    |-- aa_regression_model.pkl
    |-- ca_regression_model.pkl

Update the configuration in the config/arguments.yml file, especially the input data directory, the trained model checkpoint paths, and the density map name. By default, the program runs inference on the CPU; running inference on a GPU speeds up prediction. To enable GPU processing, set infer_run_on to gpu in the configuration file and provide the GPU device ID in infer_on_gpu (example: 0). One way to update the configuration is with the vi editor:

vi config/arguments.yml
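
If you prefer to script the change instead of editing by hand, the two GPU settings can be rewritten with a few lines of Python. This is a sketch under an assumption: it treats infer_run_on and infer_on_gpu as plain "key: value" lines in config/arguments.yml, which the text above suggests but does not show. The helper set_yaml_value is hypothetical and replaces everything after the colon, including any trailing comment.

```python
import re


def set_yaml_value(text, key, value):
    """Replace the value on a `key: value` line, keeping its indentation."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), text)
    if count == 0:
        raise KeyError(f"{key} not found in configuration")
    return new_text


def enable_gpu(config_path, device_id=0):
    """Switch inference to the GPU in an existing configuration file."""
    with open(config_path) as handle:
        text = handle.read()
    text = set_yaml_value(text, "infer_run_on", "gpu")
    text = set_yaml_value(text, "infer_on_gpu", device_id)
    with open(config_path, "w") as handle:
        handle.write(text)
```

Calling enable_gpu("config/arguments.yml", device_id=0) would then apply both changes in one step.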

Compile Modified Viterbi algorithm: The Hidden Markov Model-guided carbon-alpha alignment programs are available in viterbi/. The alignment algorithm is written in C++, so compile it using:

cd viterbi
g++ -fPIC -shared -o viterbi.so viterbi.cpp -O3
cd ..

If the compilation asks for the gcc-c++ package, install it following the instructions. The GCC C++ compiler is required to compile viterbi.cpp.

If the compilation of the program fails due to library issues (which typically occurs when attempting to compile on older systems), you can try compiling using the following approach:

cd viterbi
conda install -c conda-forge gxx
g++ -fPIC -shared -o viterbi.so viterbi.cpp -O3
cd ..

The above command installs the gxx package in the activated conda environment, which provides the GCC C++ compiler. The HMM alignment program runs on the CPU and is compiled with aggressive optimization using the -O3 flag. We tested the above compilation successfully on CentOS 7 and 8 and on AlmaLinux OS 8.8 and 8.9.
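
For readers unfamiliar with the alignment step, the idea behind Viterbi decoding on a Hidden Markov Model can be sketched in a few lines of Python. This is the textbook algorithm in log space, not the customized version in viterbi.cpp. The state names, observation symbols, and probabilities below are made up for illustration only.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def to_log(table):
    """Convert a nested probability table to log-probabilities."""
    return {k: {kk: math.log(vv) for kk, vv in v.items()} for k, v in table.items()}


# Toy model with two hidden states and three observation symbols.
states = ["s1", "s2"]
log_start = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = to_log({"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}})
log_emit = to_log({"s1": {"x": 0.5, "y": 0.4, "z": 0.1},
                   "s2": {"x": 0.1, "y": 0.3, "z": 0.6}})
path = viterbi(["x", "y", "z"], states, log_start, log_trans, log_emit)
print(path)  # ['s1', 's1', 's2']
```

Working in log space avoids numerical underflow on long sequences, which matters when the observation sequence is a full protein chain.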

Finally, run the following:

python3 cryo2struct2.py --density_map_name 34610

  4. Output: Modeled atomic structure. The output model is saved in the density map's directory.

  5. Integrating Cryo2Struct2 Models as Templates for AlphaFold3: The models generated by Cryo2Struct2 are used as templates for AlphaFold3. Use the provided script prepare_script_af3_multichain_multi_template.py to generate the .json files that serve as input to AlphaFold3.

  6. Set up AlphaFold3: Request the AlphaFold3 parameters and follow the setup instructions at https://github.com/google-deepmind/alphafold3 .

  7. Run AlphaFold3: Use the script run_af3_docker_all.py to run AlphaFold3 and predict structures.
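
To illustrate what a template-carrying AlphaFold3 input can look like, the sketch below builds one JSON document per target. It is an assumption-laden example, not the output of prepare_script_af3_multichain_multi_template.py: the field names (mmcifPath, queryIndices, templateIndices, dialect, version) follow the AlphaFold3 input format documentation as we understand it, the function build_af3_input and the template path are hypothetical, and the sketch assumes each template covers every residue of its chain in order.

```python
import json


def build_af3_input(name, chains, seeds=(1,)):
    """Build an AlphaFold3 input dict.

    chains: list of (chain_id, sequence, template_mmcif_path) tuples.
    """
    sequences = []
    for chain_id, sequence, template_path in chains:
        n = len(sequence)
        sequences.append({
            "protein": {
                "id": chain_id,
                "sequence": sequence,
                "templates": [{
                    "mmcifPath": template_path,
                    # Assumption: residue i of the query maps to residue i
                    # of the template (0-based indices).
                    "queryIndices": list(range(n)),
                    "templateIndices": list(range(n)),
                }],
            }
        })
    return {
        "name": name,
        "modelSeeds": list(seeds),
        "sequences": sequences,
        "dialect": "alphafold3",
        "version": 2,
    }


# Hypothetical single-chain example with a made-up sequence and path.
job = build_af3_input("8hb0", [("A", "MKTAYIAK", "templates/8hb0_A.cif")])
print(json.dumps(job, indent=2))
```

In practice the index lists should reflect the actual alignment between the sequence and the residues modeled by Cryo2Struct2, which may not cover the full chain.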

Training Cryo2Struct2 Deep Learning

The training programs are available in the train/ directory. Cryo2Struct2 was trained on Cryo2StructData, which is accessible on the Cryo2StructData Dataverse. Download the full dataset from Cryo2Struct Full Dataset or a small subset from Cryo2Struct Small Subsample Dataset. After downloading the dataset, unzip the compressed files. The directory names are the EMD IDs of the cryo-EM density maps.

The dataset contains the preprocessed maps ready for deep learning training. However, the cryo-EM density map labels need to be prepared. Run the following:

python3 label/get_atoms_label.py density_map_directory
python3 label/get_amino_labels.py density_map_directory

The density_map_directory is the absolute directory path where unzipped cryo-EM density maps are present. The above scripts generate the atom and amino acid-type labels, which are used during the training of the deep learning model.
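
Conceptually, voxel labels come from placing each atom of the known structure into the map grid. The sketch below shows that idea only and is not the logic of the label/ scripts: the function names are hypothetical, it assumes a 1 Angstrom voxel size and a known map origin, and it ignores the axis ordering of MRC files.

```python
def coord_to_voxel(coord, origin, voxel_size=1.0):
    """Map an (x, y, z) coordinate in Angstrom to the nearest voxel index."""
    return tuple(int(round((c - o) / voxel_size)) for c, o in zip(coord, origin))


def label_atoms(atoms, origin, shape, voxel_size=1.0):
    """Return {voxel_index: atom_name} for atoms that fall inside the grid.

    atoms: iterable of (atom_name, (x, y, z)) pairs.
    """
    labels = {}
    for name, coord in atoms:
        index = coord_to_voxel(coord, origin, voxel_size)
        if all(0 <= i < n for i, n in zip(index, shape)):
            labels[index] = name
    return labels


# Made-up coordinates for illustration; the second atom lies outside the grid.
atoms = [("CA", (10.4, 20.6, 5.0)), ("N", (100.2, 3.0, 3.0))]
print(label_atoms(atoms, origin=(0.0, 0.0, 0.0), shape=(64, 64, 64)))
```

Amino acid-type labels can be assigned the same way by storing the residue name instead of the atom name.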

Split the data into training and validation sets. If you choose to use our predefined training and validation splits, refer to the Excel sheet in Cryo2StructData Metadata, which contains the IDs for the training and validation cryo-EM density maps. Create separate directories for training and validation, and move the corresponding data to each directory.
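
Moving the per-map directories into separate training and validation directories can be scripted. The following is a minimal sketch: the function split_dataset is hypothetical, and it assumes you have already extracted the training and validation EMD IDs from the metadata sheet into two Python lists.

```python
import shutil
from pathlib import Path


def split_dataset(data_dir, train_ids, valid_ids, train_dir, valid_dir):
    """Move each EMD-ID directory into the training or validation directory."""
    data_dir = Path(data_dir)
    moved = {"train": [], "valid": []}
    groups = (("train", train_ids, train_dir), ("valid", valid_ids, valid_dir))
    for name, ids, target in groups:
        target = Path(target)
        target.mkdir(parents=True, exist_ok=True)
        for emd_id in ids:
            source = data_dir / str(emd_id)
            # Skip IDs that were not downloaded (for example, with the small subset).
            if source.is_dir():
                shutil.move(str(source), str(target / source.name))
                moved[name].append(source.name)
    return moved
```

The returned dictionary lists which directories were actually moved, which makes it easy to spot IDs missing from the download.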

Generate sub-grids of the cryo-EM density maps in the training and validation datasets. These sub-grids are used to train the model. Run the following:

python3 train/grid_division_train.py train_map_directory train_sub_grids
python3 train/grid_division_train.py valid_map_directory valid_sub_grids

The train_map_directory is the directory containing training cryo-EM density maps, and train_sub_grids is the directory where the training sub-grids will be generated. Similarly, valid_map_directory is the directory containing validation cryo-EM density maps, and valid_sub_grids is the directory where the validation sub-grids will be generated. After generation of sub-grids, run:

ls train_sub_grids > train_splits.txt
ls valid_sub_grids > valid_splits.txt
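
To show what dividing a map into sub-grids involves, the sketch below computes the corner index of every sub-grid in a 3D map. It is illustrative only and does not reproduce train/grid_division_train.py: the function names are hypothetical, and the box size and stride of 32 voxels are assumptions, not values stated in this README.

```python
from itertools import product


def subgrid_starts(length, box, stride):
    """Start indices of boxes along one axis; the last box ends at the map edge."""
    if length <= box:
        return [0]
    starts = list(range(0, length - box + 1, stride))
    if starts[-1] != length - box:
        # Add one overlapping box so the end of the axis is covered.
        starts.append(length - box)
    return starts


def subgrid_origins(shape, box=32, stride=32):
    """Corner indices (i, j, k) of every sub-grid in a 3D map of the given shape."""
    axes = [subgrid_starts(n, box, stride) for n in shape]
    return list(product(*axes))


print(subgrid_starts(100, 32, 32))        # [0, 32, 64, 68]
print(len(subgrid_origins((64, 64, 100))))  # 16
```

Using a stride smaller than the box size produces overlapping sub-grids, which is a common way to reduce edge effects at sub-grid boundaries.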

We used the distributed data parallel (DDP) technique to train the models on 24 compute nodes, each equipped with 6 NVIDIA V100 GPUs with 32GB of memory. The training program can run on a single GPU, multiple GPUs, or a multi-node cluster with multiple GPUs. Finally, in the training script train/train.py, set AVAIL_GPUS to the number of GPUs available in each compute node, NUM_NODES to the number of available compute nodes, BATCH_SIZE to the batch size, and DATASET_DIR to the path of the Cryo2Struct directory. Then, train the model by running:

python3 train/train.py    # trains both amino acid-type and atom prediction model

Monitor the training progress in Weights and Biases.
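
When changing AVAIL_GPUS and NUM_NODES, it is worth checking how many samples are processed per optimization step. The sketch below does that arithmetic. It assumes BATCH_SIZE is the per-GPU batch size, which is the usual DDP convention but is not confirmed by this README, and the batch size of 8 is a made-up value.

```python
def world_size(gpus_per_node, num_nodes):
    """Total number of DDP processes, one per GPU."""
    return gpus_per_node * num_nodes


def effective_batch_size(batch_size, gpus_per_node, num_nodes):
    """Samples per optimization step if batch_size is the per-GPU batch size."""
    return batch_size * world_size(gpus_per_node, num_nodes)


# Setup described above: 24 nodes with 6 GPUs each.
print(world_size(6, 24))               # 144
print(effective_batch_size(8, 6, 24))  # 1152
```

On a single GPU the same BATCH_SIZE gives a much smaller effective batch, so the learning rate or batch size may need adjusting when moving between setups.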

Optional: The source code for data preprocessing, label generation and validation of training data is available at Cryo2StructData GitHub repository.

Contact Information

If you have any questions, feel free to open an issue or reach out to us: ngzvh@missouri.edu, chengji@missouri.edu.

Acknowledgements

We thank the High-Performance Computing (HPC) resource Hellbender, located at the University of Missouri, Columbia, MO, which was used for the training, inference, and alignment processes.
