Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

Open
wants to merge 101 commits into
base: develop
aa2a88f
Start working on MMCIF parser
marinegor May 22, 2024
218cf43
Add first (not working) version of MMCIFReader and MMCIF topology parser
marinegor May 22, 2024
7f78e02
Do some squashing
marinegor May 22, 2024
6682d6e
Remove inherited docs
marinegor May 22, 2024
817f3a0
Try improving the parsing
marinegor May 22, 2024
3cc8c80
Try three independent loops over the model
marinegor May 30, 2024
f1bf325
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor Jul 25, 2024
d21c220
Add gemmi dependency
marinegor Sep 13, 2024
2a1be15
necessary params
marinegor Sep 20, 2024
77645e6
finished sorting atom attrs
marinegor Sep 20, 2024
91e6942
add function for transformation into *idx
marinegor Sep 20, 2024
9a0c086
oh damn seems to finally be working
marinegor Sep 20, 2024
9c731df
remove TODOs
marinegor Sep 20, 2024
8b40ec7
Remove debug prints
marinegor Sep 20, 2024
bdcbd73
Merge branch 'develop' into feature/mmcif
marinegor Sep 22, 2024
401a4d3
try to pack things into separate class in utils?
marinegor Sep 22, 2024
9c336bd
remove unnecessary functions
marinegor Sep 22, 2024
def88e4
copy all loops into separate functions
marinegor Sep 23, 2024
cabfd37
Move loops over structures into functions
marinegor Sep 23, 2024
4c9d930
Move coordinate fetching into function for the coordinate reader as well
marinegor Sep 23, 2024
184491a
Fix imports
marinegor Sep 23, 2024
3de8565
Start adding documentation
marinegor Sep 30, 2024
ca6ebbb
Reference MMCIFParser in PDBParser
marinegor Oct 1, 2024
45077ad
Add documentation for trajectory and topology parsers
marinegor Oct 1, 2024
9a1a59a
Add mmcif tests
marinegor Oct 2, 2024
27c10d6
Update format specifications
marinegor Oct 2, 2024
950cfcf
Write simple tests
marinegor Oct 2, 2024
8d1a8b5
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor Oct 24, 2024
ef29338
update github action with gemmi
marinegor Oct 24, 2024
caca17e
fix gemmi import errors
marinegor Oct 24, 2024
f0e49cc
add mmcif testfiles
marinegor Oct 24, 2024
b7ada7c
add mmcif to __all__
marinegor Oct 24, 2024
e80632c
add black instead of ruff
marinegor Oct 25, 2024
10f3124
Merge remote-tracking branch 'origin/feature/mmcif' into feature/mmcif
marinegor Feb 7, 2025
98353fe
fix function signature
marinegor Feb 10, 2025
35fa187
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor Feb 18, 2025
e68fcce
Add documentation for mmcif coords
marinegor Feb 19, 2025
263e9f1
expand documentation and type annotations
marinegor Feb 20, 2025
ba47d53
add invalid cif and MMCIF rst files
marinegor Feb 20, 2025
9ffb6f2
add mmcif with invalid atom type
marinegor Feb 20, 2025
fcfc6c0
add biopython cif and fix invalid cif formatting
marinegor Feb 20, 2025
0de720e
remove weird docs part
marinegor Feb 20, 2025
236b286
fix fstring
marinegor Feb 20, 2025
b562115
replace version to 2.9.0
marinegor Feb 20, 2025
816b23f
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor Feb 20, 2025
92ae164
update changelog
marinegor Feb 20, 2025
88c64a3
move gemmi to optional deps
marinegor Feb 20, 2025
59b7e29
fix issue with accidentally updated datafiles
marinegor Feb 20, 2025
f2c23c8
add mmcif to all
marinegor Feb 21, 2025
776676e
Start working on MMCIF parser
marinegor May 22, 2024
71e60f4
Add first (not working) version of MMCIFReader and MMCIF topology parser
marinegor May 22, 2024
36b7125
Do some squashing
marinegor May 22, 2024
b058941
Remove inherited docs
marinegor May 22, 2024
ef30fa7
Try improving the parsing
marinegor May 22, 2024
95572c1
Try three independent loops over the model
marinegor May 30, 2024
a8a9436
Add gemmi dependency
marinegor Sep 13, 2024
6706bbe
necessary params
marinegor Sep 20, 2024
8cf9da4
finished sorting atom attrs
marinegor Sep 20, 2024
f13156b
add function for transformation into *idx
marinegor Sep 20, 2024
dda981c
oh damn seems to finally be working
marinegor Sep 20, 2024
ebdf849
remove TODOs
marinegor Sep 20, 2024
47043f6
Remove debug prints
marinegor Sep 20, 2024
9770d7b
try to pack things into separate class in utils?
marinegor Sep 22, 2024
fd7f70d
remove unnecessary functions
marinegor Sep 22, 2024
1493056
copy all loops into separate functions
marinegor Sep 23, 2024
3d7fbb9
Move loops over structures into functions
marinegor Sep 23, 2024
9b9286e
Move coordinate fetching into function for the coordinate reader as well
marinegor Sep 23, 2024
b8f3c04
Fix imports
marinegor Sep 23, 2024
0f38a2d
Start adding documentation
marinegor Sep 30, 2024
b915aab
Reference MMCIFParser in PDBParser
marinegor Oct 1, 2024
0d61248
Add documentation for trajectory and topology parsers
marinegor Oct 1, 2024
34d76ca
Add mmcif tests
marinegor Oct 2, 2024
b242aa5
Update format specifications
marinegor Oct 2, 2024
4fc3a78
Write simple tests
marinegor Oct 2, 2024
14fa756
fix actions
marinegor Feb 22, 2025
e3a9a1f
fix gemmi import errors
marinegor Oct 24, 2024
d492b4e
add mmcif testfiles
marinegor Oct 24, 2024
1880e4a
add mmcif to __all__
marinegor Oct 24, 2024
927d7a0
add black instead of ruff
marinegor Oct 25, 2024
ad0f0be
fix function signature
marinegor Feb 10, 2025
e03c3e5
Add documentation for mmcif coords
marinegor Feb 19, 2025
4d79205
expand documentation and type annotations
marinegor Feb 20, 2025
32d7cf9
add invalid cif and MMCIF rst files
marinegor Feb 20, 2025
0df8c3a
add mmcif with invalid atom type
marinegor Feb 20, 2025
05c6ea1
add biopython cif and fix invalid cif formatting
marinegor Feb 20, 2025
88dab79
remove weird docs part
marinegor Feb 20, 2025
a82fe52
fix fstring
marinegor Feb 20, 2025
e3f1714
replace version to 2.9.0
marinegor Feb 20, 2025
db46016
fix actions
marinegor Feb 20, 2025
32cd103
fix datafiles
marinegor Feb 22, 2025
805089e
add mmcif to all
marinegor Feb 21, 2025
55c3dbb
add mmcif to coordinates and topology modules
marinegor Feb 22, 2025
cd201d0
update docs following yuxuanzhuang comments
marinegor Feb 22, 2025
81f0b5b
merge remote
marinegor Feb 22, 2025
22d1cca
add linked issues and prs to changelog
marinegor Feb 22, 2025
d1ba434
remove mmcif files from black ignore
marinegor Feb 23, 2025
a03b56f
add tests for multimodel file warnings
marinegor Feb 23, 2025
bd4c255
add tests for cryst1 warnings
marinegor Feb 23, 2025
3d61dc5
black
marinegor Feb 23, 2025
aed9b54
add invalid cif file itself
marinegor Feb 23, 2025
53c51f4
format datafiles with black
marinegor Feb 23, 2025
7 changes: 5 additions & 2 deletions .github/actions/setup-deps/action.yaml
@@ -21,8 +21,8 @@ inputs:
default: 'codecov'
cython:
default: 'cython'
filelock:
default: 'filelock'
fasteners:
Contributor

Suggested change
fasteners:
filelock:

Contributor Author

that's on me not merging develop-1

Contributor

I assume this is not fixed yet?

default: 'fasteners'
Member

Please add optional deps down in the optional deps section below.

Contributor

Suggested change
default: 'fasteners'
default: 'filelock'

Contributor Author

that's on me not merging develop-2

griddataformats:
default: 'griddataformats'
gsd:
@@ -60,6 +60,8 @@ inputs:
default: 'dask'
distopia:
default: 'distopia>=0.4.0'
gemmi:
default: 'gemmi'
h5py:
default: 'h5py>=2.10'
hole2:
@@ -130,6 +132,7 @@ runs:
${{ inputs.dask }}
${{ inputs.distopia }}
${{ inputs.gsd }}
${{ inputs.gemmi }}
${{ inputs.h5py }}
${{ inputs.hole2 }}
${{ inputs.joblib }}
2 changes: 2 additions & 0 deletions package/CHANGELOG
@@ -27,6 +27,8 @@ Fixes
the function to prevent shared state. (Issue #4655)

Enhancements
* Implementation of PDBx/MMCIF coordinate and topology reader based on
`gemmi` library (fixes issue #2367, extends issue #4303, PR #4712)
* Improve distopia backend support in line with new functionality available
in distopia >= 0.3.1 (PR #4734)
* Addition of 'water' token for water selection (Issue #4839)
147 changes: 147 additions & 0 deletions package/MDAnalysis/coordinates/MMCIF.py
@@ -0,0 +1,147 @@
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-
# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
#
"""
MMCIF structure files in MDAnalysis --- :mod:`MDAnalysis.coordinates.MMCIF`
==========================================================================

MDAnalysis reads coordinates from MMCIF (macromolecular Crystallographic Information File) files, also known as the PDBx/mmCIF format,
using the ``gemmi`` library as a backend. MMCIF is a more modern and flexible alternative to the PDB format,
capable of storing detailed structural and experimental data about biological macromolecules.

MMCIF files use a structured, tabular format with key-value pairs to store both coordinate and atom information.
The format supports multiple models/frames, but this implementation currently reads only the first model
and emits a warning for multi-model files.

Basic usage
-----------

Reading an MMCIF file is straightforward:

.. code-block:: python

    import MDAnalysis as mda
    u = mda.Universe("structure.cif")


The reader automatically detects placeholder unit cell information, i.e. cell parameters of
(1, 1, 1, 90, 90, 90), which is usually the case for cryoEM structures,
and sets the dimensions to ``None`` in that case.

Capabilities
------------

The MMCIF reader implementation uses the gemmi library to parse files and extract coordinates
and unit cell information. Currently only reading capability is supported, with the following
features:

- Single frame/model reading
- Unit cell dimensions detection
- Support for compressed .cif.gz files
- Automatic handling of placeholder unit cells for cryoEM structures

Examples
--------

Basic structure loading:

.. code-block:: python

    import MDAnalysis as mda

    # Load structure from MMCIF
    u = mda.Universe("structure.cif")

    # or from cif.gz file
    u = mda.Universe("structure.cif.gz")

Classes
-------

.. autoclass:: MMCIFReader
   :members:
   :inherited-members:

See Also
--------
- `wwPDB MMCIF Resources <http://mmcif.wwpdb.org>`_
- `Gemmi library documentation <https://gemmi.readthedocs.io>`_

.. versionadded:: 2.9.0
"""

import logging
import warnings

import numpy as np

from . import base

try:
    import gemmi

    HAS_GEMMI = True
except ImportError:
    HAS_GEMMI = False

logger = logging.getLogger("MDAnalysis.coordinates.MMCIF")


def get_coordinates(model: "gemmi.Model") -> np.ndarray:
    """Get coordinates of all atoms in the `gemmi.Model` object.

    Parameters
    ----------
    model
        input `gemmi.Model`, e.g. `gemmi.read_structure('file.cif')[0]`

    Returns
    -------
    np.ndarray, shape [n, 3], where `n` is the number of atoms in the structure.
    """
    return np.array(
        [[*at.pos.tolist()] for chain in model for res in chain for at in res]
    )


class MMCIFReader(base.SingleFrameReaderBase):
    """Reads from an MMCIF file using the ``gemmi`` library as a backend.

    Notes
    -----

    If the structure represents an ensemble, only the first model in the ensemble
    is read here (and a warning is issued). Also, if the structure has a placeholder "CRYST1"
    record (1, 1, 1, 90, 90, 90), the unit cell dimensions are set to ``None`` instead.

    .. versionadded:: 2.9.0
    """

    format = ["cif", "cif.gz", "mmcif"]
    units = {"time": None, "length": "Angstrom"}

    def _read_first_frame(self):
        structure = gemmi.read_structure(self.filename)
        cell_dims = np.array(
            [
                getattr(structure.cell, name)
                for name in ("a", "b", "c", "alpha", "beta", "gamma")
            ]
        )
        if len(structure) > 1:
            warnings.warn(  # FIXME: add tests for this

Codecov / codecov/patch: added line #L130 in package/MDAnalysis/coordinates/MMCIF.py was not covered by tests
Contributor

#FIXME :)

Contributor

Wait, you do have multimodel_warning.cif tested---I guess codecov is outdated?

                f"File {self.filename} has {len(structure)} models, but only the first one will be read"
            )

        model = structure[0]
        coords = get_coordinates(model)
        self.n_atoms = len(coords)
        self.ts = self._Timestep.from_coordinates(coords, **self._ts_kwargs)
        if np.allclose(cell_dims, np.array([1.0, 1.0, 1.0, 90.0, 90.0, 90.0])):
            warnings.warn(
                "1 A^3 CRYST1 record,"
                " this is usually a placeholder."
                " Unit cell dimensions will be set to None."
            )
            self.ts.dimensions = None
        else:
            self.ts.dimensions = cell_dims
        self.ts.frame = 0
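
The first-frame logic in this file can be sketched without gemmi or NumPy installed. This is a minimal sketch under stated assumptions, not the PR's implementation: `is_placeholder_cell` and `select_first_model` are hypothetical helper names that do not appear in the diff, the models are plain Python objects standing in for `gemmi.Model`, and `math.isclose` only approximates the `np.allclose` comparison the reader uses.

```python
import math
import warnings

# Cell parameters (a, b, c, alpha, beta, gamma) treated as a placeholder.
PLACEHOLDER_CELL = (1.0, 1.0, 1.0, 90.0, 90.0, 90.0)


def is_placeholder_cell(cell, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if every cell parameter is close to the 1 A^3 placeholder."""
    return len(cell) == len(PLACEHOLDER_CELL) and all(
        math.isclose(value, ref, rel_tol=rel_tol, abs_tol=abs_tol)
        for value, ref in zip(cell, PLACEHOLDER_CELL)
    )


def select_first_model(models, cell, filename="<input>"):
    """Hypothetical helper: pick the first model and normalise the unit cell.

    Returns (first_model, dimensions), where dimensions is None for a
    placeholder cell and a tuple of the six cell parameters otherwise.
    """
    if not models:
        raise ValueError(f"File {filename} contains no models")
    if len(models) > 1:
        warnings.warn(
            f"File {filename} has {len(models)} models, "
            "but only the first one will be read"
        )
    if is_placeholder_cell(cell):
        warnings.warn(
            "1 A^3 CRYST1 record, this is usually a placeholder. "
            "Unit cell dimensions will be set to None."
        )
        return models[0], None
    return models[0], tuple(cell)
```

Keeping the two checks in a pure function like this would make the warning paths easy to exercise in unit tests without any structure files, although the PR itself tests them with dedicated `.cif` files.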

1 change: 1 addition & 0 deletions package/MDAnalysis/coordinates/__init__.py
@@ -791,3 +791,4 @@ class can choose an appropriate reader automatically.
from . import NAMDBIN
from . import FHIAIMS
from . import TNG
from . import MMCIF
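
Because `MMCIF` is now imported unconditionally here while `gemmi` is meant to be an optional dependency, the `HAS_GEMMI` flag in the new module has to keep the import from failing when gemmi is absent. One common way to turn such a flag into a clear error at use time is sketched below. The names `optional_import` and `require` are hypothetical and are not taken from the PR or from MDAnalysis; the diff shown here does not reveal where, or whether, `HAS_GEMMI` is checked.

```python
import importlib


def optional_import(name):
    """Return (module, True) if `name` can be imported, else (None, False)."""
    try:
        return importlib.import_module(name), True
    except ImportError:
        return None, False


def require(name):
    """Return the imported module or raise an ImportError with a clear hint."""
    module, available = optional_import(name)
    if not available:
        raise ImportError(
            f"This feature needs the optional dependency '{name}'; "
            f"install it to enable the feature."
        )
    return module


# Module-level flag in the style of the diff above; safe to evaluate
# whether or not gemmi is installed.
gemmi, HAS_GEMMI = optional_import("gemmi")
```

A reader could then call `require("gemmi")` at the top of its parsing method, so that a missing backend surfaces as an explicit `ImportError` rather than a `NameError` on first use.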