Implementing `gemmi`-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

marinegor · 2024-09-20T21:57:20Z

Fixes #2367 and also extends #4303

Changes made in this Pull Request:

uses gemmi library (link) to parse mmcif files
adds a class MMCIFReader(base.SingleFrameReaderBase) and class MMCIFParser(TopologyReaderBase) classes for that

As a bonus, this implementation would potentially allow to read any of the gemmi-supported formats (source):

mmCIF (PDBx/mmCIF),
PDB (with popular extensions),
mmJSON

Also, this (with slight modifications) also would allow reading mmcif with multiple models sharing the same topology, as well as more feature-rich parsing of PDBs (the same code without changes can be used for parsing altlocs, charges, etc, from all of these formats).

However, I'm slightly lost on what's to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers here as author of original PDBx implementation 4303).

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Developers certificate of origin

I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.

📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/

pep8speaks · 2024-09-20T21:57:28Z

Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/coordinates/MMCIF.py:

Line 28:80: E501 line too long (84 > 79 characters)
Line 41:80: E501 line too long (85 > 79 characters)
Line 42:80: E501 line too long (93 > 79 characters)
Line 61:80: E501 line too long (104 > 79 characters)
Line 65:80: E501 line too long (87 > 79 characters)
Line 67:80: E501 line too long (107 > 79 characters)

In the file package/MDAnalysis/topology/MMCIFParser.py:

Line 2:24: W291 trailing whitespace
Line 60:80: E501 line too long (111 > 79 characters)
Line 72:80: E501 line too long (123 > 79 characters)
Line 82:80: E501 line too long (122 > 79 characters)
Line 106:80: E501 line too long (108 > 79 characters)
Line 113:80: E501 line too long (80 > 79 characters)
Line 128:80: E501 line too long (91 > 79 characters)
Line 175:80: E501 line too long (126 > 79 characters)
Line 185:80: E501 line too long (125 > 79 characters)
Line 224:80: E501 line too long (126 > 79 characters)
Line 242:80: E501 line too long (140 > 79 characters)
Line 281:80: E501 line too long (87 > 79 characters)
Line 292:80: E501 line too long (90 > 79 characters)

In the file package/MDAnalysis/topology/PDBParser.py:

Line 56:80: E501 line too long (80 > 79 characters)
Line 57:80: E501 line too long (84 > 79 characters)

In the file package/MDAnalysis/topology/__init__.py:

Line 335:26: W292 no newline at end of file

In the file testsuite/MDAnalysisTests/datafiles.py:

Line 48:80: E501 line too long (103 > 79 characters)
Line 81:80: E501 line too long (80 > 79 characters)
Line 97:80: E501 line too long (86 > 79 characters)
Line 271:80: E501 line too long (90 > 79 characters)
Line 340:80: E501 line too long (104 > 79 characters)
Line 387:80: E501 line too long (83 > 79 characters)
Line 436:80: E501 line too long (80 > 79 characters)
Line 463:80: E501 line too long (80 > 79 characters)
Line 481:80: E501 line too long (80 > 79 characters)
Line 493:80: E501 line too long (80 > 79 characters)
Line 494:80: E501 line too long (80 > 79 characters)
Line 497:80: E501 line too long (83 > 79 characters)
Line 498:80: E501 line too long (86 > 79 characters)
Line 546:80: E501 line too long (82 > 79 characters)
Line 547:80: E501 line too long (82 > 79 characters)
Line 549:80: E501 line too long (88 > 79 characters)
Line 551:80: E501 line too long (88 > 79 characters)
Line 552:80: E501 line too long (81 > 79 characters)
Line 777:80: E501 line too long (81 > 79 characters)
Line 778:80: E501 line too long (87 > 79 characters)
Line 779:80: E501 line too long (84 > 79 characters)
Line 780:80: E501 line too long (85 > 79 characters)
Line 781:80: E501 line too long (83 > 79 characters)

Comment last updated at 2024-10-25 11:17:29 UTC

github-actions · 2024-09-20T21:59:32Z

Linter Bot Results:

Hi @marinegor! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location	Outcome
main package	⚠️ Possible failure
testsuite	⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11148966346/job/30986736623

Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

richardjgowers

Looks good so far, will require a small test file to check reader/parser halves.

richardjgowers · 2024-09-21T15:10:13Z

package/MDAnalysis/coordinates/MMCIF.py

+        pass
+
+
+class MMCIFWriter(base.WriterBase):


I wouldn't include this at this stage, Writer is optional

richardjgowers · 2024-09-21T15:10:31Z

package/MDAnalysis/topology/MMCIFParser.py

+from .base import TopologyReaderBase
+
+
+def _into_idx(arr: list[int]) -> list[int]:


Document what this does, ideally with an example

richardjgowers · 2024-09-21T15:11:33Z

package/MDAnalysis/topology/MMCIFParser.py

+            record_types,  # res.het_flag
+            tempfactors,  # at.b_iso
+            residx,  # _into_idx(res.seqid.num)
+        ) = map(  # this is probably not pretty, but it's efficient -- one loop over the mmcif


Are all the fields here guaranteed in a valid pdbx? One benefit to working column by column is that you can do optional columns

Do you have an example of a PDBx in mind, or like a test set for them? I've never actually worked with the format, since in RCSB afaik we have only pdb or mmcif

PDBx is mmcif. The download links here will give you an example file: https://www.rcsb.org/structure/4ake we use 4ake elsewhere in the testsuite. In my experience, sometimes the PDB / mmcif versions of the same entry aren't completely identical, so I wouldn't worry about trying to align the PDB & PDBx tests.

richardjgowers · 2024-09-21T15:12:17Z

package/MDAnalysis/topology/MMCIFParser.py

+            np.array,
+            list(
+                zip(
+                    *[


I'm struggling to follow the logic here, a comment breaking down what this double nested loop iteration into a zip is doing would be nice

richardjgowers · 2024-09-21T15:12:54Z

package/pyproject.toml

@@ -78,6 +78,7 @@ extra_formats = [
    "pytng>=0.2.3",
    "gsd>3.0.0",
    "rdkit>=2020.03.1",
+    "gemmi", # for mmcif format


This will probably be optional, so other imports will have to respect that too

marinegor · 2025-02-22T14:29:51Z

and my bad for the messed up commit history, I certainly have to not try switching git tools in the middle of a PR🥲

marinegor · 2025-02-23T13:02:01Z

Thanks to @chmnk's contribution expressed as a test file, I've added multimodel structure warnings, and now I think codecov must be happy as well.

marinegor · 2025-02-24T11:09:56Z

all checks done, tests added -- pinging @yuxuanzhuang and @IAlibay (and also @richardjgowers could you please unassign yourself if you don't have the bandwidth?)

yuxuanzhuang · 2025-03-03T18:40:00Z

I ran into a strange bug and don’t have time to investigate it right now. Could you take a look and see if you can fix it?

import MDAnalysis as mda
from MDAnalysisTests.datafiles import MMCIF as MMCIF_FOLDER

u = mda.Universe(f"{MMCIF_FOLDER}/1YJP.cif")

print(u.atoms.resids)
> [ 1  1  1  1  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  4  4  4  4
  4  4  4  4  4  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  7  7
  7  7  7  7  7  7  7  7  7  7  7  8  9 10 11 12 13 14]

# weird weird
print(u.select_atoms('resid 1'))
> <AtomGroup []>

# this should print both resid 1 and 2.
print(u.select_atoms('resid 1-2'))
<AtomGroup [<Atom 1: N of type N of resname GLY, resid 1 and segid A and altLoc A>, <Atom 2: CA of type CA of resname GLY, resid 1 and segid A and altLoc A>, <Atom 3: C of type C of resname GLY, resid 1 and segid A and altLoc A>, <Atom 4: O of type O of resname GLY, resid 1 and segid A and altLoc A>]>

yuxuanzhuang · 2025-03-18T07:30:17Z

The issue can be traced back to how selections handle insertion codes (icodes)

mdanalysis/package/MDAnalysis/core/selection.py

Line 833 in dcaa087

def _sel_with_icodes(self, vals, icodes):

In mmCIF, icodes are read and stored as a list of ' ' (with space) if empty, but during selection, they are compared to '' (empty strings).

A potential fix would be to strip all icodes during MMCIFParsing:

icodes = [icode.strip() for icode in icodes]

yuxuanzhuang · 2025-03-18T07:36:30Z

package/MDAnalysis/coordinates/MMCIF.py

+            ]
+        )
+        if len(structure) > 1:
+            warnings.warn(  # FIXME: add tests for this


Wait, you do have multimodel_warning.cif tested---I guess codecov is outdated?

ljwoods2 · 2025-04-03T03:47:46Z

I found a strange edge case that makes the parser do weird things even when using @yuxuanzhuang 's fix above:

The MMCIFParser uses residue.segid.num from Gemmi as residue number and relies on it uniquely identify different residues in the topology (via _into_index). However, seqid.num does not uniquely identify a residue- Gemmi is using _atom_site.auth_seq_id from the cif file to create seqid.num, but isn't using this same cif file attribute to identify unique residues- there are cases in the PDB where there are two residues (that are different amino acids and have their own CAs) that share the same _atom_site.auth_seq_id.

This leads to cases where multiple different residues in the Gemmi struct have the same seqid.num, which leads to strange behavior in MDAnalysis since it throws off the indexing.

For example, from 1BD2:

ATOM   3481 N N   . ASP D 4 54  ? -11.247 37.547 33.563  1.00 50.85  ? 54  ASP D N   1 
ATOM   3482 C CA  . ASP D 4 54  ? -12.205 36.560 33.111  1.00 52.70  ? 54  ASP D CA  1 
ATOM   3483 C C   . ASP D 4 54  ? -12.947 37.037 31.868  1.00 53.84  ? 54  ASP D C   1 
ATOM   3484 O O   . ASP D 4 54  ? -13.073 36.300 30.893  1.00 56.20  ? 54  ASP D O   1 
ATOM   3485 C CB  . ASP D 4 54  ? -13.172 36.206 34.233  1.00 61.51  ? 54  ASP D CB  1 
ATOM   3486 C CG  . ASP D 4 54  ? -12.569 35.228 35.230  1.00 70.73  ? 54  ASP D CG  1 
ATOM   3487 O OD1 . ASP D 4 54  ? -12.278 34.080 34.834  1.00 74.85  ? 54  ASP D OD1 1 
ATOM   3488 O OD2 . ASP D 4 54  ? -12.380 35.596 36.412  1.00 77.11  ? 54  ASP D OD2 1 
ATOM   3489 N N   . LYS D 4 55  A -13.434 38.269 31.897  1.00 48.71  ? 54  LYS D N   1 
ATOM   3490 C CA  . LYS D 4 55  A -14.139 38.813 30.752  1.00 44.59  ? 54  LYS D CA  1 
ATOM   3491 C C   . LYS D 4 55  A -13.515 40.149 30.351  1.00 46.70  ? 54  LYS D C   1 
ATOM   3492 O O   . LYS D 4 55  A -12.800 40.767 31.140  1.00 47.77  ? 54  LYS D O   1 
ATOM   3493 C CB  . LYS D 4 55  A -15.613 38.992 31.079  1.00 41.68  ? 54  LYS D CB  1

These two residues have a different _atom_site.label_seq_id but an identical _atom_site.auth_seq_id (54-55 vs 54).

As a result, for this PDB entry's cif file, when this line is called in get_Atomattrs, there are an assumed 853 unique residues in the topology.

residx = np.array(_into_idx(residx))

len(np.unique(residx))
> 853

However, the corresponding arrays in get_Residueattrs are length 854 since this is the true number of unique residues that Gemmi (correctly) identified despite the two residues above sharing the same auth_seq_id:

len(resids)
> 854

This creates an indexing issue in the atom->residue mapping for all atoms after the atoms in the two residues shown above. The same issue does not exist in the equivalent legacy PDB file likely because the residues are uniquely identified as 54 and 54A (although I haven't read the PDB parser so this is a bit of a guess)

ATOM   3484  N   ASP D  54     -11.247  37.547  33.563  1.00 50.85           N  
ATOM   3485  CA  ASP D  54     -12.205  36.560  33.111  1.00 52.70           C  
ATOM   3486  C   ASP D  54     -12.947  37.037  31.868  1.00 53.84           C  
ATOM   3487  O   ASP D  54     -13.073  36.300  30.893  1.00 56.20           O  
ATOM   3488  CB  ASP D  54     -13.172  36.206  34.233  1.00 61.51           C  
ATOM   3489  CG  ASP D  54     -12.569  35.228  35.230  1.00 70.73           C  
ATOM   3490  OD1 ASP D  54     -12.278  34.080  34.834  1.00 74.85           O  
ATOM   3491  OD2 ASP D  54     -12.380  35.596  36.412  1.00 77.11           O  
ATOM   3492  N   LYS D  54A    -13.434  38.269  31.897  1.00 48.71           N  
ATOM   3493  CA  LYS D  54A    -14.139  38.813  30.752  1.00 44.59           C  
ATOM   3494  C   LYS D  54A    -13.515  40.149  30.351  1.00 46.70           C  
ATOM   3495  O   LYS D  54A    -12.800  40.767  31.140  1.00 47.77           O  
ATOM   3496  CB  LYS D  54A    -15.613  38.992  31.079  1.00 41.68           C

To demonstrate the issue:

u = mda.Universe(
    "1bd2.pdb"
)

tmp = u.select_atoms("segid D and name CA")

tmp

> <AtomGroup with 190 atoms>

But with cif:

u_broken = mda.Universe(
    "1bd2.cif"
)

tmp_brk = u_broken.select_atoms("segid D and name CA")

tmp_brk
> <AtomGroup with 191 atoms>

tmp_brk[-1]
# this is nonsense, doesn't exist in the topology
> <Atom 4518: CA of type CA of resname HOH, resid 223 and segid D and altLoc A>

The actual atom 4518 from the topology:

ATOM   4518 C CA  . THR E 5 7   ? 4.371   42.029 -3.618  1.00 31.35  ? 7   THR E CA  1

I found a few more such examples of PDB entries with cif files that I'm fairly certain are producing nonsense selections for the same reasion:

Gemmi has some limited notes on auth vs label but otherwise I'm not quite sure if this is a PDB error or a parsing error that needs to be fixed- if it is a parsing error, either residue.label_seq (docs) should be used as unique identifiers for residues (in which case some unlabeled atoms will have a label_seq of None and have to be dealt with) or a different unique identifier will have to be constructed (maybe just by appending "A" like has been done in the PDB example)- these formats are outside of my expertise so I can't say.

ljwoods2 · 2025-04-03T19:58:26Z

More concise examples:

u = mda.Universe(
    "1bd2.pdb"
)

tmp = u.select_atoms(f"resid 54 and segid D")
tmp.select_atoms("name CA")
> <AtomGroup with 1 atom>

tmp = u.select_atoms(f"resid 54A and segid D")

tmp.select_atoms("name CA")
> <AtomGroup with 1 atom>

u_broken = mda.Universe(
    "1bd2.cif"
)

tmp_brk = u_broken.select_atoms(f"resid 54 and segid D")

tmp_brk.select_atoms("name CA")
> <AtomGroup with 2 atoms>

Before the broken res (54), the atoms we get from selections actually exist in the CIF file (although the repr of the atom seems to be broken, using atom.ix + 1 yields incorrect atom num in CIF files where residues are not contiguous in the file, though this is maybe a separate issue and maybe inconsequential)

atm = u_broken.select_atoms(f"resid 53 and segid D")[0]
print(atm)
print(atm.id)

> <Atom 3490: N of type N of resname LYS, resid 53 and segid D and altLoc A>
> 3476

From CIF:

ATOM   3476 N N   . LYS D 4 53  ? -7.647  38.061 33.871  1.00 48.96  ? 53  LYS D N   1

After the broken res in the CIF, we get nonsense that doesn't exist in the cif:

atm_brk = u_broken.select_atoms(f"resid 55 and segid D")[0]
print(atm_brk)
print(atm_brk.id)
> <Atom 3516: N of type N of resname ASN, resid 55 and segid D and altLoc A>
> 3502

From CIF:

ATOM   3502 N N   . ALA D 4 57  ? -14.412 43.549 27.357  1.00 51.40  ? 56  ALA D N   1

marinegor added 13 commits May 22, 2024 20:04

Start working on MMCIF parser

aa2a88f

Add first (not working) version of MMCIFReader and MMCIF topology parser

218cf43

Do some squashing

7f78e02

Remove inherited docs

6682d6e

Try improving the parsing

817f3a0

Try three independent loops over the model

3cc8c80

Merge remote-tracking branch 'upstream/develop' into feature/mmcif

f1bf325

Add gemmi dependency

d21c220

necessary params

2a1be15

finished sorting atom attrs

77645e6

add function for transformation into *idx

91e6942

oh damn seems to finally be working

9a0c086

remove TODOs

9c731df

Remove debug prints

8b40ec7

richardjgowers reviewed Sep 21, 2024

View reviewed changes

marinegor added 13 commits September 23, 2024 00:31

Merge branch 'develop' into feature/mmcif

bdcbd73

try to pack things into separate class in utils?

401a4d3

remove unnecessary functions

9c336bd

copy all loops into separate functions

def88e4

Move loops over structures into functions

cabfd37

Move coordinate fetching into function for the coordinate reader as well

4c9d930

Fix imports

184491a

Start adding documentation

3de8565

Reference MMCIFParser in PDBParser

ca6ebbb

Add documentation for trajectory and topology parsers

45077ad

Add mmcif tests

9a1a59a

Update format specifications

27c10d6

Write simple tests

950cfcf

marinegor and others added 13 commits February 22, 2025 15:03

expand documentation and type annotations

4d79205

add invalid cif and MMCIF rst files

32d7cf9

add mmcif with invalid atom type

0df8c3a

add biopython cif and fix invalid cif formatting

05c6ea1

remove weird docs part

88dab79

fix fstring

a82fe52

replace version to 2.9.0

e3f1714

fix actions

db46016

fix datafiles

32cd103

add mmcif to all

805089e

add mmcif to coordinates and topology modules

55c3dbb

update docs following yuxuanzhuang comments

cd201d0

merge remote

81f0b5b

marinegor added 4 commits February 22, 2025 15:36

add linked issues and prs to changelog

22d1cca

remove mmcif files from black ignore

d1ba434

add tests for multimodel file warnings

a03b56f

add tests for cryst1 warnings

bd4c255

marinegor added 3 commits February 23, 2025 14:06

black

3d61dc5

add invalid cif file itself

aed9b54

format datafiles with black

53c51f4

marinegor requested a review from yuxuanzhuang February 24, 2025 11:08

marinegor mentioned this pull request Mar 15, 2025

Add force read chain to segment when reading a PDB file / set groups by ids #4948

Closed

yuxuanzhuang reviewed Mar 18, 2025

View reviewed changes

		from .base import TopologyReaderBase


		def _into_idx(arr: list[int]) -> list[int]:

Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

Are you sure you want to change the base?

Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

Uh oh!

Conversation

marinegor commented Sep 20, 2024 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Checklist

Developers certificate of origin

Uh oh!

pep8speaks commented Sep 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2024-10-25 11:17:29 UTC

Uh oh!

github-actions bot commented Sep 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Linter Bot Results:

Uh oh!

richardjgowers left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marinegor commented Feb 22, 2025

Uh oh!

marinegor commented Feb 23, 2025

Uh oh!

marinegor commented Feb 24, 2025

Uh oh!

yuxuanzhuang commented Mar 3, 2025

Uh oh!

yuxuanzhuang commented Mar 18, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ljwoods2 commented Apr 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ljwoods2 commented Apr 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Implementing `gemmi`-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

Implementing `gemmi`-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

marinegor commented Sep 20, 2024 •

edited by github-actions bot

Loading

pep8speaks commented Sep 20, 2024 •

edited

Loading

github-actions bot commented Sep 20, 2024 •

edited

Loading

ljwoods2 commented Apr 3, 2025 •

edited

Loading

ljwoods2 commented Apr 3, 2025 •

edited

Loading