Prioritising the chain ID for segment id #2874

xiki-tempula · 2020-07-27T17:09:26Z

Expected behavior

For a PDB with the following line
'ATOM 3516 N THR B 1 23.222 35.476 73.074 0.00 0.00 A N'
Since location 22 is B, the segment id of this residue should be B.
However, when this PDB is loaded to the PDB reader the segment id is recognized as A.

>>> u.segments[0]
<Segment A >

I was checking the PDB standard http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
and I think the character identifying the chain id should be 22 Character chainID Chain identifier..

Should we prioritize the chain ID showing in location 22 instead of the string in other locations?

Current version of MDAnalysis

Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 1.0.0
Which version of Python (python -V)? 3.8
Which operating system?

The text was updated successfully, but these errors were encountered:

lilyminium · 2020-07-28T03:14:58Z

CHARMM treats segments and chain IDs separately, with the former being a 4 letter ID starting at 73; the current PDBParser derives from that. I don't think columns 67-76 are actually defined in the PDB spec. However, if the segments column is empty in a file, PDBParser does assign segments from chain IDs.

orbeckst · 2020-07-28T09:12:23Z

Segments aren't chains. If someone added segids in the non-standard field then we assume that they did it with a purpose.

I think what's really missing is a way to select for chains independently from segid. If we don't have an issue for a "chain selection" then we should open one (I think I recently had a discussion with @ianmkenney about this issue...). In fact, why not have any TopologyAttr automatically register itself with the selection parser and add its own keyword?

orbeckst · 2020-07-28T09:13:50Z

I am leaning towards closing this issue with a wontfix. Are there different opinions?

xiki-tempula · 2020-07-28T09:15:40Z

I think maybe issuing a warning when there is conflicting information with regard to chainid/segid might be a good solution?

lilyminium · 2020-07-28T13:11:56Z

In fact, why not have any TopologyAttr automatically register itself with the selection parser and add its own keyword?

Yeah it wouldn't be too hard I think. Each TopologyAttr would need to declare a type to figure out which base class to subclass off (depending on float, int, str). They would need to get initialised before the attributes that require special behaviour such as ResidSelection, though. Updating the same TopologyAttr as selection would even easier, just update the keyword dictionary; that could be a temporary solution.

@xiki-tempula do people generally expect chainid and segid to be the same thing? If anything I would expect a warning when chainids are used for segids.

xiki-tempula · 2020-07-28T13:22:19Z

@lilyminium I think there is currently no way of selecting atoms based on chain id, so I usually use segid to select chain.

mieczyslaw · 2020-07-30T20:29:49Z

CHARMM treats segments and chain IDs separately, with the former being a 4 letter ID starting at 73; the current PDBParser derives from that. I don't think columns 67-76 are actually defined in the PDB spec. However, if the segments column is empty in a file, PDBParser does assign segments from chain IDs.

It seems to be not so simple when there is no segID to accept column 22 - if one chain has chainID set in column 22, and another has segID set, MDA segid list will contain only the chain set as segid, it won't add chains with chainID set in column 22. Maybe a bit unusual mix of positions, but good to repair this (or raise an exception that the position needs to be consistent).

lilyminium · 2020-07-30T21:29:46Z

if one chain has chainID set in column 22, and another has segID set, MDA segid list will contain only the chain set as segid, it won't add chains with chainID set in column 22

If by this you mean that when both chainIDs and segIDs are present, MDAnalysis will not incorporate chainIDs in the segment names: that’s because it presumes you are using them to denote different information and will only use the segIDs for the segments and chainIDs for the chains.

If you meant that the chains are not added when segment IDs are not present, that was fixed in #2390 (issue #2389). This would have been in 1.0.0 so that behaviour should go away if you have that version installed.

mieczyslaw · 2020-07-30T22:40:41Z

if one chain has chainID set in column 22, and another has segID set, MDA segid list will contain only the chain set as segid, it won't add chains with chainID set in column 22

If by this you mean that when both chainIDs and segIDs are present, MDAnalysis will not incorporate chainIDs in the segment names: that’s because it presumes you are using them to denote different information and will only use the segIDs for the segments and chainIDs for the chains.

If you meant that the chains are not added when segment IDs are not present, that was fixed in #2390 (issue #2389). This would have been in 1.0.0 so that behaviour should go away if you have that version installed.

Neither of that, for the first chain fill in column 22 with e.g. A and leave segid columns empty, for chain B leave column 22 empty and set B in segid columns. When you read it in into MDA, segids array will only contain 'B' segment. Just discovered that by chance as I copied data from two different sources (one containing only chain id in column 22, another only segid and column 22 empty). This happens in the newest 1.0.0.

lilyminium · 2020-07-30T23:21:21Z

Does the segids array does not contain an empty segment with segid=''? That would indeed be a problem. However, I am not getting that behaviour on my own toy PDB:

REMARK ADENYLATE KINASE IN OPEN STATE (4AKE)
REMARK DATE:     6/ 6/ 8     14:36:14      CREATED BY USER: denniej0
REMARK 
CRYST1   80.017   80.017   80.017  60.00  60.00  90.00 P 1           1
ATOM      1 N    MET A   1     -11.921  26.307  10.410  1.00 38.38      
ATOM      2 HT1  MET     1     -11.447  26.741   9.595  1.00  0.00      4AKE
ATOM      3 HT2  MET     1     -12.440  27.042  10.926  1.00  0.00      4AKE

>>> u = mda.Universe("adk_open.pdb")
>>> u.segments
<SegmentGroup with 2 segments>
>>> u.segments.segids
array(['', '4AKE'], dtype=object)
>>> u.atoms.chainIDs
array(['A', '', ''], dtype=object)

As @orbeckst noted above, segments aren't chains, nor vice versa. Column 22 is the chain column; columns 73-77 are the segment column. The segids array should contain an array of '' and 'B' in yoru file only, because those are the only segments in the PDB. The chainIDs array should contain an array of 'A' and '' only, because those are the only chains denoted.

orbeckst · 2020-07-30T23:29:10Z

Just to add other related issues that came up over the years

orbeckst · 2020-07-30T23:30:30Z

See #2875 (comment) for chain vs segment.

mieczyslaw · 2020-07-31T14:03:00Z

As @orbeckst noted above, segments aren't chains, nor vice versa. Column 22 is the chain column; columns 73-77 are the segment column. The segids array should contain an array of '' and 'B' in yoru file only, because those are the only segments in the PDB. The chainIDs array should contain an array of 'A' and '' only, because those are the only chains denoted.

@lilyminium you're right! in this mixed case, segids and chainids contain empty items when another is present; sorry for the confusion and thanks for your help!

RMeli · 2021-08-12T20:52:16Z

I recently had problems with this as well. I agree with @xiki-tempula that chainID should be prioritised over segid, at least internally, since the latter is not defined in the standard.

Expected behavior

I expect the following one-residue PDB files to contain the same number of residues (one):

res1_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00  0.00           C
ATOM    664  CG2 THR A 315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR A 315      21.172  14.548   1.274  1.00  0.00           O
"""

res2_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00 15.13         A C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00 15.69         A C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00 16.22         A C
ATOM    664  CG2 THR A 315      22.874  15.310   1.747  1.00 17.32         A C
ATOM    665  OG1 THR A 315      21.047  13.922   1.304  1.00 15.14         A O
"""

The only differences between the two residues should be the values of some coordinates and of some b-factors.

Non-standard fields (according to the PDB file format) should make no difference.

Actual behavior

I think the first residue is split according to segids so that it is actually considered as four different residues:

<ResidueGroup [<Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>]>

This is, to me, quite unexpected (and possibly wrong, according to the standard?).

Code to reproduce the behavior

from io import StringIO

import MDAnalysis as mda

print(mda.__version__)

res1_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00  0.00           C
ATOM    664  CG2 THR A 315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR A 315      21.172  14.548   1.274  1.00  0.00           O
"""

res2_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00 15.13         A C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00 15.69         A C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00 16.22         A C
ATOM    664  CG2 THR A 315      22.874  15.310   1.747  1.00 17.32         A C
ATOM    665  OG1 THR A 315      21.047  13.922   1.304  1.00 15.14         A O
"""

res1 = mda.Universe(StringIO(res1_str), format='PDB')
res2 = mda.Universe(StringIO(res2_str), format='PDB')

print(res1.residues)
print(res2.residues)
assert len(res1.residues) == len(res2.residues)

2.0.0-dev0
<ResidueGroup [<Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>]>
<ResidueGroup [<Residue THR, 315>]>
Traceback (most recent call last):
  File "test.py", line 32, in <module>
    assert len(res1.residues) == len(res2.residues)
AssertionError

orbeckst · 2025-03-13T23:26:45Z

As an FYI (see #947 (comment) ): SEGID in 73-77 was defined in the ATOM record in the PDB deposition guide, so it's not quite true that SEGID is "non-standard". (Unfortunately, the page that I originally linked in the comment does not seem to exist anymore.)

The issue remains that we haven't really made any decisions on how to treat segid vs chain, except that in practice we did not change the behavior that we will only use SEGID to label segments (or otherwise create our own). This may lead to wrong behavior as shown in #2874 (comment) above.

yuyuan871111 · 2025-03-14T11:56:20Z

solution for the bugs in #4965 (under discussion in #4948)

But, a test to show it works for prioritizing the chain ID for segment id.

Codes:

from io import StringIO

import MDAnalysis as mda

print(mda.__version__)

res1_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00  0.00           C
ATOM    664  CG2 THR A 315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR A 315      21.172  14.548   1.274  1.00  0.00           O
"""

res2_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00 15.13         A C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00 15.69         A C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00 16.22         A C
ATOM    664  CG2 THR A 315      22.874  15.310   1.747  1.00 17.32         A C
ATOM    665  OG1 THR A 315      21.047  13.922   1.304  1.00 15.14         A O
"""

res1 = mda.Universe(StringIO(res1_str), format='PDB', force_chainids_to_segids=True)
res2 = mda.Universe(StringIO(res2_str), format='PDB', force_chainids_to_segids=True)

print(res1.residues)
print(res2.residues)
assert len(res1.residues) == len(res2.residues)

Outputs:

2.10.0-dev0
<ResidueGroup [<Residue THR, 315>]>
<ResidueGroup [<Residue THR, 315>]>

…o segid / Universe: set groups) (#4965) * Fixes #4948 #2874 * Universe: add a function in the universe set_groups to update the topology and groups in the universe when atomwise resids/segids is given (more generalized method across different file type) * PDBParser: add an argument force_chainids_to_segids in PDBReader to forcibly reading the chain ID into the segment ID (PDB specific method) * update docs * add tests * update CHANGELOG --------- Co-authored-by: Cédric Bouysset <bouysset.cedric@gmail.com> Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>

orbeckst · 2025-04-16T16:07:10Z

Closed by PR #4965

orbeckst added the Format-PDB label Jul 28, 2020

orbeckst added the Component-Topology label Jul 28, 2020

xiki-tempula mentioned this issue Jul 28, 2020

Allow select_atoms to select chain #2875

Closed

yuyuan871111 mentioned this issue Mar 11, 2025

Add force read chain to segment when reading a PDB file / set groups by ids #4948

Closed

yuyuan871111 mentioned this issue Mar 14, 2025

Update ResidueGroup/SegmentGroup with ids (PDBParser: force chainid to segid / Universe: set groups) #4965

Merged

5 tasks

orbeckst linked a pull request Apr 16, 2025 that will close this issue

Update ResidueGroup/SegmentGroup with ids (PDBParser: force chainid to segid / Universe: set groups) #4965

Merged

5 tasks

orbeckst closed this as completed Apr 16, 2025

Prioritising the chain ID for segment id #2874

Prioritising the chain ID for segment id #2874

Comments

xiki-tempula commented Jul 27, 2020

Expected behavior

Current version of MDAnalysis

lilyminium commented Jul 28, 2020

Uh oh!

orbeckst commented Jul 28, 2020

Uh oh!

orbeckst commented Jul 28, 2020

Uh oh!

xiki-tempula commented Jul 28, 2020

Uh oh!

lilyminium commented Jul 28, 2020

Uh oh!

xiki-tempula commented Jul 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mieczyslaw commented Jul 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lilyminium commented Jul 30, 2020

Uh oh!

mieczyslaw commented Jul 30, 2020

Uh oh!

lilyminium commented Jul 30, 2020

Uh oh!

orbeckst commented Jul 30, 2020

Uh oh!

orbeckst commented Jul 30, 2020

Uh oh!

mieczyslaw commented Jul 31, 2020

Uh oh!

RMeli commented Aug 12, 2021

Expected behavior

Actual behavior

Code to reproduce the behavior

Uh oh!

orbeckst commented Mar 13, 2025

Uh oh!

yuyuan871111 commented Mar 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

orbeckst commented Apr 16, 2025

Uh oh!

xiki-tempula commented Jul 28, 2020 •

edited

Loading

mieczyslaw commented Jul 30, 2020 •

edited

Loading

yuyuan871111 commented Mar 14, 2025 •

edited

Loading