Skip to content

Prioritising the chain ID for segment id #2874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
xiki-tempula opened this issue Jul 27, 2020 · 17 comments · Fixed by #4965
Closed

Prioritising the chain ID for segment id #2874

xiki-tempula opened this issue Jul 27, 2020 · 17 comments · Fixed by #4965

Comments

@xiki-tempula
Copy link
Contributor

Expected behavior

For a PDB with the following line
'ATOM 3516 N THR B 1 23.222 35.476 73.074 0.00 0.00 A N'
Since location 22 is B, the segment id of this residue should be B.
However, when this PDB is loaded to the PDB reader the segment id is recognized as A.

>>> u.segments[0]
<Segment A >

I was checking the PDB standard http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
and I think the character identifying the chain id should be 22 Character chainID Chain identifier..

Should we prioritize the chain ID showing in location 22 instead of the string in other locations?

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 1.0.0
  • Which version of Python (python -V)? 3.8
  • Which operating system?
@lilyminium
Copy link
Member

CHARMM treats segments and chain IDs separately, with the former being a 4 letter ID starting at 73; the current PDBParser derives from that. I don't think columns 67-76 are actually defined in the PDB spec. However, if the segments column is empty in a file, PDBParser does assign segments from chain IDs.

@orbeckst
Copy link
Member

Segments aren't chains. If someone added segids in the non-standard field then we assume that they did it with a purpose.

I think what's really missing is a way to select for chains independently from segid. If we don't have an issue for a "chain selection" then we should open one (I think I recently had a discussion with @ianmkenney about this issue...). In fact, why not have any TopologyAttr automatically register itself with the selection parser and add its own keyword?

@orbeckst
Copy link
Member

I am leaning towards closing this issue with a wontfix. Are there different opinions?

@xiki-tempula
Copy link
Contributor Author

I think maybe issuing a warning when there is conflicting information with regard to chainid/segid might be a good solution?

@lilyminium
Copy link
Member

In fact, why not have any TopologyAttr automatically register itself with the selection parser and add its own keyword?

Yeah it wouldn't be too hard I think. Each TopologyAttr would need to declare a type to figure out which base class to subclass off (depending on float, int, str). They would need to get initialised before the attributes that require special behaviour such as ResidSelection, though. Updating the same TopologyAttr as selection would even easier, just update the keyword dictionary; that could be a temporary solution.

@xiki-tempula do people generally expect chainid and segid to be the same thing? If anything I would expect a warning when chainids are used for segids.

@xiki-tempula
Copy link
Contributor Author

xiki-tempula commented Jul 28, 2020

@lilyminium I think there is currently no way of selecting atoms based on chain id, so I usually use segid to select chain.

@mieczyslaw
Copy link
Contributor

mieczyslaw commented Jul 30, 2020

CHARMM treats segments and chain IDs separately, with the former being a 4 letter ID starting at 73; the current PDBParser derives from that. I don't think columns 67-76 are actually defined in the PDB spec. However, if the segments column is empty in a file, PDBParser does assign segments from chain IDs.

It seems to be not so simple when there is no segID to accept column 22 - if one chain has chainID set in column 22, and another has segID set, MDA segid list will contain only the chain set as segid, it won't add chains with chainID set in column 22. Maybe a bit unusual mix of positions, but good to repair this (or raise an exception that the position needs to be consistent).

@lilyminium
Copy link
Member

if one chain has chainID set in column 22, and another has segID set, MDA segid list will contain only the chain set as segid, it won't add chains with chainID set in column 22

If by this you mean that when both chainIDs and segIDs are present, MDAnalysis will not incorporate chainIDs in the segment names: that’s because it presumes you are using them to denote different information and will only use the segIDs for the segments and chainIDs for the chains.

If you meant that the chains are not added when segment IDs are not present, that was fixed in #2390 (issue #2389). This would have been in 1.0.0 so that behaviour should go away if you have that version installed.

@mieczyslaw
Copy link
Contributor

if one chain has chainID set in column 22, and another has segID set, MDA segid list will contain only the chain set as segid, it won't add chains with chainID set in column 22

If by this you mean that when both chainIDs and segIDs are present, MDAnalysis will not incorporate chainIDs in the segment names: that’s because it presumes you are using them to denote different information and will only use the segIDs for the segments and chainIDs for the chains.

If you meant that the chains are not added when segment IDs are not present, that was fixed in #2390 (issue #2389). This would have been in 1.0.0 so that behaviour should go away if you have that version installed.

Neither of that, for the first chain fill in column 22 with e.g. A and leave segid columns empty, for chain B leave column 22 empty and set B in segid columns. When you read it in into MDA, segids array will only contain 'B' segment. Just discovered that by chance as I copied data from two different sources (one containing only chain id in column 22, another only segid and column 22 empty). This happens in the newest 1.0.0.

@lilyminium
Copy link
Member

Does the segids array does not contain an empty segment with segid=''? That would indeed be a problem. However, I am not getting that behaviour on my own toy PDB:

REMARK ADENYLATE KINASE IN OPEN STATE (4AKE)
REMARK DATE:     6/ 6/ 8     14:36:14      CREATED BY USER: denniej0
REMARK 
CRYST1   80.017   80.017   80.017  60.00  60.00  90.00 P 1           1
ATOM      1 N    MET A   1     -11.921  26.307  10.410  1.00 38.38      
ATOM      2 HT1  MET     1     -11.447  26.741   9.595  1.00  0.00      4AKE
ATOM      3 HT2  MET     1     -12.440  27.042  10.926  1.00  0.00      4AKE
>>> u = mda.Universe("adk_open.pdb")
>>> u.segments
<SegmentGroup with 2 segments>
>>> u.segments.segids
array(['', '4AKE'], dtype=object)
>>> u.atoms.chainIDs
array(['A', '', ''], dtype=object)

As @orbeckst noted above, segments aren't chains, nor vice versa. Column 22 is the chain column; columns 73-77 are the segment column. The segids array should contain an array of '' and 'B' in yoru file only, because those are the only segments in the PDB. The chainIDs array should contain an array of 'A' and '' only, because those are the only chains denoted.

@orbeckst
Copy link
Member

Just to add other related issues that came up over the years

@orbeckst
Copy link
Member

See #2875 (comment) for chain vs segment.

@mieczyslaw
Copy link
Contributor

As @orbeckst noted above, segments aren't chains, nor vice versa. Column 22 is the chain column; columns 73-77 are the segment column. The segids array should contain an array of '' and 'B' in yoru file only, because those are the only segments in the PDB. The chainIDs array should contain an array of 'A' and '' only, because those are the only chains denoted.

@lilyminium you're right! in this mixed case, segids and chainids contain empty items when another is present; sorry for the confusion and thanks for your help!

@RMeli
Copy link
Member

RMeli commented Aug 12, 2021

I recently had problems with this as well. I agree with @xiki-tempula that chainID should be prioritised over segid, at least internally, since the latter is not defined in the standard.


Expected behavior

I expect the following one-residue PDB files to contain the same number of residues (one):

res1_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00  0.00           C
ATOM    664  CG2 THR A 315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR A 315      21.172  14.548   1.274  1.00  0.00           O
"""

res2_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00 15.13         A C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00 15.69         A C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00 16.22         A C
ATOM    664  CG2 THR A 315      22.874  15.310   1.747  1.00 17.32         A C
ATOM    665  OG1 THR A 315      21.047  13.922   1.304  1.00 15.14         A O
"""

The only differences between the two residues should be the values of some coordinates and of some b-factors.

Non-standard fields (according to the PDB file format) should make no difference.

Actual behavior

I think the first residue is split according to segids so that it is actually considered as four different residues:

<ResidueGroup [<Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>]>

This is, to me, quite unexpected (and possibly wrong, according to the standard?).

Code to reproduce the behavior

from io import StringIO

import MDAnalysis as mda

print(mda.__version__)

res1_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00  0.00           C
ATOM    664  CG2 THR A 315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR A 315      21.172  14.548   1.274  1.00  0.00           O
"""

res2_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00 15.13         A C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00 15.69         A C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00 16.22         A C
ATOM    664  CG2 THR A 315      22.874  15.310   1.747  1.00 17.32         A C
ATOM    665  OG1 THR A 315      21.047  13.922   1.304  1.00 15.14         A O
"""

res1 = mda.Universe(StringIO(res1_str), format='PDB')
res2 = mda.Universe(StringIO(res2_str), format='PDB')

print(res1.residues)
print(res2.residues)
assert len(res1.residues) == len(res2.residues)
2.0.0-dev0
<ResidueGroup [<Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>, <Residue THR, 315>]>
<ResidueGroup [<Residue THR, 315>]>
Traceback (most recent call last):
  File "test.py", line 32, in <module>
    assert len(res1.residues) == len(res2.residues)
AssertionError

@orbeckst
Copy link
Member

As an FYI (see #947 (comment) ): SEGID in 73-77 was defined in the ATOM record in the PDB deposition guide, so it's not quite true that SEGID is "non-standard". (Unfortunately, the page that I originally linked in the comment does not seem to exist anymore.)

The issue remains that we haven't really made any decisions on how to treat segid vs chain, except that in practice we did not change the behavior that we will only use SEGID to label segments (or otherwise create our own). This may lead to wrong behavior as shown in #2874 (comment) above.

@yuyuan871111
Copy link
Contributor

yuyuan871111 commented Mar 14, 2025

solution for the bugs in #4965 (under discussion in #4948)

But, a test to show it works for prioritizing the chain ID for segment id.

Codes:

from io import StringIO

import MDAnalysis as mda

print(mda.__version__)

res1_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00  0.00           C
ATOM    664  CG2 THR A 315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR A 315      21.172  14.548   1.274  1.00  0.00           O
"""

res2_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00 15.13         A C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00 15.69         A C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00 16.22         A C
ATOM    664  CG2 THR A 315      22.874  15.310   1.747  1.00 17.32         A C
ATOM    665  OG1 THR A 315      21.047  13.922   1.304  1.00 15.14         A O
"""

res1 = mda.Universe(StringIO(res1_str), format='PDB', force_chainids_to_segids=True)
res2 = mda.Universe(StringIO(res2_str), format='PDB', force_chainids_to_segids=True)

print(res1.residues)
print(res2.residues)
assert len(res1.residues) == len(res2.residues)

Outputs:

2.10.0-dev0
<ResidueGroup [<Residue THR, 315>]>
<ResidueGroup [<Residue THR, 315>]>

orbeckst pushed a commit that referenced this issue Apr 16, 2025
…o segid / Universe: set groups) (#4965)

* Fixes #4948 #2874
* Universe: add a function in the universe set_groups to update the topology and groups in the universe when atomwise resids/segids is given (more generalized method across different file type)
* PDBParser: add an argument force_chainids_to_segids in PDBReader to forcibly reading the chain ID into the segment ID (PDB specific method)
* update docs
* add tests
* update CHANGELOG

---------

Co-authored-by: Cédric Bouysset <bouysset.cedric@gmail.com>
Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
@orbeckst
Copy link
Member

Closed by PR #4965

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants