-
Notifications
You must be signed in to change notification settings - Fork 713
Prioritising the chain ID for segment id #2874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
CHARMM treats segments and chain IDs separately, with the former being a 4 letter ID starting at 73; the current PDBParser derives from that. I don't think columns 67-76 are actually defined in the PDB spec. However, if the segments column is empty in a file, PDBParser does assign segments from chain IDs. |
Segments aren't chains. If someone added segids in the non-standard field then we assume that they did it with a purpose. I think what's really missing is a way to select for chains independently from segid. If we don't have an issue for a "chain selection" then we should open one (I think I recently had a discussion with @ianmkenney about this issue...). In fact, why not have any TopologyAttr automatically register itself with the selection parser and add its own keyword? |
I am leaning towards closing this issue with a wontfix. Are there different opinions? |
I think maybe issuing a warning when there is conflicting information with regard to chainid/segid might be a good solution? |
Yeah it wouldn't be too hard I think. Each TopologyAttr would need to declare a type to figure out which base class to subclass off (depending on float, int, str). They would need to get initialised before the attributes that require special behaviour such as ResidSelection, though. Updating the @xiki-tempula do people generally expect chainid and segid to be the same thing? If anything I would expect a warning when chainids are used for segids. |
@lilyminium I think there is currently no way of selecting atoms based on chain id, so I usually use segid to select chain. |
It seems to be not so simple when there is no segID to accept column 22 - if one chain has chainID set in column 22, and another has segID set, MDA segid list will contain only the chain set as segid, it won't add chains with chainID set in column 22. Maybe a bit unusual mix of positions, but good to repair this (or raise an exception that the position needs to be consistent). |
If by this you mean that when both chainIDs and segIDs are present, MDAnalysis will not incorporate chainIDs in the segment names: that’s because it presumes you are using them to denote different information and will only use the segIDs for the segments and chainIDs for the chains. If you meant that the chains are not added when segment IDs are not present, that was fixed in #2390 (issue #2389). This would have been in 1.0.0 so that behaviour should go away if you have that version installed. |
Neither of that, for the first chain fill in column 22 with e.g. A and leave segid columns empty, for chain B leave column 22 empty and set B in segid columns. When you read it in into MDA, segids array will only contain 'B' segment. Just discovered that by chance as I copied data from two different sources (one containing only chain id in column 22, another only segid and column 22 empty). This happens in the newest 1.0.0. |
Does the segids array does not contain an empty segment with segid=''? That would indeed be a problem. However, I am not getting that behaviour on my own toy PDB:
>>> u = mda.Universe("adk_open.pdb")
>>> u.segments
<SegmentGroup with 2 segments>
>>> u.segments.segids
array(['', '4AKE'], dtype=object)
>>> u.atoms.chainIDs
array(['A', '', ''], dtype=object) As @orbeckst noted above, segments aren't chains, nor vice versa. Column 22 is the chain column; columns 73-77 are the segment column. The segids array should contain an array of '' and 'B' in yoru file only, because those are the only segments in the PDB. The chainIDs array should contain an array of 'A' and '' only, because those are the only chains denoted. |
Just to add other related issues that came up over the years |
See #2875 (comment) for chain vs segment. |
@lilyminium you're right! in this mixed case, segids and chainids contain empty items when another is present; sorry for the confusion and thanks for your help! |
I recently had problems with this as well. I agree with @xiki-tempula that Expected behaviorI expect the following one-residue PDB files to contain the same number of residues (one): res1_str = """\
ATOM 659 N THR A 315 22.716 15.055 -1.000 1.00 16.08 A N
ATOM 660 CA THR A 315 22.888 13.803 -0.302 1.00 0.00 C
ATOM 661 C THR A 315 22.006 12.700 -0.882 1.00 0.00 C
ATOM 662 O THR A 315 21.138 12.959 -1.727 1.00 16.25 A O
ATOM 663 CB THR A 315 22.481 13.956 1.182 1.00 0.00 C
ATOM 664 CG2 THR A 315 23.384 14.924 1.927 1.00 0.00 C
ATOM 665 OG1 THR A 315 21.172 14.548 1.274 1.00 0.00 O
"""
res2_str = """\
ATOM 659 N THR A 315 22.716 15.055 -1.000 1.00 16.08 A N
ATOM 660 CA THR A 315 22.888 13.803 -0.302 1.00 15.13 A C
ATOM 661 C THR A 315 22.006 12.700 -0.882 1.00 15.69 A C
ATOM 662 O THR A 315 21.138 12.959 -1.727 1.00 16.25 A O
ATOM 663 CB THR A 315 22.481 13.956 1.182 1.00 16.22 A C
ATOM 664 CG2 THR A 315 22.874 15.310 1.747 1.00 17.32 A C
ATOM 665 OG1 THR A 315 21.047 13.922 1.304 1.00 15.14 A O
""" The only differences between the two residues should be the values of some coordinates and of some b-factors. Non-standard fields (according to the PDB file format) should make no difference. Actual behaviorI think the first residue is split according to
This is, to me, quite unexpected (and possibly wrong, according to the standard?). Code to reproduce the behaviorfrom io import StringIO
import MDAnalysis as mda
print(mda.__version__)
res1_str = """\
ATOM 659 N THR A 315 22.716 15.055 -1.000 1.00 16.08 A N
ATOM 660 CA THR A 315 22.888 13.803 -0.302 1.00 0.00 C
ATOM 661 C THR A 315 22.006 12.700 -0.882 1.00 0.00 C
ATOM 662 O THR A 315 21.138 12.959 -1.727 1.00 16.25 A O
ATOM 663 CB THR A 315 22.481 13.956 1.182 1.00 0.00 C
ATOM 664 CG2 THR A 315 23.384 14.924 1.927 1.00 0.00 C
ATOM 665 OG1 THR A 315 21.172 14.548 1.274 1.00 0.00 O
"""
res2_str = """\
ATOM 659 N THR A 315 22.716 15.055 -1.000 1.00 16.08 A N
ATOM 660 CA THR A 315 22.888 13.803 -0.302 1.00 15.13 A C
ATOM 661 C THR A 315 22.006 12.700 -0.882 1.00 15.69 A C
ATOM 662 O THR A 315 21.138 12.959 -1.727 1.00 16.25 A O
ATOM 663 CB THR A 315 22.481 13.956 1.182 1.00 16.22 A C
ATOM 664 CG2 THR A 315 22.874 15.310 1.747 1.00 17.32 A C
ATOM 665 OG1 THR A 315 21.047 13.922 1.304 1.00 15.14 A O
"""
res1 = mda.Universe(StringIO(res1_str), format='PDB')
res2 = mda.Universe(StringIO(res2_str), format='PDB')
print(res1.residues)
print(res2.residues)
assert len(res1.residues) == len(res2.residues)
|
As an FYI (see #947 (comment) ): SEGID in 73-77 was defined in the ATOM record in the PDB deposition guide, so it's not quite true that SEGID is "non-standard". (Unfortunately, the page that I originally linked in the comment does not seem to exist anymore.) The issue remains that we haven't really made any decisions on how to treat segid vs chain, except that in practice we did not change the behavior that we will only use SEGID to label segments (or otherwise create our own). This may lead to wrong behavior as shown in #2874 (comment) above. |
solution for the bugs in #4965 (under discussion in #4948) But, a test to show it works for prioritizing the chain ID for segment id. Codes:
Outputs:
|
…o segid / Universe: set groups) (#4965) * Fixes #4948 #2874 * Universe: add a function in the universe set_groups to update the topology and groups in the universe when atomwise resids/segids is given (more generalized method across different file type) * PDBParser: add an argument force_chainids_to_segids in PDBReader to forcibly reading the chain ID into the segment ID (PDB specific method) * update docs * add tests * update CHANGELOG --------- Co-authored-by: Cédric Bouysset <bouysset.cedric@gmail.com> Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
Closed by PR #4965 |
Expected behavior
For a PDB with the following line
'ATOM 3516 N THR B 1 23.222 35.476 73.074 0.00 0.00 A N'
Since location 22 is B, the segment id of this residue should be B.
However, when this PDB is loaded to the PDB reader the segment id is recognized as A.
I was checking the PDB standard http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
and I think the character identifying the chain id should be
22 Character chainID Chain identifier.
.Should we prioritize the chain ID showing in location 22 instead of the string in other locations?
Current version of MDAnalysis
python -c "import MDAnalysis as mda; print(mda.__version__)"
) 1.0.0python -V
)? 3.8The text was updated successfully, but these errors were encountered: