-
Notifications
You must be signed in to change notification settings - Fork 713
Add force read chain to segment when reading a PDB file / set groups by ids #4948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
It looks to me that
Did you check that MDAnalysis reads your file when you fix the format first? |
Hi @orbeckst, Thanks for your reply. I agree with you that the I also checked with the PDB format ATOM record. Actually, the official document doesn't mention columns 67-76 in both ATOM and HETATM records (in default, should be blanks). A non-standard format might use columns 73-76 for segid but still not mention columns 67-72 (see #947 (comment), some other sources: link1, link2). Thus, how to process these columns in PDB files is kind of a grey area at the moment. However, the current MDAnalysis version 2.8.0 used columns 67-76 as segID (unofficial CHARMM extension ?). I think it is worth modifying the definition to columns 73-76 for segid in order to follow some possible and common PDB format. This will make MDAnalysis more robust and not to load weird information into the segid. I think it helps the design of MDAnalysis to read compatible PDB files correctly downloaded from various sources or generated by different software using this unofficial column. For chain ID and seg ID, you have mentioned the difference in #2875 (comment). I think that is why MDAnalysis keeps both chain ID and seg ID when reading a PDB. And now thankfully, we can use However, I still think adding a feature to forcibly read the chain to the segment (or said prioritise chain ID for seg ID) is worth it from the viewpoint of a MDAnalysis user, or at least generating a warning to let the users know which column is loaded into the segid. An MDAnalysis user usually expects to read PDB's My implementation is to enable the PDBParser.py accepting the kwargs. If an existing variable This implementation does not affect the default setting of |
I had forgotten how much discussion there had been on chainID/SEGID in the past... good that you looked at these issues. Interesting that Chimera labels the SEGID in PDB files as obsolete. The fact that inconsistent SEGID/chainID identifiers can split residues is a bit concerning #2874 (comment) . Given the discussions above (in particular #2874 ) it makes sense to give the user an option to use chainIDs as segids. |
@RMeli and @xiki-tempula , given your opinions in #2874 , do you have some input here? |
This is a difficult problem. I think the issue is if we having conflicting segid and chain id, what do we do about it. Is there a simple way of overwrite the chain id with seg id after the PDB is loaded? |
Hi @xiki-tempula, thanks for commenting on this.
I know the PDB thing is a little bit tricky, but a new user of MDAnalysis (or, as the case may be, a new biomolecular researcher) always tests their PDB files when they first use MDAnalysis (especially, it is classical to demonstrate to the students in related fields). We don't want to confuse them and would like to provide an easy solution to help them get what they want. However, I think there is no simple solution to prioritise the chain ID with the seg ID after the PDB is loaded (probably the same meaning as "without modifying PDB parser or Universe"). It is because the seg ID also affects the selection of the # MDAnalysis.core.universe
...
def _topology_from_file_like(topology_file, topology_format=None,
**kwargs):
parser = get_parser_for(topology_file, format=topology_format)
try:
with parser(topology_file) as p:
topology = p.parse(**kwargs) ### [where the PDBParser is called]
...
...
def _generate_from_topology(universe):
# generate Universe version of each class
# AG, RG, SG, A, R, S
universe._class_bases, universe._classes = groups.make_classes()
# Put Group level stuff from topology into class
for attr in universe._topology.attrs:
universe._process_attr(attr)
# Generate atoms, residues and segments.
# These are the first such groups generated for this universe, so
# there are no cached merged classes yet. Otherwise those could be
# used directly to get a (very) small speedup. (Only really pays off
# the readability loss if instantiating millions of AtomGroups at
# once.)
universe.atoms = AtomGroup(np.arange(universe._topology.n_atoms), universe)
universe.residues = ResidueGroup(
np.arange(universe._topology.n_residues), universe)
universe.segments = SegmentGroup(
np.arange(universe._topology.n_segments), universe)
...
class Universe(object):
...
def __init(self, ...):
...
if not isinstance(topology, Topology) and not topology is None:
self.filename = _check_file_like(topology)
topology = _topology_from_file_like(self.filename,
topology_format=topology_format,
**kwargs)
if topology is not None:
self._topology = topology
else:
# point to Universe.empty instead of making empty universe
raise TypeError('Topology argument required to make Universe. '
'Try Universe.empty(n_atoms, ...) to construct '
'your own Universe.')
_generate_from_topology(self) # make real atoms, res, segments
coordinates = _resolve_coordinates(self.filename, *coordinates,
format=format,
all_coordinates=all_coordinates)
... If not modify anything in PDBParser, what it could be to overwrite the seg id by chain id (version 2.9.0 or below)I tried to test by assigning new segments/residues with previous examples #2874 (comment). The user to overwrite the seg id by chain id (or the other way around) might consider the codes below: Codes: from io import StringIO
from MDAnalysis.core.groups import SegmentGroup, ResidueGroup
import numpy as np
import MDAnalysis as mda
# print(mda.__version__)
res1_str = """\
ATOM 659 N THR A 315 22.716 15.055 -1.000 1.00 16.08 A N
ATOM 660 CA THR A 315 22.888 13.803 -0.302 1.00 0.00 C
ATOM 661 C THR A 315 22.006 12.700 -0.882 1.00 0.00 C
ATOM 662 O THR A 315 21.138 12.959 -1.727 1.00 16.25 A O
ATOM 663 CB THR A 315 22.481 13.956 1.182 1.00 0.00 C
ATOM 664 CG2 THR A 315 23.384 14.924 1.927 1.00 0.00 C
ATOM 665 OG1 THR A 315 21.172 14.548 1.274 1.00 0.00 O
"""
res2_str = """\
ATOM 659 N THR A 315 22.716 15.055 -1.000 1.00 16.08 A N
ATOM 660 CA THR A 315 22.888 13.803 -0.302 1.00 15.13 A C
ATOM 661 C THR A 315 22.006 12.700 -0.882 1.00 15.69 A C
ATOM 662 O THR A 315 21.138 12.959 -1.727 1.00 16.25 A O
ATOM 663 CB THR A 315 22.481 13.956 1.182 1.00 16.22 A C
ATOM 664 CG2 THR A 315 22.874 15.310 1.747 1.00 17.32 A C
ATOM 665 OG1 THR A 315 21.047 13.922 1.304 1.00 15.14 A O
"""
res1 = mda.Universe(StringIO(res1_str), format='PDB')
res2 = mda.Universe(StringIO(res2_str), format='PDB')
print("test1")
print("res1:", res1.segments)
print("res2:", res2.segments)
print('-'*50)
## because the attibutes in u._topology (e.g. n_residues and n_segments) seems not to be modified,
## we need to modify the universe by hands
# modify the segment ID of res1 (atom by atom)
for id, chainID in zip(res1.atoms.ids, res1.atoms.chainIDs):
selection = res1.select_atoms(f"id {id}")
selection.segments.segids = chainID
print("test2")
print("res1:", res1.segments)
print("res2:", res2.segments)
print('-'*50)
# modify the segment group for res1
unique_segids = np.unique(res1.atoms.segments.segids)
res1.segments = SegmentGroup(np.arange(len(unique_segids)), res1)
# Note that, this will lead to a merge of the segment for records of ATOM and HETATM with same segid.
# But, it is fine in this example.
print("test3")
print("res1:", res1.segments)
print("res2:", res2.segments)
assert len(res1.segments) == len(res2.segments)
print('-'*50)
# modify the residue group for res1
n_residues = 0 # initialize the number of residues
for each_segment in res1.segments:
unique_resids = np.unique(each_segment.residues.resids)
n_residues += len(unique_resids)
res1.residues = ResidueGroup(np.arange(n_residues), res1)
print("test4")
print("res1:", res1.residues)
print("res2:", res2.residues)
assert len(res1.residues) == len(res2.residues)
print("test5")
print("res1:", res1.atoms)
print("res2:", res2.atoms)
assert len(res1.atoms) == len(res2.atoms) Expected outcomes:
|
I think the workaround is too complicated for average user. |
@lilyminium given your careful documentation of segid/chainID handling in the userguide, do you have an opinion on this issue here, namely should we provide an option to have the PDBParser set segids from CHAINID, and if yes, what should be taken into account? @marinegor based on your experience with PDB files in structural biology, do you have an opinion here (apart from "don't use PDB, use PDBx/mmcif" ;-))? |
@orbeckst that's pretty much it, to be honest. It's an outdated format that that, moreover, does not really convey the difficulties of working with structural ensembles (although even mmcif does not fully comply with what researchers want -- see https://doi.org/10.1107/S2052252524005098). My honest opinion(s):
|
Thanks for starting this discussion @yuyuan871111, like Oliver said there's been a lot of discussion around segid/chainid in the past. I'll be honest I've never really used PDB files with segments/segids, so I've personally always found chainID more useful. I also would prioritise chainID over segid as it is defined in the official spec, although that would be a big change in behaviour. No strong opinions either way on adding a In both alternative suggestions below they wouldn't be one-liners exactly, but more like a couple lines of code:
|
Exactly. I've been doing X-ray crystallography and cryo-EM for seven years, and came across SEGIDs exactly once. The usecase was, the structure encoded multiple conformations of different residues, and these conformations weren't independent, i.e. if residue 10 was in conformation A, residue 20 couldn't accommodate conformation B. And SEGID was used to assign the conformation -- all residues with SEGID=X were considered as one conformation, and all with SEGID=Y as the other. But as for |
Thanks @marinegor @lilyminium for the valuable suggestions, I will summarize these discussions later and finalize the codes. Before that, one thing I would like to confirm: From my previous comment #4948 (comment), the columns 67-72 in MDAnalysis PDBParser are not defined in the official PDB format document. These columns seem not to have any historical reason to be read as part of segid (in contrast, columns 73-76 are used by Chimera, at least). Additionally, Gemmi follows the columns 73-76 for segids in PDB format (PDB v2). From your experience, did you find anyone or any software read/use/write the information at columns 67-72? (If not, I think this is a definite thing needs to be fixed.)
@lilyminium , thanks for your inspiration of the codes:
The implementation above #4948 (comment) might not consider different chains with the same resids, etc. I think it is better to update topology when new customs segids is provided (or to guess). Thus, I have developed a method to merge "guessing" and "setting" in a function called However, I am not sure it would be better to move the new function as part of |
Thanks for the update in the PR @yuyuan871111, as mentioned there I'm replying here to centralize discussion. Re-reading the discussion I'm pretty open about and happy to ignore columns 67-72 if we can't find any other codes that use them. I think this won't result in backwards incompatibility with round-tripping since IIRC we don't use these columns when we write out (although I could be wrong there). IMO that's the simplest and best solution here and I think I overcomplicated this issue earlier by suggesting alternatives. I'll continue discussing the possible alternatives below, but to be clear I think it might be best to ignore these for now!
One core goal of API design IMO is to be as clear about what each function does by its name as possible. I'm not sure I immediately understand what's going in But as I said, probably overcomplicating things a bit. |
Actually, re-reading the discussion again there are two issues being discussed here that are similar but not the same:
So I suppose it's not overcomplicating to discuss how to do the second. IMO then |
FYI, BioPandas has already fixed this bug in BioPandas/biopandas#133 I do echo that it would be useful to handle these not-that-wrong PDB files more gracefully---at least have a warning. |
Hi @lilyminium, thanks for your comments. I updated the function and named it Thanks @yuxuanzhuang, I also added an warning in PDB parser when the raw chain ids and seg ids are different. If anyone could test the functions, that would be super helpful!!!! Hopefully, my PR will be merged soon. |
Summary of this issueBased on the discussion above, I would like to summarize here. Two issues are going to be solved, but they are highly related to the selection of segments. The solution is at PR #4965:
Examples: import MDAnalysis as mda
# Issue 2 (improvement 1): read PDB forcibly using chainID to replace segid
u = mda.Universe("your/pdb/file", force_chainids_to_segids=True)
# Issue 2 (improvement 2): read a PDB and set groups (residues/segments) to replace segid by chain ID
u = mda.Universe("your/pdb/file")
u.set_groups(atomwise_segids=u.atoms.chainIDs)
# Issue 2 (improvement 2): read a topology file (e.g. GRO) and set segments in the MDAnalysis
# (it is fine, even if there is no seg information in your topology file.)
u = mda.Universe("your/topo/file") # don't have to be PDB
atomwise_chainIDs = ["A", "A", "A", "A", "B", "B"]
u.set_groups(atomwise_segids=atomwise_chainIDs)
# Issue 2 (improvement 2): read a topology file (e.g. GRO) and set new residue IDs in the MDAnalysis
u = mda.Universe("your/topo/file") # don't have to be PDB
atomwise_resids = [1,1,1,1,2,2,2,2,3,3,3,3,3,3,3]
u.set_groups(atomwise_resids=atomwise_resids) |
Additionally, the import MDAnalysis as mda
from io import StringIO
no_chain_str = """\
ATOM 659 N THR 315 22.716 15.055 -1.000 1.00 16.08 A N
ATOM 660 CA THR 315 22.888 13.803 -0.302 1.00 0.00 C
ATOM 661 C THR 315 22.006 12.700 -0.882 1.00 0.00 C
ATOM 662 O THR 316 21.138 12.959 -1.727 1.00 16.25 A O
ATOM 663 CB THR 314 22.481 13.956 1.182 1.00 0.00 B C
ATOM 664 CG2 THR 315 23.384 14.924 1.927 1.00 0.00 C
ATOM 665 OG1 THR 315 21.172 14.548 1.274 1.00 0.00 O
"""
# read the file
no_chain = mda.Universe(StringIO(no_chain_str), format='PDB')
# now add a chainIDs attribute for your chains (if you don't have one already)
no_chain.add_TopologyAttr('chainIDs')
# set chain IDs for the atoms
no_chain.atoms.chainIDs = ['A', 'A', 'A', 'A', 'B', 'B', 'B']
# set empty segment ids
no_chain.set_groups(atomwise_segids = ["" for _ in range(no_chain.atoms.n_atoms)])
# now, there is only one segment with the segid ""
no_chain.segments # <SegmentGroup with 1 segment>
no_chain.segments.segids # array([''], dtype=object)
# write out the file
no_chain.atoms.write('no_chain.pdb') Output of
|
Even CHARMM PDB in the latest docs https://academiccharmm.org/documentation/latest/io#Write reads SEGID only from 73-76:
so I really don't see any reason against fixing the reading of SEGIDs in MDA to only use the official SEGID columns. Thus, 👍 from me for
(EDIT: This would be a Fix of a format spec. We don't need to deprecate anything although we will need to make clear in the CHANGELOG that this may lead to incorrect reading of files that were readable before.) |
This issues was accidentally closed. It's really connected to PR #4965 . |
Is your feature request related to a problem?
I faced an issue when I read my PDB file.
I downloaded the PDB file from RCSB using the
biopandas
's package.When I read it in MDAnalysis, I found unexpected segments and segids. This makes it difficult to select the atoms by chain (segid) and operate the universe at the SegmentGroup level.
I first checked my
MDAnalysis.Universe
. It seems that MDAnalysis has detected the segids and thus uses it in the MDAnalysis object. However, the PDB file I downloaded should not have segid information. Thus, from my expectation, it should use chain ID as seg ID inMDAnalysis.universe
(ref from doc).Next, I opened my PDB file and discovered that the issue stems from the exceeding digit in the tempFactor column.(see line 12-14 in the below pic).

However, for the current MDAnalysis version, there is no direct solution to correct this format issue in the seg ID. This format issue might often occur when other software processes the PDB file.
I suggest adding this feature to the main codebase so that the user can decide which information to load to segment when reading PDB files.
Describe the solution you'd like
The solution is currently available in the forked MDAnalysis repo: see changelog here
Considering the current PDBParser works well to get chain ID, the idea is to simply add a variable called
force_chainids_to_segids
. This will force the PDBParser to use chain ID as the seg ID. The user can decide whether to use it or not. Ifforce_chainids_to_segids=True
, the segments in the Universe are based on chain ID.Describe alternatives you've considered
In the current version of MDAnalysis, the only solution to select by chain is to use (but it seems we can't operate the SegmentGroup properly):
Additional context
The text was updated successfully, but these errors were encountered: