Skip to content

Add force read chain to segment when reading a PDB file / set groups by ids #4948

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
yuyuan871111 opened this issue Mar 6, 2025 · 20 comments · Fixed by #4965 or #5001
Closed

Add force read chain to segment when reading a PDB file / set groups by ids #4948

yuyuan871111 opened this issue Mar 6, 2025 · 20 comments · Fixed by #4965 or #5001

Comments

@yuyuan871111
Copy link
Contributor

Is your feature request related to a problem?

I faced an issue when I read my PDB file.

I downloaded the PDB file from RCSB using the biopandas's package.

# download the files
from biopandas.pdb import PandasPdb

ppdb = PandasPdb().fetch_pdb("5N69")
ppdb.to_pdb(path='./dataset/examples/5N69.pdb', records=['ATOM', 'HETATM'])

When I read it in MDAnalysis, I found unexpected segments and segids. This makes it difficult to select the atoms by chain (segid) and operate the universe at the SegmentGroup level.

Image

I first checked my MDAnalysis.Universe. It seems that MDAnalysis has detected the segids and thus uses it in the MDAnalysis object. However, the PDB file I downloaded should not have segid information. Thus, from my expectation, it should use chain ID as seg ID in MDAnalysis.universe (ref from doc).

Next, I opened my PDB file and discovered that the issue stems from the exceeding digit in the tempFactor column.(see line 12-14 in the below pic).
Image

However, for the current MDAnalysis version, there is no direct solution to correct this format issue in the seg ID. This format issue might often occur when other software processes the PDB file.

I suggest adding this feature to the main codebase so that the user can decide which information to load to segment when reading PDB files.

Describe the solution you'd like

The solution is currently available in the forked MDAnalysis repo: see changelog here

Considering the current PDBParser works well to get chain ID, the idea is to simply add a variable called force_chainids_to_segids. This will force the PDBParser to use chain ID as the seg ID. The user can decide whether to use it or not. If force_chainids_to_segids=True, the segments in the Universe are based on chain ID.

# read the universe in the future
u = mda.Universe(pdb_path, force_chainids_to_segids=True)

Describe alternatives you've considered

In the current version of MDAnalysis, the only solution to select by chain is to use (but it seems we can't operate the SegmentGroup properly):

# for instance, to select chain A
u.select_atoms('chainID A')

# to operate the segment (chain), might need to select atom first rather than using the SegmentGroup directly
u_chainA = u.select_atoms('chainID A')
u_chainA

Additional context

@orbeckst
Copy link
Member

It looks to me that ppdb.to_pdb() has a bug if it writes a PDB file like the one you're showing. PDB files are column-aligned, see the ATOM record

61 - 66        Real(6.2)     tempFactor   Temperature  factor.

Did you check that MDAnalysis reads your file when you fix the format first?

@yuyuan871111
Copy link
Contributor Author

yuyuan871111 commented Mar 11, 2025

Hi @orbeckst,

Thanks for your reply. I agree with you that the ppdb.to_pdb() probably has a bug when writing PDB files. It does not strictly follow the standard PDB format (at least in my case, the b-factor values in some records used columns 62-67 instead of 61-66).

I also checked with the PDB format ATOM record. Actually, the official document doesn't mention columns 67-76 in both ATOM and HETATM records (in default, should be blanks). A non-standard format might use columns 73-76 for segid but still not mention columns 67-72 (see #947 (comment), some other sources: link1, link2). Thus, how to process these columns in PDB files is kind of a grey area at the moment.

However, the current MDAnalysis version 2.8.0 used columns 67-76 as segID (unofficial CHARMM extension ?). I think it is worth modifying the definition to columns 73-76 for segid in order to follow some possible and common PDB format. This will make MDAnalysis more robust and not to load weird information into the segid. I think it helps the design of MDAnalysis to read compatible PDB files correctly downloaded from various sources or generated by different software using this unofficial column.


For chain ID and seg ID, you have mentioned the difference in #2875 (comment). I think that is why MDAnalysis keeps both chain ID and seg ID when reading a PDB. And now thankfully, we can use chainID for selecting atoms.

However, I still think adding a feature to forcibly read the chain to the segment (or said prioritise chain ID for seg ID) is worth it from the viewpoint of a MDAnalysis user, or at least generating a warning to let the users know which column is loaded into the segid. An MDAnalysis user usually expects to read PDB's chain ID into segid (assuming the most common case is the standard PDB file) (also see #2874).

My implementation is to enable the PDBParser.py accepting the kwargs. If an existing variable force_chainids_to_segids exists in kwargs, it will check whether chainids should replace segids.

This implementation does not affect the default setting of PDBParser but just adds a feature for the user to decide whether to prioritise the chainid for segid. This implementation might be more straightforward for a user. But, I'm unsure what the concerns might be if we were to pass an argument (seems a concern from #947 (comment)).

@orbeckst
Copy link
Member

I had forgotten how much discussion there had been on chainID/SEGID in the past... good that you looked at these issues.

Interesting that Chimera labels the SEGID in PDB files as obsolete.

The fact that inconsistent SEGID/chainID identifiers can split residues is a bit concerning #2874 (comment) .

Given the discussions above (in particular #2874 ) it makes sense to give the user an option to use chainIDs as segids.

@orbeckst
Copy link
Member

@RMeli and @xiki-tempula , given your opinions in #2874 , do you have some input here?

@xiki-tempula
Copy link
Contributor

xiki-tempula commented Mar 14, 2025

This is a difficult problem. I think the issue is if we having conflicting segid and chain id, what do we do about it. Is there a simple way of overwrite the chain id with seg id after the PDB is loaded?
That being said, I feel that PDB has been a bad dumping ground for too long and we shall just simply all move to mmcif and keep the PDB parser as simple as possible.

@yuyuan871111
Copy link
Contributor Author

yuyuan871111 commented Mar 14, 2025

Hi @xiki-tempula, thanks for commenting on this.

This is a difficult problem. I think the issue is if we having conflicting segid and chain id, what do we do about it. Is there a simple way of overwrite the chain id with seg id after the PDB is loaded? That being said, I feel that PDB has been a bad dumping ground for too long and we shall just simply all move to mmcif and keep the PDB parser as simple as possible.

I know the PDB thing is a little bit tricky, but a new user of MDAnalysis (or, as the case may be, a new biomolecular researcher) always tests their PDB files when they first use MDAnalysis (especially, it is classical to demonstrate to the students in related fields). We don't want to confuse them and would like to provide an easy solution to help them get what they want.

However, I think there is no simple solution to prioritise the chain ID with the seg ID after the PDB is loaded (probably the same meaning as "without modifying PDB parser or Universe").

It is because the seg ID also affects the selection of the SegmentGroup in MDAnalysis.Universe (see below obviously, topology's _n_segments is used to generate the SegmentGroup in _generate_from_topology() but we can't modify this attribute in topology). Thus, if we want to write the chain ID to the seg ID, we will also need to modify SegmentGroup (probably ResidueGroup as well #2874 (comment)). This means that we need to either add some codes in __init__() of MDAnlaysis.Universe between _topology_from_file_like and _generate_from_topology (I don't prefer this one, because it might lead to potential unawared issues of the topology by other topology parsers) or modify the codes in PDBParser (I think it is the easier solution).

# MDAnalysis.core.universe
...

def _topology_from_file_like(topology_file, topology_format=None,
                             **kwargs):
    parser = get_parser_for(topology_file, format=topology_format)

    try:
        with parser(topology_file) as p:
            topology = p.parse(**kwargs)  ### [where the PDBParser is called]
    ...

...

def _generate_from_topology(universe):
    # generate Universe version of each class
    # AG, RG, SG, A, R, S
    universe._class_bases, universe._classes = groups.make_classes()

    # Put Group level stuff from topology into class
    for attr in universe._topology.attrs:
        universe._process_attr(attr)

    # Generate atoms, residues and segments.
    # These are the first such groups generated for this universe, so
    #  there are no cached merged classes yet. Otherwise those could be
    #  used directly to get a (very) small speedup. (Only really pays off
    #  the readability loss if instantiating millions of AtomGroups at
    #  once.)
    universe.atoms = AtomGroup(np.arange(universe._topology.n_atoms), universe)

    universe.residues = ResidueGroup(
            np.arange(universe._topology.n_residues), universe)

    universe.segments = SegmentGroup(
            np.arange(universe._topology.n_segments), universe)

...

class Universe(object):
    ...
    def __init(self, ...):
        ...
        
        if not isinstance(topology, Topology) and not topology is None:
            self.filename = _check_file_like(topology)
            topology = _topology_from_file_like(self.filename,
                                                topology_format=topology_format,
                                                **kwargs)

        if topology is not None:
            self._topology = topology
        else:
            # point to Universe.empty instead of making empty universe
            raise TypeError('Topology argument required to make Universe. '
                            'Try Universe.empty(n_atoms, ...) to construct '
                            'your own Universe.')

        _generate_from_topology(self)  # make real atoms, res, segments

        coordinates = _resolve_coordinates(self.filename, *coordinates,
                                           format=format,
                                           all_coordinates=all_coordinates)
        ...

If not modify anything in PDBParser, what it could be to overwrite the seg id by chain id (version 2.9.0 or below)

I tried to test by assigning new segments/residues with previous examples #2874 (comment). The user to overwrite the seg id by chain id (or the other way around) might consider the codes below:

Codes:

from io import StringIO
from MDAnalysis.core.groups import SegmentGroup, ResidueGroup
import numpy as np
import MDAnalysis as mda

# print(mda.__version__)

res1_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00  0.00           C
ATOM    664  CG2 THR A 315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR A 315      21.172  14.548   1.274  1.00  0.00           O
"""

res2_str = """\
ATOM    659  N   THR A 315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR A 315      22.888  13.803  -0.302  1.00 15.13         A C
ATOM    661  C   THR A 315      22.006  12.700  -0.882  1.00 15.69         A C
ATOM    662  O   THR A 315      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR A 315      22.481  13.956   1.182  1.00 16.22         A C
ATOM    664  CG2 THR A 315      22.874  15.310   1.747  1.00 17.32         A C
ATOM    665  OG1 THR A 315      21.047  13.922   1.304  1.00 15.14         A O
"""

res1 = mda.Universe(StringIO(res1_str), format='PDB')
res2 = mda.Universe(StringIO(res2_str), format='PDB')

print("test1")
print("res1:", res1.segments)
print("res2:", res2.segments)
print('-'*50)

## because the attibutes in u._topology (e.g. n_residues and n_segments) seems not to be modified, 
## we need to modify the universe by hands

# modify the segment ID of res1 (atom by atom)
for id, chainID in zip(res1.atoms.ids, res1.atoms.chainIDs):
    selection = res1.select_atoms(f"id {id}")
    selection.segments.segids = chainID

print("test2")
print("res1:", res1.segments)
print("res2:", res2.segments)
print('-'*50)

# modify the segment group for res1
unique_segids = np.unique(res1.atoms.segments.segids) 
res1.segments = SegmentGroup(np.arange(len(unique_segids)), res1)
# Note that, this will lead to a merge of the segment for records of ATOM and HETATM with same segid.
# But, it is fine in this example.

print("test3")
print("res1:", res1.segments)
print("res2:", res2.segments)
assert len(res1.segments) == len(res2.segments)
print('-'*50)

# modify the residue group for res1
n_residues = 0 # initialize the number of residues
for each_segment in res1.segments:
    unique_resids = np.unique(each_segment.residues.resids)
    n_residues += len(unique_resids)

res1.residues = ResidueGroup(np.arange(n_residues), res1)

print("test4")
print("res1:", res1.residues)
print("res2:", res2.residues)
assert len(res1.residues) == len(res2.residues)

print("test5")
print("res1:", res1.atoms)
print("res2:", res2.atoms)
assert len(res1.atoms) == len(res2.atoms)

Expected outcomes:

test1
res1: <SegmentGroup [<Segment A>, <Segment >, <Segment A>, <Segment >]>
res2: <SegmentGroup [<Segment A>]>
--------------------------------------------------
test2
res1: <SegmentGroup [<Segment A>, <Segment A>, <Segment A>, <Segment A>]>
res2: <SegmentGroup [<Segment A>]>
--------------------------------------------------
test3
res1: <SegmentGroup [<Segment A>]>
res2: <SegmentGroup [<Segment A>]>
--------------------------------------------------
test4
res1: <ResidueGroup [<Residue THR, 315>]>
res2: <ResidueGroup [<Residue THR, 315>]>
test5
res1: <AtomGroup [<Atom 1: N of type N of resname THR, resid 315 and segid A and altLoc >, ...>
res2: <AtomGroup [<Atom 1: N of type N of resname THR, resid 315 and segid A and altLoc >, ...>

@xiki-tempula
Copy link
Contributor

I think the workaround is too complicated for average user.
u = mda.Universe(pdb_path, force_chainids_to_segids=True)
seems to be a good choice.

@orbeckst
Copy link
Member

@lilyminium given your careful documentation of segid/chainID handling in the userguide, do you have an opinion on this issue here, namely should we provide an option to have the PDBParser set segids from CHAINID, and if yes, what should be taken into account?

@marinegor based on your experience with PDB files in structural biology, do you have an opinion here (apart from "don't use PDB, use PDBx/mmcif" ;-))?

@marinegor
Copy link
Contributor

@orbeckst that's pretty much it, to be honest. It's an outdated format that that, moreover, does not really convey the difficulties of working with structural ensembles (although even mmcif does not fully comply with what researchers want -- see https://doi.org/10.1107/S2052252524005098).

My honest opinion(s):

  • if an input PDB file is not compliant with the PDB standard, I would be strictly against adding features that enable us reading this file. I understand that structural biology community over the decades has developed a metric ton of different PDB standards -- however, I don't think MDAnalysis should support any of them, unless it already can read these standards due to historical reasons.
  • I would also be not in favor of adding force_chainids_to_segids to the PDBReader, but would instead suggest adding a function (or a transformation?) that changes a Universe accordingly. My reasoning behind is that possibly multiple Readers need this option (for instance, MMCIF reader in Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712), and implementing it in one place makes more sense to me.

@lilyminium
Copy link
Member

lilyminium commented Mar 16, 2025

Thanks for starting this discussion @yuyuan871111, like Oliver said there's been a lot of discussion around segid/chainid in the past.

I'll be honest I've never really used PDB files with segments/segids, so I've personally always found chainID more useful. I also would prioritise chainID over segid as it is defined in the official spec, although that would be a big change in behaviour. No strong opinions either way on adding a force_chainids_to_segids kwarg that is False by default, other than that it clutters up the initialization arguments a little... so I have some alternative suggestions below. I do agree with @marinegor that if something is added it shouldn't be PDB-specific.

In both alternative suggestions below they wouldn't be one-liners exactly, but more like a couple lines of code:

  • try to exploit the new guessers API to use force_guess to override segment IDs by "guessing" / assigning chainIDs to segIDs ... not 100% sure how easy this would be. In use this could look something like:
u = mda.Universe("test.pdb")
u.segments.segids = "" # or even just ignore any previously read segids
u.guess_TopologyAttrs(to_guess=["segids"], force_guess=["segids"])
  • Make it easier to re-create a universe with different Residue/SegmentGroups, i.e. wrapping @yuyuan871111's workflow behind a method.
u = mda.Universe("test.pdb")
atomwise_segids = u.atoms.chainIDs
u.set_segments_by_id(atomwise_segids)

@marinegor
Copy link
Contributor

I'll be honest I've never really used PDB files with segments/segids, so I've personally always found chainID more useful

Exactly. I've been doing X-ray crystallography and cryo-EM for seven years, and came across SEGIDs exactly once. The usecase was, the structure encoded multiple conformations of different residues, and these conformations weren't independent, i.e. if residue 10 was in conformation A, residue 20 couldn't accommodate conformation B. And SEGID was used to assign the conformation -- all residues with SEGID=X were considered as one conformation, and all with SEGID=Y as the other.

But as for force_chainids_to_segids=True, I do think it's useful to have it in the library. Even phenix crystallographic suite has such option in their pdbtools, which probably means that this is a highly popular step when working with PDB files: https://www.mrc-lmb.cam.ac.uk/public/xtal/doc/phenix/reference/pdbtools.html

@yuyuan871111
Copy link
Contributor Author

yuyuan871111 commented Mar 16, 2025

Thanks @marinegor @lilyminium for the valuable suggestions, I will summarize these discussions later and finalize the codes. Before that, one thing I would like to confirm:

From my previous comment #4948 (comment), the columns 67-72 in MDAnalysis PDBParser are not defined in the official PDB format document. These columns seem not to have any historical reason to be read as part of segid (in contrast, columns 73-76 are used by Chimera, at least). Additionally, Gemmi follows the columns 73-76 for segids in PDB format (PDB v2).

From your experience, did you find anyone or any software read/use/write the information at columns 67-72? (If not, I think this is a definite thing needs to be fixed.)

I also checked with the PDB format ATOM record. Actually, the official document doesn't mention columns 67-76 in both ATOM and HETATM records (in default, should be blanks). A non-standard format might use columns 73-76 for segid but still not mention columns 67-72 (see #947 (comment), some other sources: link1, link2). Thus, how to process these columns in PDB files is kind of a grey area at the moment.

However, the current MDAnalysis version 2.8.0 used columns 67-76 as segID (unofficial CHARMM extension ?). I think it is worth modifying the definition to columns 73-76 for segid in order to follow some possible and common PDB format. This will make MDAnalysis more robust and not to load weird information into the segid. I think it helps the design of MDAnalysis to read compatible PDB files correctly downloaded from various sources or generated by different software using this unofficial column.


@lilyminium , thanks for your inspiration of the codes:

u = mda.Universe("test.pdb")
u.segments.segids = "" # or even just ignore any previously read segids
u.guess_TopologyAttrs(to_guess=["segids"], force_guess=["segids"])
  • Make it easier to re-create a universe with different Residue/SegmentGroups, i.e. wrapping @yuyuan871111's workflow behind a method.
u = mda.Universe("test.pdb")
atomwise_segids = u.atoms.chainIDs
u.set_segments_by_id(atomwise_segids)

The implementation above #4948 (comment) might not consider different chains with the same resids, etc. I think it is better to update topology when new customs segids is provided (or to guess).

Thus, I have developed a method to merge "guessing" and "setting" in a function called guess_or_set_segments() in #4965.
This function can be used on the topology read from different file format by updating the topology with new segids.

However, I am not sure it would be better to move the new function as part of guess_TopologyAttrs or stay as an independent function of the Universe.

@lilyminium
Copy link
Member

Thanks for the update in the PR @yuyuan871111, as mentioned there I'm replying here to centralize discussion.

Re-reading the discussion I'm pretty open about and happy to ignore columns 67-72 if we can't find any other codes that use them. I think this won't result in backwards incompatibility with round-tripping since IIRC we don't use these columns when we write out (although I could be wrong there). IMO that's the simplest and best solution here and I think I overcomplicated this issue earlier by suggesting alternatives.

I'll continue discussing the possible alternatives below, but to be clear I think it might be best to ignore these for now!

guess_or_set_segments

One core goal of API design IMO is to be as clear about what each function does by its name as possible. I'm not sure I immediately understand what's going in guess_or_set_segments just from its name and docstring, e.g. if it respects any previously read-in segids or just overwrites them -- mostly I'm not a huge fan of methods that are ambiguous about what they do. A function called set_segments_by_id or even set_groups to allow for setting residue and segment groups at the same time is a bit clearer about what it does, and the user could fairly easily pass in u.atoms.chainIDs as the segids if that was needed. This could also work in tandem with guessing using the guess_TopologyAttrs / DefaultGuesser methods, potentially with a flag like reset_groups=True if necessary.

But as I said, probably overcomplicating things a bit.

@lilyminium
Copy link
Member

Actually, re-reading the discussion again there are two issues being discussed here that are similar but not the same:

  1. ignore columns 67-72 or include them in reading segid
  2. add a flag or method to Universe or format parsers to use the chainID as the segment ID

So I suppose it's not overcomplicating to discuss how to do the second. IMO then force_chainids_to_segids is a very specific flag and something more general like re/set_groups(atomwise_resids=None, atomwise_segids=None) would be more elegant and useful, but that's not a strong opinion. This latter option also lets us move issues like conflicting segids and chainids into the guessing functionality, where there's already ways to address that (i.e. set force_guess).

@yuxuanzhuang
Copy link
Contributor

FYI, BioPandas has already fixed this bug in BioPandas/biopandas#133

I do echo that it would be useful to handle these not-that-wrong PDB files more gracefully---at least have a warning.

@yuyuan871111
Copy link
Contributor Author

yuyuan871111 commented Mar 19, 2025

Hi @lilyminium, thanks for your comments.

I updated the function and named it set_groups with the two arguments you suggested (atomwise_resids and atomwise_segids). I have tested the function with PDB format and GRO format, and it works. But, not sure whether also works with other universe read from other topology parser.

Thanks @yuxuanzhuang, I also added an warning in PDB parser when the raw chain ids and seg ids are different.

If anyone could test the functions, that would be super helpful!!!!

Hopefully, my PR will be merged soon.

@yuyuan871111
Copy link
Contributor Author

yuyuan871111 commented Mar 19, 2025

Summary of this issue

Based on the discussion above, I would like to summarize here. Two issues are going to be solved, but they are highly related to the selection of segments. The solution is at PR #4965:

Issue 1. ignore columns 67-72 or include them in reading segid

  • Improvement: Ignore columns 67-72 to meet the standard PDB format, and use only 73-76 for segids (this can also avoid redundant characters wrongly loading into segid column)

Issue 2. add a flag or method to Universe or format parsers to use the chainID as the segment ID

  • Improvement 1 (PDB specific function): add an argument force_chainds_to_segids for the user to forcibly use chain ID column as the segment ID when reading PDB file. Then, SegmentGroup will be created correspondingly.
  • Improvement 2 (generaized method in Universe): add a more generalized solution but do a similar thing as solution 1. It is called Universe.set_groups(). Given an atomwise resids and/or segids, the user can easily modify
    (1) the groups of Universe.residues and Universe.segments
    (2) segids, resids in Universe.atoms, Universe.residues, Universe.segments.

Examples:

import MDAnalysis as mda

# Issue 2 (improvement 1): read PDB forcibly using chainID to replace segid
u = mda.Universe("your/pdb/file", force_chainids_to_segids=True)

# Issue 2 (improvement 2): read a PDB and set groups (residues/segments) to replace segid by chain ID
u = mda.Universe("your/pdb/file")
u.set_groups(atomwise_segids=u.atoms.chainIDs)

# Issue 2 (improvement 2): read a topology file (e.g. GRO) and set segments in the MDAnalysis 
# (it is fine, even if there is no seg information in your topology file.)
u = mda.Universe("your/topo/file")  # don't have to be PDB
atomwise_chainIDs = ["A", "A", "A", "A", "B", "B"]
u.set_groups(atomwise_segids=atomwise_chainIDs)

# Issue 2 (improvement 2): read a topology file (e.g. GRO) and set new residue IDs in the MDAnalysis
u = mda.Universe("your/topo/file")  # don't have to be PDB
atomwise_resids = [1,1,1,1,2,2,2,2,3,3,3,3,3,3,3]
u.set_groups(atomwise_resids=atomwise_resids)

@yuyuan871111
Copy link
Contributor Author

yuyuan871111 commented Mar 20, 2025

Additionally, the set_groups function can also deal with the issue Don't add segID to PDB files by using the modified codes from #3505 (comment):

import MDAnalysis as mda
from io import StringIO

no_chain_str = """\
ATOM    659  N   THR   315      22.716  15.055  -1.000  1.00 16.08         A N
ATOM    660  CA  THR   315      22.888  13.803  -0.302  1.00  0.00           C
ATOM    661  C   THR   315      22.006  12.700  -0.882  1.00  0.00           C
ATOM    662  O   THR   316      21.138  12.959  -1.727  1.00 16.25         A O
ATOM    663  CB  THR   314      22.481  13.956   1.182  1.00  0.00         B C
ATOM    664  CG2 THR   315      23.384  14.924   1.927  1.00  0.00           C
ATOM    665  OG1 THR   315      21.172  14.548   1.274  1.00  0.00           O
"""

# read the file
no_chain = mda.Universe(StringIO(no_chain_str), format='PDB')

# now add a chainIDs attribute for your chains (if you don't have one already)
no_chain.add_TopologyAttr('chainIDs')

# set chain IDs for the atoms
no_chain.atoms.chainIDs = ['A', 'A', 'A', 'A', 'B', 'B', 'B']

# set empty segment ids
no_chain.set_groups(atomwise_segids = ["" for _ in range(no_chain.atoms.n_atoms)])

# now, there is only one segment with the segid ""
no_chain.segments    # <SegmentGroup with 1 segment>
no_chain.segments.segids   # array([''], dtype=object)

# write out the file
no_chain.atoms.write('no_chain.pdb')

Output of no_chain.pdb:

...
ATOM      1  N   THR A 315      22.716  15.055  -1.000  1.00 16.08           N  
ATOM      2  CA  THR A 315      22.888  13.803  -0.302  1.00  0.00           C  
ATOM      3  C   THR A 315      22.006  12.700  -0.882  1.00  0.00           C  
ATOM      4  O   THR A 316      21.138  12.959  -1.727  1.00 16.25           O  
ATOM      5  CB  THR B 314      22.481  13.956   1.182  1.00  0.00           C  
ATOM      6  CG2 THR B 315      23.384  14.924   1.927  1.00  0.00           C  
ATOM      7  OG1 THR B 315      21.172  14.548   1.274  1.00  0.00           O  
...

@yuyuan871111 yuyuan871111 changed the title Add force read chain to segment when reading a PDB file Add force read chain to segment when reading a PDB file / set groups by ids Mar 21, 2025
@orbeckst
Copy link
Member

orbeckst commented Mar 26, 2025

Even CHARMM PDB in the latest docs https://academiccharmm.org/documentation/latest/io#Write reads SEGID only from 73-76:

If the SEGI option is used the sequence is read for the atoms that have the corresponding segid in columns 73-76 of the PDB file.

so I really don't see any reason against fixing the reading of SEGIDs in MDA to only use the official SEGID columns. Thus, 👍 from me for

  • Improvement: Ignore columns 67-72 to meet the standard PDB format, and use only 73-76 for segids (this can also avoid redundant characters wrongly loading into segid column)

(EDIT: This would be a Fix of a format spec. We don't need to deprecate anything although we will need to make clear in the CHANGELOG that this may lead to incorrect reading of files that were readable before.)

@orbeckst
Copy link
Member

This issues was accidentally closed. It's really connected to PR #4965 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment