
I have a list of PDB IDs with their relative chain IDs. Each chain has to be extracted and then run through DSSP. For the single-chain extraction I tried several methods, such as:

import pymol2

# Each entry is a PDB ID followed by the chain ID, e.g. chain AC of 4YBB
entries = ['4YBBAC']
for entry in entries:
    with pymol2.PyMOL() as pymol:
        pymol.cmd.fetch(entry)
        pymol.cmd.save(entry + '.cif')

Here, for example, to extract chain AC of 4YBB, entry = 4YBBAC. This saves the cif files correctly (PyMOL visualizes them properly), but they are not suitable for mkdssp. Indeed, if I run

mkdssp 4YBBAC

I get

DSSP could not be created due to an error:
Is this an mmCIF file?

I also tried Biopython:

from Bio.PDB import MMCIFParser
from Bio.PDB.mmcifio import MMCIFIO
parser = MMCIFParser()
io = MMCIFIO()
structure = parser.get_structure("4YBB", "4YBB.cif")
model = structure[0]
chain = model['AC']
io.set_structure(chain)
io.save("4YBBAC.cif")

But again, when running mkdssp, I get:

DSSP could not be created due to an error:
bad lexical cast: source type value could not be interpreted as target

On closer inspection, both methods drop the columns _atom_site.pdbx_formal_charge, _atom_site.auth_comp_id and _atom_site.auth_atom_id from the file, while retaining all the others. This is the first coordinate line from the Biopython/PyMOL file:

ATOM   1    N N   . GLY A ? 1   ? -125.713 46.378 -19.108 1.0 77.15  2   AC 1 

And this is the same line in the original file:

ATOM   34684  N  N     . GLY C   3  1    ? -125.713 46.378   -19.108  1.00 77.15  ?  2    GLY AC N     1
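
The column difference described above can be checked programmatically. The following is a minimal sketch using only the standard library; the helper names atom_site_columns and missing_columns are hypothetical, and the two header strings are shortened, made-up stand-ins for the real files, not the actual 4YBB headers.

```python
def atom_site_columns(cif_text):
    """Return the _atom_site.* column names in declaration order."""
    prefix = "_atom_site."
    columns = []
    for line in cif_text.splitlines():
        token = line.strip()
        if token.startswith(prefix):
            # Keep only the tag itself and drop the category prefix.
            columns.append(token.split()[0][len(prefix):])
    return columns


def missing_columns(original_text, extracted_text):
    """List columns present in the original file but absent from the extracted one."""
    extracted = set(atom_site_columns(extracted_text))
    return [c for c in atom_site_columns(original_text) if c not in extracted]


# Shortened, illustrative headers (assumption: not the real file contents).
original_header = """loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.pdbx_formal_charge
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
"""
extracted_header = """loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.auth_asym_id
"""

print(missing_columns(original_header, extracted_header))
```

With real files, the two strings would be read from the original 4YBB.cif and the extracted 4YBBAC.cif.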

I think that renaming the AC chain to a single-character code would in principle resolve the issue, but it's more of a workaround.

mkdssp does support multi-character chain IDs. I verified this first by cutting and pasting the chain into the file, and then with an ad-hoc parser. I think the issue lies in how Biopython and PyMOL extract the chains.
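
For illustration, an ad-hoc filter of this kind could look like the sketch below. It is not the parser I actually used: the function name extract_chain is hypothetical, and it assumes that the _atom_site rows are whitespace-separated with no spaces inside quoted values. It keeps every line of the file untouched except the _atom_site rows that belong to other chains, so all columns are retained. The sample text is a made-up fragment, and I have not checked the output of this exact sketch against mkdssp.

```python
def extract_chain(cif_text, chain_id, column="auth_asym_id"):
    """Drop _atom_site rows whose chain differs from chain_id; keep all other lines."""
    prefix = "_atom_site."
    columns = []  # _atom_site column names; non-empty while inside that loop
    kept = []
    for line in cif_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            columns.append(stripped.split()[0][len(prefix):])
        elif columns:
            if not stripped or stripped.startswith(("#", "_", "loop_", "data_")):
                columns = []  # the _atom_site loop has ended
            else:
                fields = stripped.split()
                # Only filter rows that parse cleanly; anything else is kept as-is.
                if (len(fields) == len(columns) and column in columns
                        and fields[columns.index(column)] != chain_id):
                    continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up fragment for demonstration (assumption: not real 4YBB data).
sample = """data_demo
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.auth_asym_id
ATOM 1 N AC
ATOM 2 CA AC
ATOM 3 N B
#
_cell.length_a 10.0
"""

print(extract_chain(sample, "AC"))
```

Passing column="label_asym_id" would filter on the re-assigned mmCIF chain IDs instead of the author chain IDs.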

Is this a bug? Am I using the tools incorrectly? Can this be achieved with other (ideally hassle-free) methods?

More on this topic on the Biopython GitHub:

https://github.com/biopython/biopython/issues/3439

pippo1980
saiden
  • Could you post the PDBx/mmCIF loop_ that precedes the coordinate lines of the pre- and post-extraction structures/chains, to better show what dropping the columns means? – pippo1980 Jan 30 '23 at 22:18
  • Biopython 1.81 development https://biopython.org/docs/dev/api/Bio.PDB.DSSP.html has a module for DSSP – pippo1980 Jan 31 '23 at 14:10
  • To me it looks like https://github.com/biopython/biopython/blob/master/Bio/PDB/MMCIFParser.py is missing 'pdbx_formal_charge', while https://github.com/biopython/biopython/blob/master/Bio/PDB/mmcifio.py has it. – pippo1980 Feb 01 '23 at 18:18
  • If you are using PyMOL, why not use its command dss? PyMOL annoyingly removes most entries except the ATOM/HETATM data from mmCIF and PDB files, so the SS data would not be saved, but it's easy to append to the file (I have done it here for example as NGL.js requires SHEET and HELIX) – Matteo Ferla Feb 03 '23 at 09:57
  • @saiden why do you think that "renaming the AC chain to a single-character code would in principle resolve the issue, but it's more of a workaround"? Have you tried it? – pippo1980 Jul 01 '23 at 15:33
  • Not sure if it's relevant, but did you check the latest version of the parser? https://biopython.org/docs/1.80/api/Bio.PDB.MMCIFParser.html It has: auth_chains - True by default. If true, use the author chain IDs. If false, use the re-assigned mmCIF chain IDs. And: auth_residues - True by default. If true, use the author residue numbering. If false, use the mmCIF “label” residue numbering, which has no insertion codes, and strictly increments residue numbers. NOTE: Non-polymers such as water don’t have a “label” residue number, and will be skipped. – pippo1980 Jul 01 '23 at 15:43
