Identify side chain atoms in BioPandas dataframe

Question

I am currently implementing some metrics I could use for comparing two conformations of the same protein in Python.

For example, I know I could use the RMSD of all protein ATOMS in BioPandas using this:

biopandas.pdb.PandasPdb.rmsd(conformation1.df['ATOM'], conformation2.df['ATOM'])

Are there some out-of-the-box tools in Python for measuring the RMSD between the side chains of the two conformations? When trying to do this manually, the main issue is how to identify side chain atoms, which is not clear for me. I know how to get all atoms from a residue using BioPandas:

aminoacids = list(PandasPdb().read_pdb(my_pdb_filename).df['ATOM'].groupby(['chain_id', 'residue_number', 'residue_name']))

Now aminoacids[i] is a dataframe containing all the atoms of the i-th aminoacid, like this:

From here, it's not clear how to separate back bone atoms from side chain atoms. It's clear that the one with atom_number 'CA' will be the alpha-Carbon, but I am not sure how to identify the others. But once the separation is done, the side chain RMSD could be measure for example using Bio.SVDSuperimposer.SVDSuperimposer which works directly on the array of 3D coordinates of the atoms.

Thanks!

O.Laprevote · Accepted Answer · 2022-03-01T12:44:55.907

As you suggest one way of solving your problem would be by selecting all atoms that don't have backbone atoms names. In a pdb file I believe backbone atoms would be named 'CA', 'HA', 'N', 'HN' or 'H', 'C' and 'O'. Beware of the N-terminal (where the hydrogens would be named either 'H1', 'H2' & 'H3'; or 'HN1', 'HN2' and 'HN3') and the C-terminal (the other oxygen would be named 'OXT').

One way of doing it is by using the handy selections from pandas dataframes. It's probably a more elegant way than making a list of dataframes.

from biopandas.pdb import PandasPdb
bpdf_1 = PandasPdb().fetch_pdb('1t48').df['ATOM']
bpdf_2 = PandasPdb().fetch_pdb('1t49').df['ATOM']
Function returns a list of boolean. True when it's not a backbone atom.
sel = lambda df: df['atom_name'].str.contains('[^(CA|HA|N|C|O|HN|H)]$')
The number of atom is not the same, so I only align the first 200
sidechain_rmsd = PandasPdb.rmsd(bpdf_1[sel][:200], bpdf_2[sel][:200])

The regex says something like "(not (CA or HA or …)) end of line". The end of line is important, otherwise any atom containing N, C, O or H in their name would be excluded (so most of them).

Note that with this method it's still possible to use the PandasPdb selectors. For example, to only consider heavy atoms (no hydrogen):

sidechain_rmsd = PandasPdb.rmsd(bpdf_1[sel][:200],
                                bpdf_2[sel][:200],
                                s='heavy')

EDIT:

Another way to do it would be:

sidechain_rmsd = PandasPdb.rmsd(bpdf_1[:200],
                                bpdf_2[:200],
                                s='main chain',
                                invert=True)

You don't need any selection in this case, BUT this is a bad idea if you have hydrogens, as it will also align them (including the backbone ones).

If you want to align each residue one by one instead: I'm sure this gives you the tools to do it, though it would be a pretty bad idea. In this very case I'd advise to try comparing side-chains using dihedral angles (hence rotamers). You can have a look at pyrosetta for this. — O.Laprevote, Mar 01 '22 at 12:53
Thanks for your detailed response! To check that I understood right: the H[0-9]* and HN[0-9]* are the names of the hydrogen atoms in the amino group? So I need to filter them out. — CubeHead, Mar 01 '22 at 13:40
Yes, although you can get away by just filtering out H[1-3] and HN[1-3]. More hydrogen atoms would be disturbing. — O.Laprevote, Mar 01 '22 at 15:55
My bad: H[1-3]? and HN[1-3]?, otherwise you don't cover cases where there is only H or only HN. Remember to keep the $ at the end, otherwise you'd end up matching all H. — O.Laprevote, Mar 01 '22 at 16:19

Identify side chain atoms in BioPandas dataframe

1 Answers1

Function returns a list of boolean. True when it's not a backbone atom.

The number of atom is not the same, so I only align the first 200