Advanced information about templates#
The interpretation of the valence (bonds) and formal charge of atoms is an essential step when parsing a PDB/CIF file, and the accuracy of residue mapping is crucial to the creation of a macrobiomolecule system. In Meeko, the input residue names are used as keys and the chemical templates are retrieved accordingly based on templates.
For the command line script for receptor preparation, mk_prepare_receptor.py
, there are three major ways of obtaining such templates:
(1) Load from the default JSON file: Meeko/meeko/data/residue_chem_templates.json
This is the default residue template set curated by us, including:
the standard residues of proteins, RNAs and DNAs,
the modified residues from the following Amber24 OFF library files:
lib
├── amino19.lib
├── amino19ipq_0.9.lib
├── aminoct19ipq_0.9.lib
├── aminont19ipq_0.9.lib
├── phosaa19SB.lib
├── mod_amino19.lib
├── RNA.lib
├── terminalphos.LJbb-RNA.lib
├── DNA.OL15.lib
├── parmBSC1.lib
└── all_modrna08.lib
the ligands in the CCD (Chemical Component Dictionary) that have conflicting names with the above residues.
(2) Load from an additional JSON file by argument --add_templates
, e.g., --add_templates Meeko/meeko/data/NAKB_templates.json
This is an optional add-on template set generated by us, based on the curated set of modified nucleotides by Nucleic Acid Knowledgebase (NAKB).
(3) Fetch from PDB by CCD name during runtime
When an unknown residue is encountered, mk_prepare_receptor.py
attempts to resolve its chemical identity by fetching a definition CIF file from PDB (Protein Data Bank) and generates chemical templates of all possible embedding forms of it when there are inter-residue bonds. Currently, this is an automated yet relatively simple process that only supports noncovalent ligands and residues with unmodified backbones.
Here, we present a quick guide to building potentially complicated residue templates using the meeko.chemtempgen
submodule. In this example, we will prepare a residue with modified backbone: CRO
, a naturally occurring fluorophore in green fluorescent proteins, formed by condensation of three consecutive residues Ser-Tyr-Gly.
Example usage#
Before we start, we will import the required modules and, optionally, suppress excess rdkit loggings that may occur during the editing of molecular structures. Then, we will create a ChemicalComponent
from a definition CIF file, which will be obtained by fetch_from_pdb
(Internet connection is required).
from meeko.chemtempgen import *
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit import RDLogger
from PIL import Image
import io
import copy
import logging
import sys
rdkit_logger = RDLogger.logger()
rdkit_logger.setLevel(RDLogger.CRITICAL)
# Create a chemical component from the definition CIF file
basename = "CRO"
CRO_from_cif = ChemicalComponent.from_cif(fetch_from_pdb(basename), basename)
The created ChemicalComponent
object, CRO_from_cif
, has a corresponding RDKit molecule where the atom names from the definition CIF file are stored under the "atom_id"
property per atom. For a quick check, you may draw the RDKit molecule with noted atom names:
def draw_cc_mol(cc_mol: Chem.Mol):
# Label atoms by atom name
for atom in cc_mol.GetAtoms():
atom.SetProp("atomNote", atom.GetProp("atom_id"))
# Draw the molecule
drawer = Draw.MolDraw2DCairo(600, 600)
drawer.DrawMolecule(cc_mol)
drawer.FinishDrawing()
# Get the image as PNG
png_data = drawer.GetDrawingText()
img = Image.open(io.BytesIO(png_data))
img.show()
draw_cc_mol(CRO_from_cif.rdkit_mol)
As we may see from the picture above, in order to forge CRO
into a linking embedded fragment in a protein, some atoms need to be removed. In this example, we will simply do so by specifying the atom names. make_embedded
calls function embed
on the duplicated object cc
, which takes embed_allowed_smarts
as the editable zone and removes atoms matching the names in leaving_names
. Here, the embed_allowed_smarts
is chosen to be the SMARTS of altered backbone in residue CRO
. Note that by default, embed
removes associated hydrogens for convenience. Therefore, in this case, leaving_names = {"H2", "OXT"}
removes atoms H2
, OXT
as well as the bonded hydrogen, HXT
. The equivalent SMARTS pattern could alternatively do the same task.
cc = copy.deepcopy(CRO_from_cif)
embed_allowed_smarts = "[NX2][CX4][CX3][NX3][CX4][CX3](=O)[OX2]"
cc = cc.make_embedded(allowed_smarts = embed_allowed_smarts, leaving_names = {"H2", "OXT"})
draw_cc_mol(cc.rdkit_mol)
Looking at the structure of the edited picture, we will see that the unnecessary atoms have gone and the hydrogens at the broken (blunt) ends become implicit, which is exactly needed to generate the Smiles string for the chemical template. Function make_pretty_smiles
makes the Smiles string with all Hs explicit for the template’s RDKit molecule. Last but not least, we will determine the link_labels
which specifies how CRO
should be connected to other residues. Here, we will use the pattern from a built-in recipe, AA_recipe.pattern_to_label_mapping_standard
, which also applies to all other standard amino acid residues: {'[NX3h1]': 'N-term', '[CX3h1]': 'C-term'}
. Optionally, we can run a ResidueTemplate_check
to see potential problems with the generated template.
cc = (
cc
.make_pretty_smiles()
.make_link_labels_from_patterns(pattern_to_label_mapping = AA_recipe.pattern_to_label_mapping_standard)
)
cc.ResidueTemplate_check()
export_chem_templates_to_json([cc])
export_chem_templates_to_json
returns a JSON string of the residue template, with the corresponding content printed to console:
******************** New Template Built ********************
{
"ambiguous": {
"CRO": ["CRO"]
},
"residue_templates": {
"CRO": {
"smiles": "[H]NC([H])(C1=NC(=C([H])C2=C([H])C([H])=C(O[H])C([H])=C2[H])C(=O)N1C([H])([H])C=O)C([H])(O[H])C([H])([H])[H]",
"atom_name": ["H", "N1", "CA1", "HA1", "C1", "N2", "CA2", "CB2", "HB2", "CG2", "CD1", "HD1", "CE1", "HE1", "CZ", "OH", "HOH", "CE2", "HE2", "CD2", "HD2", "C2", "O2", "N3", "CA3", "HA31", "HA32", "C3", "O3", "CB1", "HB1", "OG1", "HOG1", "CG1", "HG11", "HG12", "HG13"],
"link_labels": {"1": "N-term", "27": "C-term"}
}
}
}
************************************************************
You may now wonder: What if the residue locates at the C- or N-terminal of the protein? Although this is not common for CRO
, we will go with it for demonstration purposes.
To make the N-terminal embedding variant of CRO
:
# Duplicate and start over from the original chemical component
cc_N = copy.deepcopy(CRO_from_cif)
cc_N = (
cc_N
# Remove atom OXT
.make_embedded(allowed_smarts = embed_allowed_smarts, leaving_names = {"OXT"})
# Cap (protonate) atom N
.make_capped(allowed_smarts = embed_allowed_smarts, capping_names = {"N1"}, protonate = True)
# (Re)generate Smiles with all Hs explicit
.make_pretty_smiles()
# Find linker atoms
.make_link_labels_from_patterns(pattern_to_label_mapping = AA_recipe.pattern_to_label_mapping_standard)
)
cc_N.ResidueTemplate_check()
# In case there are already residue templates with the same parent (original) residue name
cc_N.resname += "_N"
export_chem_templates_to_json([cc_N])
In the chained procedure above, we have removed OXT
and protonated N1
, which is done by make_capped
that adds hydrogen(s) to matching atom(s) with specified capping_names
within the region of allowed_smarts
. The expected output from export_chem_templates_to_json
is:
Atom # 0 (N1) in mol doesn't have implicit Hs -> continue with next atom...
Molecule doesn't contain wanted_smarts: [NX3h1] -> continue with next pattern...
Molecule doesn't contain pattern: [NX3h1] -> linker label for N-term will not be made.
******************** New Template Built ********************
{
"ambiguous": {
"CRO": ["CRO_N"]
},
"residue_templates": {
"CRO": {
"smiles": "[H]OC1=C([H])C([H])=C(C([H])=C2N=C(C([H])(N([H])[H])C([H])(O[H])C([H])([H])[H])N(C([H])([H])C=O)C2=O)C([H])=C1[H]",
"atom_name": ["HOH", "OH", "CZ", "CE1", "HE1", "CD1", "HD1", "CG2", "CB2", "HB2", "CA2", "N2", "C1", "CA1", "HA1", "N1", "H", "H2", "CB1", "HB1", "OG1", "HOG1", "CG1", "HG11", "HG12", "HG13", "N3", "CA3", "HA31", "HA32", "C3", "O3", "C2", "O2", "CD2", "HD2", "CE2", "HE2"],
"link_labels": {"30": "C-term"}
}
}
}
************************************************************
To make the C-terminal embedding variant of CRO
:
# Duplicate and start over from the original chemical component
cc_C = copy.deepcopy(CRO_from_cif)
cc_C = (
cc_C
# Deprotonate the carboxylate group
.make_canonical(acidic_proton_loc = {'[H][O][C](=O)': 0})
# Remove atom H2
.make_embedded(allowed_smarts = embed_allowed_smarts, leaving_names = {"H2"})
# (Re)generate Smiles with all Hs explicit
.make_pretty_smiles()
# Find linker atoms
.make_link_labels_from_patterns(pattern_to_label_mapping = AA_recipe.pattern_to_label_mapping_standard)
)
cc_C.ResidueTemplate_check()
# In case there are already residue templates with the same parent (original) residue name
cc_C.resname += "_C"
export_chem_templates_to_json([cc_C])
In the chained procedure above, we have deprotonated the carboxylate group(s) and removed H2
. The deprotonation is done by make_canonical
that deprotonates all protons specified by acidic_proton_loc
, which includes a SMARTS pattern and the index of the proton. chemtempgen.py
also includes a constant acidic_proton_loc_canonical
, which is potentially useful as a universal protocol to deprotonate the acidic protons to get the canonical protonation state at near physiological pH.
# Constants for deprotonate
acidic_proton_loc_canonical = {
# any carboxylic acid, sulfuric/sulfonic acid/ester, phosphoric/phosphinic acid/ester
'[H][O]['+atom+'](=O)': 0 for atom in ('CX3', 'SX4', 'SX3', 'PX4', 'PX3')
} | {
# any thio carboxylic/sulfuric acid
'[H][O]['+atom+'](=S)': 0 for atom in ('CX3', 'SX4')
} | {
'[H][SX2][a]': 0, # thiophenol
}
The expected output is:
Molecule doesn't contain wanted_smarts: [CX3h1] -> continue with next pattern...
Molecule doesn't contain pattern: [CX3h1] -> linker label for C-term will not be made.
******************** New Template Built ********************
{
"ambiguous": {
"CRO": ["CRO_C"]
},
"residue_templates": {
"CRO_C": {
"smiles": "[H]NC([H])(C1=NC(=C([H])C2=C([H])C([H])=C(O[H])C([H])=C2[H])C(=O)N1C([H])([H])C(=O)[O-])C([H])(O[H])C([H])([H])[H]",
"atom_name": ["H", "N1", "CA1", "HA1", "C1", "N2", "CA2", "CB2", "HB2", "CG2", "CD1", "HD1", "CE1", "HE1", "CZ", "OH", "HOH", "CE2", "HE2", "CD2", "HD2", "C2", "O2", "N3", "CA3", "HA31", "HA32", "C3", "O3", "OXT", "CB1", "HB1", "OG1", "HOG1", "CG1", "HG11", "HG12", "HG13"],
"link_labels": {"1": "N-term"}
}
}
}
************************************************************
If you have generated cc
, cc_N
, and cc_C
, you may write them all into one JSON file:
export_chem_templates_to_json([cc, cc_N, cc_C], json_fname = "CRO_templates.json")
And below is the content of CRO_templates.json
, which can be loaded by mk_prepare_receptor --add_templates CRO_templates.json
during receptor preparation:
{
"ambiguous": {
"CRO": ["CRO", "CRO_N", "CRO_C"]
},
"residue_templates": {
"CRO": {
"smiles": "[H]NC([H])(C1=NC(=C([H])C2=C([H])C([H])=C(O[H])C([H])=C2[H])C(=O)N1C([H])([H])C=O)C([H])(O[H])C([H])([H])[H]",
"atom_name": ["H", "N1", "CA1", "HA1", "C1", "N2", "CA2", "CB2", "HB2", "CG2", "CD1", "HD1", "CE1", "HE1", "CZ", "OH", "HOH", "CE2", "HE2", "CD2", "HD2", "C2", "O2", "N3", "CA3", "HA31", "HA32", "C3", "O3", "CB1", "HB1", "OG1", "HOG1", "CG1", "HG11", "HG12", "HG13"],
"link_labels": {"1": "N-term", "27": "C-term"}
},
"CRO_N": {
"smiles": "[H]OC1=C([H])C([H])=C(C([H])=C2N=C(C([H])(N([H])[H])C([H])(O[H])C([H])([H])[H])N(C([H])([H])C=O)C2=O)C([H])=C1[H]",
"atom_name": ["HOH", "OH", "CZ", "CE1", "HE1", "CD1", "HD1", "CG2", "CB2", "HB2", "CA2", "N2", "C1", "CA1", "HA1", "N1", "H", "H2", "CB1", "HB1", "OG1", "HOG1", "CG1", "HG11", "HG12", "HG13", "N3", "CA3", "HA31", "HA32", "C3", "O3", "C2", "O2", "CD2", "HD2", "CE2", "HE2"],
"link_labels": {"30": "C-term"}
},
"CRO_C": {
"smiles": "[H]NC([H])(C1=NC(=C([H])C2=C([H])C([H])=C(O[H])C([H])=C2[H])C(=O)N1C([H])([H])C(=O)[O-])C([H])(O[H])C([H])([H])[H]",
"atom_name": ["H", "N1", "CA1", "HA1", "C1", "N2", "CA2", "CB2", "HB2", "CG2", "CD1", "HD1", "CE1", "HE1", "CZ", "OH", "HOH", "CE2", "HE2", "CD2", "HD2", "C2", "O2", "N3", "CA3", "HA31", "HA32", "C3", "O3", "OXT", "CB1", "HB1", "OG1", "HOG1", "CG1", "HG11", "HG12", "HG13"],
"link_labels": {"1": "N-term"}
}
}
}