cdkbook

Atom types

Graph theory is nice, but we are, of course, interested in chemistry. While graph theory has its limitations, we can do a lot of interesting things with just the vertex-edge formalism. Particularly, if we combine it with the concept of atom types.

An atom type is a concept to describe certain properties of the atom. For example, force fields use atom types to describe geometrical and interaction properties of the atoms in a molecule. Within such formalism, a sp3 carbon is a carbon with four neighbors organized in a tetrahedral coordination, as depicted in Figure 13.1.


Figure 13.1: 3D structure of methane, showing a sp³ carbon surrounded byfour hydrogens. Image from Wikipedia: File:Methane-CRC-MW-dimensions-2D.png (public domain).

The CDK atom type model

A complete description for the atom types of the following atomic properties is needed by the various algorithms in the CDK:

For example, the carbon in methane, we can list these properties with this code:

Script 12.1 code/CDKAtomTypeProperties.groovy

factory = AtomTypeFactory.getInstance(
  "org/openscience/cdk/dict/data/cdk-atom-types.owl",
  SilentChemObjectBuilder.getInstance()
);
IAtomType type = factory.getAtomType("C.sp3");
println "element       : $type.symbol"
println "formal change : $type.formalCharge"
println "hybridization : $type.hybridization"
println "neighbors     : $type.formalNeighbourCount"
println "lone pairs    : " +
  type.getProperty(CDKConstants.LONE_PAIR_COUNT)
println "pi bonds      : " +
  type.getProperty(CDKConstants.PI_BOND_COUNT)

We will see the carbon has these properties:

element       : C
formal change : 0
hybridization : SP3
neighbors     : 4
lone pairs    : 0
pi bonds      : 0

For a carbon in benzene (C.sp2), it would list:

element       : C
formal change : 0
hybridization : SP2
neighbors     : 3
lone pairs    : 0
pi bonds      : 1

And for the oxygen in hydroxide (C.minus), it would give:

element       : O
formal change : -1
hybridization : SP3
neighbors     : 1
lone pairs    : 3
pi bonds      : 0

A full list of CDK atom types is given in a table in Appendix A.

Hybridization Types

The CDK knows about various hybridization types. Hybridizations are linear combinations of atomic orbitals and typically used to explain the orientation of atoms attached to the central atom. For example, Figure 13.1 showed one possible hybridization, sp3.

The list of supported hybridization types can be listed with:

Script 12.2 code/HybridizationTypes.groovy

IAtomType.Hybridization.each {
  println it
}

listing these types:

S
SP1
SP2
SP3
PLANAR3
SP3D1
SP3D2
SP3D3
SP3D4
SP3D5

Atom type perception

Because so many cheminformatics algorithms depend on atom type information, determining the atom types of the atoms in a molecule is typically a very first step, after a molecule has been created. When the CDK is not able to recognize (perceive) the atom type, then this will most certainly mean that the output of cheminformatics algorithms in undefined. The following two sections will describe how atom types can be perceived. It will also be shown what happens when the atom type cannot be recognized.

Single atoms

Instead of perceiving atom types for all atoms in the molecule, one may also perceive the type of a single atom. The former is more efficient when types need to be perceived for all atoms, but when the molecule only partly changed, it can be worthwhile to only perceive atom types for only the affected atoms:

Script 12.3 code/AtomTypePerception.groovy

molecule = new AtomContainer();
atom = new Atom(Elements.CARBON);
molecule.addAtom(atom);
matcher = CDKAtomTypeMatcher.getInstance(
  DefaultChemObjectBuilder.getInstance()
);
type = matcher.findMatchingAtomType(molecule, atom);
AtomTypeManipulator.configure(atom, type);
println "Atom type: $type.atomTypeName"

This reports the perceived atom type for the carbon:

Atom type: C.sp3

Full molecules

Because atom type perception requires the notion of ring systems, with each atom type being perceived individually, using the above approach ring detection must be done each time the atom type is perceived for each atom. Theoretically, this information can be cached, but there currently is no suitable solution for this in the CDK. Therefore, perceiving atom types for all atoms in a molecule can be done more efficiently with the following code:

Script 12.4 code/AtomTypePerceptionMolecule.groovy

matcher = CDKAtomTypeMatcher.getInstance(
  DefaultChemObjectBuilder.getInstance()
);
types = matcher.findMatchingAtomTypes(molecule);

Configuring the Atom

We saw earlier how the AtomTypeManipulator class was used to configure an atom with the configure(IAtom, IAtomType) method. This class also has a convenience method to perceive and configure all atoms in a molecule with one call:

No atom type perceived?!

What happens when the findMatchingAtomType method does not find a suitable atom type, is that it returns a generic ‘X’ atom type:

Script 12.5 code/AtomTypeX.groovy

molecule = new AtomContainer();
atom = new PseudoAtom("G");
molecule.addAtom(atom);
type = matcher.findMatchingAtomType(molecule, atom);
println "Atom type: $type.atomTypeName"

This code example shows that it does not recognize an atom with the element symbol “G”:

Atom type: X

There are several reasons why atom types cannot be perceived, including:

The CDK library had knowledge about a lot of atoms types (see Appendix ??), but there are still gaps. It might be that the CDK simply does not know about an atom type that is present in your input. This can particularly be expected when using elements other than the typical ‘organic chemistry’ elements like carbon, nitrogen, and oxygen. Sulfur and phosphorus are already tricky, and metals the library only knows about a few of them.

However, another reason why the method can return X, is that the input passed is incorrect. In fact, this is one primary application of the CDK atom type perception: to identify errors in the input. An example erroneous input is the below, uncharged NH4. If it is attempted to perceive an atom type for this nitrogen, then the findMatchingAtomType method will in fact return X, as intended:

Script 12.6 code/UnchargedNitrogenPerception.groovy

molecule = new AtomContainer();
atom = new Atom(Elements.NITROGEN);
molecule.addAtom(atom);
hydrogen = new Atom(Elements.HYDROGEN);
molecule.addAtom(hydrogen);
molecule.addBond(0,1,Order.SINGLE);
hydrogen = new Atom(Elements.HYDROGEN);
molecule.addAtom(hydrogen);
molecule.addBond(0,2,Order.SINGLE);
hydrogen = new Atom(Elements.HYDROGEN);
molecule.addAtom(hydrogen);
molecule.addBond(0,3,Order.SINGLE);
hydrogen = new Atom(Elements.HYDROGEN);
molecule.addAtom(hydrogen);
molecule.addBond(0,4,Order.SINGLE);
type = matcher.findMatchingAtomType(molecule, atom);
println "Atom type: $type.atomTypeName"

This is visible from the output it gives:

Atom type: X

Now, if we know the input indeed has errors like this, we can correct for them programmatically. It is important to realize that your algorithm to do that may make mistakes too, and it is adviced to make the detection of the known errors in the input as explicit and detailed as possible. That may slow down your code a bit, but will greatly reduce the chance of introducing error.

The following code is very general and may easily make mistakes. For each atom for which no atom type was perceived, it increases the charge of the atom and tries to perceive the atom type again. This will certainly address the aforementioned nitrogen problem:

Script 12.7 code/CorrectedNitrogenPerception.groovy

type = matcher.findMatchingAtomType(molecule, atom);
if (type.atomTypeName == "X") {
  // try a positive charge
  charge = atom.getFormalCharge()
  if (charge == null | charge == 0) {
    atom.setFormalCharge(+1)
    type = matcher.findMatchingAtomType(molecule, atom);
  }
}
println "Atom type: $type.atomTypeName"

After this programmatic correction, we now find the proper atom type for the nitrogen:

Atom type: N.plus

Sybyl atom types

The Sybyl atom type list is well-known for its application in then mol2 file format (see the Mol2Format class) and used in force fields [1]. Sybyl atom types can be perceived with the SybylAtomTypeMatcher class, which perceives CDK atom types and then translates this in to Sybyl atom types:

Script 12.8 code/SybylAtomTypePerception.groovy

molecule = new AtomContainer();
atom = new Atom(Elements.CARBON);
molecule.addAtom(atom);
matcher = SybylAtomTypeMatcher.getInstance(
  DefaultChemObjectBuilder.getInstance()
);
type = matcher.findMatchingAtomType(molecule, atom);
AtomTypeManipulator.configure(atom, type);
println "Atom type: $type.atomTypeName"

This will give you the Sybyl atom type for carbon in methane:

Atom type: C.3

A full list of Sybyl atom types is given in a table in Appendix A.2.

References

  1. Clark M, Cramer RD, Van Opdenbosch N. Validation of the general purpose tripos 5.2 force field. J Comput Chem. 1989 Dec;10(8):982–1012. doi:10.1002/JCC.540100804 (Scholia)