cdkbook

Missing Information

Missing information is common place in chemical file formats and line notations. In many cases this information is implicit to the representation, but recovering it is not always easy, requiring assumptions which may not be true. Examples of missing informations is the lack of bonds in XYZ files, and the removed double bond location information for aromatic ring systems.

Element and Isotope information

When reading files the format in one way or another has implicit information you may need for some algorithms. Element and isotope information is a key example. Typically, the element symbol is provided in the file, but not the mass number or isotope implied. You would need to read the format specification what properties are implicitly meant. The idea here is that information about elements and isotopes is pretty standardized by other organizations such as the IUPAC. Such default element and isotope properties are exposed in the CDK by the classes Elements and Isotopes.

Elements

The Elements class provides information about the element’s atomic number, symbol, periodic table group and period, covalent radius and van der Waals radius and Pauling electronegativity:

Script 14.1 code/ElementsDemo.groovy

Elements lithium = Elements.Lithium
println "atomic number: " + lithium.number()
println "symbol: " + lithium.symbol()
println "periodic group: " + lithium.group()
println "periodic period: " + lithium.period()
println "covalent radius: " + lithium.covalentRadius()
println "Vanderwaals radius: " + lithium.vdwRadius()
println "electronegativity: " + lithium.electronegativity()

For example, for lithium this gives:

atomic number: 3
symbol: Li
periodic group: 1
periodic period: 2
covalent radius: 1.34
Vanderwaals radius: 2.2
electronegativity: 0.98

Isotopes

Similarly, there is the Isotopes class to help you look up isotope information. For example, you can get all isotopes for an element or just the major isotope (a full list of isotopes is available from Appendix B:

Script 14.2 code/HydrogenIsotopes.groovy

isofac = Isotopes.getInstance();
isotopes = isofac.getIsotopes("H");
majorIsotope = isofac.getMajorIsotope("H")
for (isotope in isotopes) {
  print "${isotope.massNumber}${isotope.symbol}: " +
    "${isotope.exactMass} ${isotope.naturalAbundance}%"
  if (majorIsotope.massNumber == isotope.massNumber)
    print " (major isotope)"
  println ""
}

For hydrogen this gives:

1H: 1.007825032 99.9885% (major isotope)
2H: 2.014101778 0.0115%
3H: 3.016049278 0.0%
4H: 4.02781 0.0%
5H: 5.03531 0.0%
6H: 6.04494 0.0%
7H: 7.05275 0.0%

This class is also used by the getMajorIsotopeMass method in the MolecularFormulaManipulator class to calculate the monoisotopic mass of a molecule:

Script 14.3 code/MonoisotopicMass.groovy

molFormula = MolecularFormulaManipulator
  .getMolecularFormula(
    "C2H6O",
    SilentChemObjectBuilder.getInstance()
  )
println "Monoisotopic mass: " +
  MolecularFormulaManipulator.getMajorIsotopeMass(
    molFormula
  )

The output for ethanol looks like:

Script 14.3 code/MonoisotopicMass.groovy

molFormula = MolecularFormulaManipulator
  .getMolecularFormula(
    "C2H6O",
    SilentChemObjectBuilder.getInstance()
  )
println "Monoisotopic mass: " +
  MolecularFormulaManipulator.getMajorIsotopeMass(
    molFormula
  )

Reconnecting Atoms

XYZ files do not have bond information, and may look like:

5
methane
C  0.25700 -0.36300  0.00000
H  0.25700  0.72700  0.00000
H  0.77100 -0.72700  0.89000
H  0.77100 -0.72700 -0.89000
H -0.77100 -0.72700  0.00000

Fortunately, we can reasonably assume bonds to have a certain length, and reasonably understand how many connections and atom can have at most. Then, using the 3D coordinate information available from the XYZ file, an algorithm can deduce how the atoms must be bonded. The RebondTool does exactly that. And, it does it efficiently too, using a binary search tree, which allows it to scale to protein-sized molecules.

Now, the algorithm does need to know what reasonable bond lengths are, and for this we can use the Jmol list of covalent radii, and we configure the atoms accordingly:

Script 14.4 code/CovalentRadii.groovy

methane = new AtomContainer();
methane.addAtom(new Atom("C", new Point3d(0.0, 0.0, 0.0)));
methane.addAtom(new Atom("H", new Point3d(0.6, 0.6, 0.6)));
methane.addAtom(new Atom("H", new Point3d(-0.6,-0.6,0.6)));
methane.addAtom(new Atom("H", new Point3d(0.6,-0.6,-0.6)));
methane.addAtom(new Atom("H", new Point3d(-0.6,0.6,-0.6)));
factory = AtomTypeFactory.getInstance(
  "org/openscience/cdk/config/data/jmol_atomtypes.txt", 
  methane.getBuilder()
);
for (IAtom atom : methane.atoms()) {
  factory.configure(atom);
  println "$atom.symbol -> $atom.covalentRadius"
}

which configures and prints the atoms’ radii:

C -> 0.77
H -> 0.32
H -> 0.32
H -> 0.32
H -> 0.32

Then the RebondTool can be used to rebind the atoms:

Script 14.5 code/RebondToolDemo.groovy

RebondTool rebonder = new RebondTool(2.0, 0.5, 0.5);
rebonder.rebond(methane);
println "Bond count: $methane.bondCount"

The number of bonds it found are reported in the last line:

Bond count: 4

Missing Bond Orders

There are several reasons why bond orders are missing from an input structure. For example, you may be reading a XYZ file and just performed a rebonding as outlined in the previous section. Or, you may be reading SMILES strings with aromatic organic subset atoms, such as c1ccccc1. Or, you may be reading a MDL molfile that uses the query bond order 4 to indicate an aromatic bond.

The latter two situations are, in fact, very common in cheminformatics. Before CDK 1.4.11 we had the DeduceBondSystemTool to find the location of double bonds in such delocalized electron bond systems, but in that 1.4.11 release a new tool was released, the FixBondOrdersTool class, that does a better job, and faster too. Both classes only look for double bond positions in rings, but that covers many common use cases.

The method requires atom types to be perceived already, which is already done when reading SMILES, for example for pyrrole:

Script 14.6 code/FixPyrroleBondOrders.groovy

pyrrole = smilesParser.parseSmiles(
  "c2ccc3n([H])c1ccccc1c3(c2)"
)
fbot = new FixBondOrdersTool()
pyrrole = fbot.kekuliseAromaticRings(pyrrole)

This results in the image given in Figure 15.1.


Figure 15.1: 2D diagram of pyrrole.

Missing Hydrogens

The CDKHydrogenAdder class can be used to add missing hydrogens. The algorithm itself adds implicit hydrogens (see Section 4.5), but we will see how these can be converted into explicit hydrogens. The hydrogen adding algorithm expects, however, that CDK atom types are already perceived (see Section 13.2).

Implicit Hydrogens

Hydrogens that are not vertices in the molecular graph are called implicit hydrogens. They are merely a property of the atom to which they are connected. If these values are not given, which is common in for example SMILES, they can be (re)calculated with:

Script 14.7 code/MissingHydrogens.groovy

adder = CDKHydrogenAdder.getInstance(
  DefaultChemObjectBuilder.getInstance()
);
adder.addImplicitHydrogens(molecule);
println "Atom count: $molecule.atomCount"
println "Implicit hydrogens: $newAtom.hydrogenCount"

which reports:

Atom count: 1
Implicit hydrogens: 4

Explicit Hydrogens

These implicit hydrogens can be converted into explicit hydrogens using the following code:

Script 14.8 code/ExplicitHydrogens.groovy

adder.addImplicitHydrogens(molecule);
println "Atom count: $molecule.atomCount"
println " .. adding explicit hydrogens .."
AtomContainerManipulator.convertImplicitToExplicitHydrogens(
  molecule
);
println "Atom count: $molecule.atomCount"

which reports for the running methane example:

Atom count: 1
 .. adding explicit hydrogens ..
Atom count: 5

2D Coordinates

Another bit of information missing from the input is often 2D coordinates. To generate 2D coordinates, the StructureDiagramGenerator can be used:

Script 14.9 code/Layout.groovy

butanol = smilesParser.parseSmiles("CCC(O)C")
sdg = new StructureDiagramGenerator();
sdg.setMolecule(butanol);
sdg.generateCoordinates(new Vector2d(0, 1));
butanol = sdg.getMolecule();
for (atom in butanol.atoms()) {
  println atom.getSymbol() + ": " +
    atom.getPoint2d()
}

which will generate the coordinate starting with an initial direction:

C: (-1.7763568394002505E-15, -1.3322676295501878E-15)
C: (0.0, 1.4999999999999991)
C: (-1.299038105676657, 2.2500000000000018)
O: (-2.5980762113533165, 1.5000000000000049)
C: (-1.299038105676652, 3.750000000000002)

Unknown Molecular Formula

Mass spectrometry (MS) is a technology where the experiment yields monoisotopic masses for molecules. In order to analyze these further, it is common to convert them to molecular formula. The MassToFormulaTool has functionality to determine these missing formulae. Miguel Rojas-Chertó developed this code for use in metabolomics [1]. Basic usage looks like:

Script 14.10 code/MissingMF.groovy

tool = new MassToFormulaTool(
  SilentChemObjectBuilder.getInstance()
)
mfSet = tool.generate(133.0968);
for (mf in mfSet) {
  println MolecularFormulaManipulator.getString(mf)
}

This will create a long list of possible molecular formula. It is important to realize that it looks only at what molecular formula are possible with respect to the corresponding mass. This means that it will include chemically unlikely molecular formulae:

C3H11N5O
C5H13N2O2
C2H15NO5
CH9N8
H13N4O4
C10H13
C9H11N
CH15N3O4
C6H13O3
C2H11N7
C4H11N3O2
C4H13N4O
C2H9N6O
C6H15NO2
CH13N2O5
H7N9
C8H9N2
H15N5O3
C5H11NO3
C3H13N6
C3H9N4O2
C5H15N3O
C2H13O6
CH7N7O
H11N3O5
C9H9O
C7H7N3
C4H9N2O3
C4H15N5
C2H7N5O2
CH11NO6
H5N8O
C8H7NO
C6H5N4
C5H9O4
C3H7N3O3
CH5N6O2

This is overcome by setting restrictions. For example, we can put restrictions on the number of elements we allow in the matched formulae:

Script 14.11 code/MissingMFRestrictions.groovy

MolecularFormulaRange range =
  new MolecularFormulaRange();
range.addIsotope( ifac.getMajorIsotope("C"), 8, 20);
range.addIsotope( ifac.getMajorIsotope("H"), 0, 20);
range.addIsotope( ifac.getMajorIsotope("O"), 0, 1);
range.addIsotope( ifac.getMajorIsotope("N"), 0, 1);
MolecularFormulaGenerator tool =
  new MolecularFormulaGenerator(
    SilentChemObjectBuilder.getInstance(),
    133.0, 133.1, range
  );
IMolecularFormulaSet mfSet = tool.getAllFormulas();

Now the list looks more chemical:

C11H 133.007825032
C9H11N 133.089149352
C9H9O 133.065339908
C8H7NO 133.052763844

Of course, this is a long way from actual chemical structures. An Open Source structure generator has been a long standing holy grail, and the CDK-based MAYGEN addresses this gap [2], though the also open source Surge is a good bit faster [3].

References

  1. Rojas-Cherto M, Kasper PT, Willighagen E, Vreeken RJ, Hankemeier T, Reijmers TH. Elemental composition determination based on MSn. Bioinformatics. 2011 Jul 14;27(17):2376–83. doi:10.1093/BIOINFORMATICS/BTR409 (Scholia)
  2. Yirik MA, Sorokina M, Steinbeck C. MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle. J Cheminform. 2021 Jul 3;13(1). doi:10.1186/S13321-021-00529-9 (Scholia)
  3. McKay BD, Yirik MA, Steinbeck C. Surge: a fast open-source chemical graph generator. J Cheminform. 2022 Apr 23;14(1). doi:10.1186/S13321-022-00604-9 (Scholia)