Cheminformatics is about molecular properties and chemistry in general the field of finding chemicals with new properties. Prof. Gasteiger in 2006 gave a lecture at Cologne University where he expressed this view. It stuck around. We keep databases to store those properties, and we develop methods to predict and understand those properties. Prediction is important for one reason: there are too many chemical structures and we cannot experimentally measure the properties for all of them. The number of molecules is often said to be relevant to drug discovery is in the order of 1060. The largest current databases have less than 108 structures. That means that prediction of properties for the vast majority of molecules will remain relevant for the foreseeable future.
This chapter will show how the CDK can be used to calculate a number of molecular properties.
The simplest but perhaps the most reported molecular property is the molecular mass. It is important to realize this mass is not constant, and depends on the natural mixture of isotopes, which is not constant itself. If you have an atom container with explicit hydrogens, you can loop over the atoms to calculate the molecular mass as summation of the masses of the individual atoms:
Script 16.1 code/CalculateMolecularWeight.groovy
molWeight = 0.0
for (atom in molecule.atoms()) {
molWeight += isotopeInfo.getNaturalMass(atom)
}
In this case, you can also use the AtomContainerManipulator
:
Script 16.2 code/CalculateMolecularWeightShort.groovy
molWeight = AtomContainerManipulator
.getNaturalExactMass(molecule)
The element masses are calculated from the accurate isotope masses and natural abundances defined in the Blue Obelisk Data Repository [1].
If your atom container has implicit hydrogens specified, you will have the above code will not be sufficient. Instead, your code should look like:
Script 16.3 code/CalculateMolecularWeightImplicitHydrogens.groovy
molWeight = 0.0
hWeight = isotopeInfo.getNaturalMass(Elements.HYDROGEN)
for (atom in molecule.atoms()) {
molWeight += isotopeInfo.getNaturalMass(atom)
if (atom.getImplicitHydrogenCount() != CDKConstants.UNSET)
molWeight += atom.getImplicitHydrogenCount() *
hWeight
}
The partition coefficient describes how a molecular structure distributes itself over two immiscible solvents. The logarithm of the partition coefficient (LogP) between octanol and water is often used in cheminformatics to describe hydrophobicity [2,3]. Wikipedia gives this equation. This equation assumes that the solute is neutral, which may involve changing the pH of the water.
The CDK has implemented an algorithm based on the XLogP algorithm [4,5]. The
code is available via the descriptor API. It can be used to calculate the LogP for a single
molecule. The implementation expects explicit hydrogens, so you need to add those if not
present yet (see Section 15.4). The calculation returns a DoubleResult
following the descriptor API:
oxazone = MoleculeFactory.makeOxazole();
benzene = MoleculeFactory.makeBenzene();
// add explicit hydrogens ...
descriptor = new XLogPDescriptor()
println "LogP of oxazone: " +
((DoubleResult)descriptor.calculate(oxazone).value)
.doubleValue()
println "LogP of benzene: " +
((DoubleResult)descriptor.calculate(benzene).value)
.doubleValue()
which returns:
LogP of oxazone: -0.14800000000000002
LogP of benzene: 2.082
An alternative is the more recent algorithm by Plante [6].
Another properties that frequently returns in cheminformatics is the Total Polar Surface Area (TPSA). The code in the CDK uses an algorithm published by Ertl in 2000 [7]. Here too, the descriptor API is used, so that the code is quite similar to that for the logP calculation:
oxazone = MoleculeFactory.makeOxazole();
benzene = MoleculeFactory.makeBenzene();
// add explicit hydrogens ...
descriptor = new TPSADescriptor()
println "TPSA of oxazone: " +
((DoubleResult)descriptor.calculate(oxazone).value)
.doubleValue()
println "TPSA of benzene: " +
((DoubleResult)descriptor.calculate(benzene).value)
.doubleValue()
which returns:
TPSA of oxazone: 21.59
TPSA of benzene: 0.0
Quite related to the TPSA discussed in the previous section, is the actual molecular volume
itself. Zhao et al. proposed a simple, additive model to estimate the van der waals volume
of molecules [8], though their method is restricted to molecules with only these elements:
H, C, N, O, F, Cl, Br, I, P, S, As, B, Si, Se, and Te. The additive method is based on averaged
volumes of the bonds, corrected for the number of rings. The bond volumes are specific for
particular element-element combinations, which explains why their approach only works for the
aforementioned list of elements. The full method is implemented by the VABCVolume
class,
which does not use the descriptor API, so that we can simply use the following method:
Script 16.6 code/VABCVolumes.groovy
methane = smilesParser.parseSmiles("C");
println "Methane volume = " +
VABCVolume.calculate(methane);
ethane = smilesParser.parseSmiles("CC");
println "Ethane volume = " +
VABCVolume.calculate(ethane);
This code gives us the volumes of methane and ethane in cubic Ångström:
Methane volume = 25.8524433266667
Ethane volume = 43.14842795253341
I am not fond of the aromaticity concept; first of all, because there is no universal definition. Most cheminformatics toolkits have different definitions of aromaticity, and so does the CDK. If a compound is aromatic, and if so, which atoms and bonds are involved in an aromatic system are not easily defined. Ultimately, it is the delocalization energies that has a large influence on this, which are hard to reproduce with heuristic rules in chemical graph theory-based algorithms.
The CDK now implements various models, all accessible via the Aromaticity
class.
The models are parameterized: what rings are taken into account, and how many electrons do
various atoms contribute. The combination of these two aspect explains most of the differences
in aromaticity as calculated with various cheminformatics libraries. The CDK can calculate
most of them by selecting the right ElectronDonation
and CycleFinder
:
Script 16.7 code/AromaticityDemo.groovy
model = ElectronDonation.daylight();
cycles = Cycles.or(Cycles.all(), Cycles.all(6));
aromaticity = new Aromaticity(model, cycles);
aromatic = aromaticity.apply(mol);
println "benzene is " +
(aromatic ? "" : "not ") + "aromatic."
which tells us that
benzene is aromatic.
Furthermore, if you wish to know which bonds are aromatic, the same class can be used:
Script 16.8 code/AromaticBonds.groovy
aromaticBonds = aromaticity.findBonds(mol)
count = aromaticBonds.size()
println "benzene has " + count + " aromatic bonds."
which reports the aromatic bonds:
benzene has 6 aromatic bonds.