cdkbook

InChI

The IUPAC International Chemical Identifier (InChI, http://www.iupac.org/inchi/) was developed to provide a database-independent, unique identifier for small organic molecules [1]. The CDK uses the JNI-InChI library by Adams (http://jni-inchi.sf.net/) to provide a Java layer on top of the open source InChI library written in C. The InChI is designed to be unique for molecules: one InChI always identifies the same molecule, and as such it is meant to be used to look up molecules in databases or on the internet [2,3].

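To overcome the common problem that tautomerism causes in database lookup, the InChI applies a number of rules to determine which tautomers are possible for a particular chemical graph. The aim is that a search for one tautomer also finds the others: ideally, ethanal is found in a database when the less stable tautomer ethenol was searched for, because both give rise to the same InChI. Whether a given pair of tautomers is actually merged depends on the InChI rules and options in use; the mobile hydrogen in the formic acid example later in this chapter shows the mechanism at work.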

First, we need to see how we can generate InChIs in the CDK. It starts with an InChIGeneratorFactory, which creates an InChIGenerator. This generator is then used to run the InChI software on the given molecule. The algorithm might fail for various reasons, so we also need to check whether the generation succeeded:

Script 18.1 code/InChIGeneration.groovy

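// Obtain the factory, then run the InChI software on the molecule.
factory = InChIGeneratorFactory.getInstance();
generator = factory.getInChIGenerator(methane);
// Only print the InChI when the generation succeeded.
if (generator.getReturnStatus() == INCHI_RET.OKAY)
  print generator.getInchi()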

which gives the InChI for methane:

InChI=1S/CH4/h1H4

This snippet of code has generated a Standard InChI for us. To explain what a Standard InChI is, we first need to briefly look at the layers in InChIs.

Layers

An InChI is like an onion. No, not in the sense that it makes you cry, but in the sense that it has layers (see Shrek). Each layer adds more detailed information to the InChI of a molecule. The aforementioned InChI for methane has a layer reflecting the molecular formula (/CH4) and a hydrogen layer showing the number of hydrogens for each atom (/h1H4). Except for the molecular formula layer, most layers start with a lower case character, as is visible in the hydrogen layer, which is introduced by /h.

Another important thing to note is that hydrogens are not explicitly defined in the connection table (see Section 4.5). Therefore, the InChI for methane does not have a connectivity layer, but formic acid, mierezuur in Dutch, does (/c2-1-3):

InChI=1S/CH2O2/c2-1-3/h1H,(H,2,3)

You see that the connectivity layer shows how the atoms are connected, but it does not give bond orders. The atom numbering follows the molecular formula, where the hydrogens are not numbered. Therefore, the carbon has atom number 1, while the oxygens are atoms 2 and 3.
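The layer structure described above can be made visible by simply splitting an InChI string on the / character. The following is a minimal sketch in plain Java; the class InchiLayers and its methods are hypothetical helpers written for this illustration and are not part of the CDK or of the InChI software. It assumes a well-formed InChI that has a molecular formula layer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the CDK): splits an InChI string into its layers. */
final class InchiLayers {

    /** Returns the version part, e.g. "1S" for "InChI=1S/CH4/h1H4". */
    static String version(String inchi) {
        return parts(inchi).get(0);
    }

    /** Returns the molecular formula layer, e.g. "CH4". Assumes the layer is present. */
    static String formula(String inchi) {
        return parts(inchi).get(1);
    }

    /** Returns the layers after the formula, each still starting with its lower case letter. */
    static List<String> layers(String inchi) {
        List<String> all = parts(inchi);
        return new ArrayList<>(all.subList(2, all.size()));
    }

    // Strips the "InChI=" prefix and splits the remainder on the layer separator.
    private static List<String> parts(String inchi) {
        if (!inchi.startsWith("InChI=")) {
            throw new IllegalArgumentException("Not an InChI: " + inchi);
        }
        return Arrays.asList(inchi.substring("InChI=".length()).split("/"));
    }
}
```

For the formic acid InChI, layers(...) returns the connectivity layer c2-1-3 followed by the hydrogen layer h1H,(H,2,3), while methane only yields the hydrogen layer h1H4.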

Now, have a careful look at this InChI for formic acid. Take a few minutes for this, and make sure you fully understand the connectivity and hydrogen layers (the answer is given in code snippet 18.2).

Other layers the InChI supports include those for, for example, stereochemistry. The InChI software has a number of options to enable or disable certain layers. This explains the existence of the Standard InChI, which is created when a particular, fixed set of layers is used. That allows the InChI string to be used as a unique identifier: because the choice of layers is removed, one molecule always has the same Standard InChI, whereas a molecule can have multiple InChI strings depending on which layers are turned on or off. However, it is of utmost importance to realize that a particular InChI layer is always unique to the molecule, independent of layers being added or removed.

A Standard InChI string is identified by the 1S version number. If non-standard layers are turned on, the version is simply 1, as we will see shortly.
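Because the version number is part of the string itself, code that receives InChIs from elsewhere can tell the two kinds apart without running the InChI software. A minimal sketch in plain Java follows; InchiVersion is a hypothetical helper for this illustration, not a CDK class.

```java
/** Hypothetical helper (not part of the CDK): inspects the version part of an InChI string. */
final class InchiVersion {

    /** Returns true when the string starts with the Standard InChI prefix "InChI=1S/". */
    static boolean isStandard(String inchi) {
        return inchi.startsWith("InChI=1S/");
    }
}
```

The methane InChI shown earlier is reported as standard, whereas a string starting with InChI=1/ is not.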

Fixed Hydrogens

If you did not cheat in the mierezuur exercise, you will have noticed that one hydrogen is delocalized: it can be attached to either of the oxygens. This feature is picked up by the InChI algorithm to compensate for certain kinds of tautomerism. If we want to fix the hydrogens to a particular atom, we use the following code:

Script 18.2 code/InChIMierezuurFixed.groovy

factory = InChIGeneratorFactory.getInstance();
generator = factory.getInChIGenerator(
  mierezuur, "FixedH"
);
print generator.getInchi()

which results in this non-standard InChI:

InChI=1/CH2O2/c2-1-3/h1H,(H,2,3)/f/h2H

By adding the FixedH option for the InChI algorithm, we added the fixed hydrogen layer (/f/h2H). This additional layer assigns one mobile hydrogen to the second atom, which is the first oxygen.
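To make the relation between the mobile hydrogen group and the fixed hydrogen layer concrete, the string can be taken apart at the /f marker. The sketch below is plain Java; InchiHydrogens is a hypothetical helper for this illustration and not part of the CDK. The regular expression is an assumption that covers groups of the form (H,2,3), (H2,4,5,6) and (H-,1,2), not a full parser for the hydrogen layer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the CDK): looks at mobile and fixed hydrogens in an InChI string. */
final class InchiHydrogens {

    // One mobile hydrogen group: "(H", an optional count, an optional "-", then the atom numbers.
    private static final Pattern MOBILE_GROUP = Pattern.compile("\\(H\\d*-?(?:,\\d+)+\\)");

    /** Returns everything before the fixed hydrogen marker "/f", or the whole string if there is none. */
    static String mainPart(String inchi) {
        int f = inchi.indexOf("/f");
        return f < 0 ? inchi : inchi.substring(0, f);
    }

    /** Returns what follows "/f" without a leading slash, or an empty string if there is no "/f". */
    static String fixedPart(String inchi) {
        int f = inchi.indexOf("/f");
        if (f < 0) return "";
        String rest = inchi.substring(f + 2);
        return rest.startsWith("/") ? rest.substring(1) : rest;
    }

    /** Returns the mobile hydrogen groups found before "/f", e.g. ["(H,2,3)"]. */
    static List<String> mobileGroups(String inchi) {
        List<String> groups = new ArrayList<>();
        Matcher m = MOBILE_GROUP.matcher(mainPart(inchi));
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }
}
```

For the non-standard formic acid InChI above, mobileGroups returns the single group (H,2,3) and fixedPart returns h2H: the hydrogen that could sit on atom 2 or 3 has been assigned to atom 2.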


Figure 19.1: 2D diagram of one of the tautomers of adenine.

Stereoisomerism

Another interesting layer to look at is the stereoisomerism layer, particularly because databases often disagree on the exact stereochemistry of molecules, which is weird but unfortunately commonplace [Williams2012blog]. The two stereoisomers of bromochlorofluoromethane result in two different Standard InChIs:

Script 18.4 code/InChIStereoisomerism.groovy

generator = factory.getInChIGenerator(isomer1)
println generator.inchi
generator = factory.getInChIGenerator(isomer2)
println generator.inchi

The differences are found in the stereochemistry-related layers, /t and /m. The first layer captures tetrahedral stereochemistry, while the second indicates which mirror image is meant. Because we started with two mirror image structures, the /t layer is identical, and we see the difference in the /m layer:

InChI=1S/CHBrClF/c2-1(3)4/h1H/t1-/m0/s1
InChI=1S/CHBrClF/c2-1(3)4/h1H/t1-/m1/s1

Because of the aforementioned database comparison argument, there is an important use case in comparing InChIs without the stereochemistry layers. To create such InChIs, you can use the SNon option:

Script 18.5 code/InChINoStereoisomerism.groovy

generator = factory.getInChIGenerator(
  isomer1, "Snon"
)
println generator.inchi
generator = factory.getInChIGenerator(
  isomer2, "Snon"
)
println generator.inchi

And then the InChIs for both structures are identical:

InChI=1S/CHBrClF/c2-1(3)4/h1H
InChI=1S/CHBrClF/c2-1(3)4/h1H
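When only the InChI strings are at hand, for example when comparing entries from two databases, a similar comparison can be approximated by dropping the stereochemistry layers from the strings. The sketch below is plain Java; InchiStereo is a hypothetical helper for this illustration and not part of the CDK. It removes the /b, /t, /m and /s layers, and it is an assumption that the result matches what the InChI software would produce without stereochemistry, so treat it as a rough filter and not as a replacement for regenerating the InChI.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the CDK): drops stereochemistry layers from an InChI string. */
final class InchiStereo {

    /** Returns the InChI without its /b, /t, /m and /s layers. */
    static String withoutStereo(String inchi) {
        String[] parts = inchi.split("/");
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i];
            // Parts 0 and 1 are the version prefix and the molecular formula; always keep them.
            boolean stereo = i >= 2 && !part.isEmpty() && "btms".indexOf(part.charAt(0)) >= 0;
            if (!stereo) {
                kept.add(part);
            }
        }
        return String.join("/", kept);
    }
}
```

Applied to the two InChIs of the bromochlorofluoromethane stereoisomers, withoutStereo gives the same string for both, so an equality test on the results treats them as the same compound.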

One important caveat: chiral information as read by the SMILES parser is not currently converted into stereo information for the InChI generation process!

References

  1. Stein SE, Heller SR, Tchekhovski D. An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier. Proceedings of the International Chemical Information Conference, 2003, pp 131-143.
  2. Wohlgemuth G, Haldiya PK, Willighagen E, Kind T, Fiehn O. The Chemical Translation Service–a web-based tool to improve standardization of metabolomic reports. Bioinformatics. 2010 Oct 15;26(20):2647–8. doi:10.1093/BIOINFORMATICS/BTQ476
  3. Coles SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y. Enhancement of the chemical semantic web through the use of InChI identifiers. Organic & Biomolecular Chemistry. 2005;3(10):1832. doi:10.1039/B502828K