The CDK has functionality for extracting information from files in many different file formats. Unfortunately, the full format specification is hardly ever supported, but generally the chemical graph and 2D or 3D coordinates are extracted, often complemented with formal or partial charges.

File Format Detection

Typically, a human looking at a file has a fair idea of its format. Very often, the file extension (which is hidden by default on many Microsoft Windows versions) gives a clear clue. Files with the .mol and .sdf extensions are very likely to be in one of the MDL formats. If the file extension is ambiguous, a trained cheminformatician can often help you out quickly, applying tacit knowledge about those formats.

Computer programs, however, cannot rely on such tacit knowledge, and have to formalize rules for inspecting the file content to determine what file format it is. The CDK has such functionality available for recognizing chemical file formats. To ensure that no false detections are made, this requires a fairly accurate method for detecting the chemical format of a file. Appendix D.1 provides a list of all chemical file formats the CDK knows about.
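To make the idea of formalized inspection rules concrete, here is a minimal sketch in plain Java. It is not how the CDK FormatFactory is implemented: the class FormatSniffer, its two rules and the label "MDL Molfile V2000" are hypothetical and only illustrate matching on file content rather than on the file extension.

```java
/** Hypothetical content sniffer; the CDK FormatFactory may work differently. */
final class FormatSniffer {

    /** Guesses a format label from the first lines of the file content. */
    static String guess(String content) {
        String[] lines = content.split("\n", -1);
        for (int i = 0; i < lines.length && i < 10; i++) {
            String line = lines[i];
            // A V2000 molfile ends its counts line (line 4) with a version stamp.
            if (i == 3 && line.trim().endsWith("V2000")) {
                return "MDL Molfile V2000";
            }
            // CML is XML with <molecule> or <cml> elements.
            if (line.contains("<molecule") || line.contains("<cml")) {
                return "Chemical Markup Language";
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guess("<molecule xmlns=''/>"));
        System.out.println(guess("O=CN formamide\nOCC ethanol\n"));
    }
}
```

Rules this loose will produce false detections on real data, which is why an accurate matcher per format is needed.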

Programmatically, the format of a file can be detected using the FormatFactory:

Script 11.1 code/GuessFormat.groovy

import org.openscience.cdk.io.FormatFactory;
import org.openscience.cdk.io.formats.IChemFormat;

Reader stringReader = new StringReader(
  "<molecule xmlns=''/>"
);
FormatFactory factory = new FormatFactory();
IChemFormat format = factory.guessFormat(stringReader);
System.out.println("Format: " + format.getFormatName());

For example, this script recognizes that a file has the Chemical Markup Language [1,2] format:

Format: Chemical Markup Language

To learn whether the CDK has an IChemObjectReader or IChemObjectWriter for a format, one can use the methods getReaderClassName() and getWriterClassName(), respectively:

Script 11.2 code/HasReaderOrWriter.groovy

Reader stringReader = new StringReader(
  "<molecule xmlns=''/>"
);
FormatFactory factory = new FormatFactory();
IChemFormat format = factory.guessFormat(stringReader);
String readerClass = format.getReaderClassName();
String writerClass = format.getWriterClassName();
System.out.println("Reader: " + readerClass);
System.out.println("Writer: " + writerClass);

It reports:


Custom format matchers

The SMILES format is one of the few formats that do not have a matcher. This is because there is no generally accepted file format based on this line notation.

However, we can define a custom matcher ourselves and use that. First, the matcher will look something like:

Script 11.3 code/

public class SMILESFormatMatcher
  extends SMILESFormat
  implements IChemFormatMatcher {
  private static IResourceFormat instance = null;
  private SmilesParser parser = null;
  public SMILESFormatMatcher() {
    // the builder argument is an assumption
    parser = new SmilesParser(
      DefaultChemObjectBuilder.getInstance()
    );
  }
  public static IResourceFormat getInstance() {
    if (instance == null)
      instance = new SMILESFormatMatcher();
    return instance;
  }
  public boolean matches(int lineNumber, String line) {
    if (lineNumber == 1) {
      String[] parts = line.split(" ");
      if (parts.length == 2) {
        String smiles = parts[0];
        String name = parts[1]; // not used here
        try {
          // match only if the first token parses as SMILES
          parser.parseSmiles(smiles);
          return true;
        } catch (InvalidSmilesException exception) {}
      }
    }
    return false;
  }
  public final MatchResult matches(final List<String> lines) {
    if (lines.get(0) != null && matches(1, lines.get(0))) {
      // arguments assumed: matched, the format, the matching line
      return new MatchResult(true, this, 1);
    }
    return new MatchResult(false, null, Integer.MAX_VALUE);
  }
}

If we then register this new matcher with the FormatFactory:

Script 11.4 code/GuessSMILES.groovy

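Reader stringReader = new StringReader(
  "O=CN formamide\n" +
  "OCC ethanol\n"
);
FormatFactory factory = new FormatFactory();
// register the custom matcher before guessing
factory.registerFormat(
  (IChemFormatMatcher) SMILESFormatMatcher.getInstance()
);
IChemFormat format = factory.guessFormat(stringReader);
System.out.println("Format: " + format.getFormatName());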

And with this, we can detect a file with SMILES strings and names:

Format: SMILES

Keep in mind that the more specific your custom matcher is, the lower the chance that files in other formats are accidentally recognized by it.


Reading from Readers and InputStreams

Many input readers in the CDK allow reading from a Java Reader class, but all are required to also read from an InputStream. The difference between these two Java classes is that a Reader is based on a character stream, while an InputStream is based on a byte stream. For some readers this difference is crucial: XML-based formats, such as CML and the XML formats used by PubChem, should be read from an InputStream, not a Reader.
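One reason the difference matters for XML is character encoding: an XML parser can read the encoding declaration from the raw bytes, whereas a Reader delivers characters that were already decoded by someone else. The following sketch uses only the Java standard library; the class XmlEncodingDemo and the tiny XML document are made up for illustration and do not involve the CDK readers.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class XmlEncodingDemo {

    // An XML document that declares ISO-8859-1 and contains a non-ASCII character.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";

    /** The file content as it would sit on disk. */
    static byte[] fileBytes() {
        return XML.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Byte stream: the XML parser reads the encoding declaration itself. */
    static String parseFromInputStream() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(fileBytes()))
                .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Character stream: the bytes were decoded up front, here with the wrong charset. */
    static String parseFromReader() {
        try {
            String decoded = new String(fileBytes(), StandardCharsets.UTF_8);
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(decoded)))
                .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseFromInputStream());
        System.out.println(parseFromReader());
    }
}
```

The first call recovers the original text, while the second returns a damaged string because the accented character was lost before the parser saw it.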

For other formats, it does not matter. This makes it easy, for example, to read a file from a string with a StringReader (mind the newlines indicated by \n):

Script 11.5 code/InputFromStringReader.groovy

String bf3 = "4\n" +
"Bortrifluorid\n" +
"B    0.0000    0.0000    0.0000\n" +
"F    1.0000    0.0000    0.0000\n" +
"F   -0.5000   -0.8660    0.0000\n" +
"F   -0.5000    0.8660    0.0000\n";
reader = new XYZReader(
  new StringReader(bf3)
)
chemfile = reader.read(new ChemFile())
// take the first (and only) structure from the file
mol = ChemFileManipulator.getAllAtomContainers(chemfile).get(0)
println "Atom count: $mol.atomCount"

But besides reading XML files correctly, the support for InputStream also allows reading files directly from the internet and from gzipped files (see Section 12.4).

Example: Downloading Domoic Acid from PubChem

As an example, the small script below takes a PubChem compound identifier (CID), downloads the corresponding ASN.1 XML file, parses it, and counts the number of atoms:

Script 11.6 code/PubChemDownload.groovy

cid = 5282253
reader = new PCCompoundXMLReader(
  new URL(
    "" + // the PubChem download URL prefix is not given here
    cid
  ).newInputStream()
)
mol = reader.read(new AtomContainer())
println "CID: " + mol.getProperty("PubChem CID")
println "Atom count: $mol.atomCount"

It reports:

CID: 5282253
Atom count: 43

PubChem ASN.1 files come with an extensive list of molecular properties. These are stored as properties on the molecule object and can be retrieved using the getProperties() method, or, using the Groovy bean formalism:

Script 11.7 code/PubChemDownloadProperties.groovy

mol.properties.each {
  line = "" + it
  println line
}

which lists the properties for the earlier downloaded domoic acid:

PubChem CID=5282253
Compound Complexity=510
Fingerprint (SubStructure Keys)=00000371E0723800...
IUPAC Name (Allowed)=(2S,3S,4S)-3-(carboxymethyl...
  enyl]pyrrolidine-2-carboxylic acid
IUPAC Name (CAS-like Style)=(2S,3S,4S)-4-[(2Z,4E...
  ethyl)-2-pyrrolidinecarboxylic acid
IUPAC Name (Markup)=(2<I>S</I>,3<I>S</I>,4<I>S</...
  idine-2-carboxylic acid
IUPAC Name (Preferred)=(2S,3S,4S)-4-[(2Z,4E,6R)-...
  )pyrrolidine-2-carboxylic acid
IUPAC Name (Systematic)=(2S,3S,4S)-3-(2-hydroxy-...
  e-2-carboxylic acid
IUPAC Name (Traditional)=(2S,3S,4S)-3-(carboxyme...
InChI (Standard)=InChI=1S/C15H21NO6/c1-8(4-3-5-9...
Log P (XLogP3-AA)=-1.3
Mass (Exact)=311.13688739
Molecular Formula=C15H21NO6
Molecular Weight=311.33
SMILES (Canonical)=CC(C=CC=C(C)C1CNC(C1CC(=O)O)C...
SMILES (Isomeric)=C[C@H](/C=C/C=C(/C)\...
Topological (Polar Surface Area)=124
Weight (MonoIsotopic)=311.13688739

Input Validation

The history of the CDK project has seen many bug reports about problems that in fact turned out to be problems in the input file. The general perception seems to be that because a file could be written, its content must be consistent.

However, this is a strong misconception. There are several problems found in chemical files in the wild. A first common problem is that the file does not conform to the syntax of the specification. For example, something else is given at a place where a number is expected; not uncommonly, this is caused by incorrect use of whitespace.

A second problem is that the file looks perfectly reasonable, but that the software that wrote the file used conventions and extensions that are not supported by the reading software. A common example is the use of the D and T symbols, for deuterium and tritium in MDL molfiles, where the specification does not allow that.

A third problem is that most chemical file formats do not disallow incorrect chemical graphs. For example, formats often allow an atom to be bonded to itself, which will cause problems when analyzing this graph. These problems are much rarer, though.
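Syntax problems and incorrect graphs of this kind can be checked mechanically. The sketch below, in plain Java, assumes the fixed-width layout of a V2000 bond line, with three-character fields for the two atom indices and the bond type. The class BondLineCheck and its messages are hypothetical and not part of the CDK.

```java
/** Hypothetical validator for a single V2000 bond line; not CDK code. */
final class BondLineCheck {

    /** Returns a problem description, or null if the line looks fine. */
    static String check(String bondLine) {
        if (bondLine.length() < 9) {
            return "bond line too short";
        }
        int[] fields = new int[3];
        for (int i = 0; i < 3; i++) {
            // Each field occupies exactly three columns.
            String field = bondLine.substring(3 * i, 3 * i + 3).trim();
            try {
                fields[i] = Integer.parseInt(field);
            } catch (NumberFormatException e) {
                return "not a number in columns " + (3 * i + 1) + "-"
                    + (3 * i + 3) + ": '" + field + "'";
            }
        }
        if (fields[0] == fields[1]) {
            return "atom " + fields[0] + " is bonded to itself";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(check("  1  2  1  0  0  0  0"));
        System.out.println(check("  1  1  1  0  0  0  0"));
        System.out.println(check("1\t2\t1\t0\t0\t0\t0"));
    }
}
```

The last call shows the whitespace problem: a tab-separated line looks fine to a human, but the fixed-width fields no longer contain numbers.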

Reading modes

The IChemObjectReader has a feature that allows setting a validating mode, which has two values:

Script 11.8 code/ReadingModes.groovy

IChemObjectReader.Mode.each {
  println it
}



The STRICT mode follows the exact format specification. The RELAXED mode allows for a few common extensions, such as support for the T and D element types. For example, let’s consider this file:


  3  2  0  0  0  0  0  0  0  0999 V2000
    2.5369   -0.1550    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.0739    0.1550    0.0000 D   1  0  0  0  0  0  0  0  0  0  0  0
    2.0000    0.1550    0.0000 T   1  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
M  ISO   2   2   2   3   3

If we read this file with:

Script 11.9 code/ReadStrict.groovy

reader = new MDLV2000Reader(
  new File("data/t.mol").newReader(),
  IChemObjectReader.Mode.STRICT
);
water = reader.read(new AtomContainer());
println "atom count: $water.atomCount"

we get this exception:

invalid symbol: D

However, if we read the file in RELAXED mode with this code:

Script 11.10 code/ReadRelaxed.groovy

reader = new MDLV2000Reader(
  new File("data/t.mol").newReader(),
  IChemObjectReader.Mode.RELAXED
);
water = reader.read(new AtomContainer());
println "atom count: $water.atomCount"

the file is read as desired:

atom count: 3


When a file is being read in RELAXED mode, it is possible to get error messages. This functionality is provided by the IChemObjectReaderErrorHandler support in IChemObjectReader. For example, we can define this custom error handler:

Script 11.11 code/CustomErrorHandler.groovy

class ErrorHandler
implements IChemObjectReaderErrorHandler {
  public void handleError(String message) {
    println message;
  }
  public void handleError(String message,
    Exception exception) {
    // printing the exception message is an assumption
    println message + "\n  -> " +
            exception.getMessage();
  }
  public void handleError(String message,
    int row, int colStart, int colEnd) {
    print "location: " + row + ", " +
          colStart + "-" + colEnd + ": ";
    println message;
  }
  public void handleError(String message,
    int row, int colStart, int colEnd,
    Exception exception) {
    print "location: " + row + ", " +
          colStart + "-" + colEnd + ": "
    println message + "\n  -> " +
            exception.getMessage()
  }
}

and use that when reading a file:

Script 11.12 code/ReadErrorHandler.groovy

reader = new MDLV2000Reader(
  new File("data/t.mol").newReader(),
  IChemObjectReader.Mode.RELAXED
);
reader.setErrorHandler(new ErrorHandler());
water = reader.read(new AtomContainer());

we get these warnings via the handler interface:

location: 6, 31-33: invalid symbol: D
location: 7, 31-33: invalid symbol: T

Because of an issue in CDK versions before 2.3, the above does not show any warnings there. This has been fixed in CDK 2.3, see commit 547b028e17656f54a080a885a166377320b3a8ad.
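The interplay between the reading mode and the error handler can be summarized in a small sketch in plain Java. It only mimics the behaviour described here: the class SymbolCheck, its Mode enum and the mapping of D and T to H are assumptions for illustration, not the CDK implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical model of strict versus relaxed symbol handling; not CDK code. */
final class SymbolCheck {

    enum Mode { RELAXED, STRICT }

    /**
     * Returns the element symbol to use. D and T are rejected in STRICT mode;
     * in RELAXED mode they are reported to the handler and read as hydrogen.
     */
    static String normalize(String symbol, Mode mode, Consumer<String> errorHandler) {
        if (symbol.equals("D") || symbol.equals("T")) {
            String message = "invalid symbol: " + symbol;
            if (mode == Mode.STRICT) {
                throw new IllegalArgumentException(message);
            }
            errorHandler.accept(message);
            return "H";
        }
        return symbol;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        for (String symbol : new String[] {"O", "D", "T"}) {
            System.out.println(normalize(symbol, Mode.RELAXED, warnings::add));
        }
        System.out.println(warnings);
    }
}
```

In relaxed mode the caller gets a usable result plus a list of warnings; in strict mode the first deviation from the specification stops the reading.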

Gzipped files

Some remote databases gzip their data files to reduce the download size. The Protein Data Bank (PDB) is such a database. Fortunately, Java has a simple API to work with gzipped files, using the GZIPInputStream:

Script 11.13 code/PDBCoordinateExtraction.groovy

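reader = new PDBReader(
  new GZIPInputStream(
    new URL(
      pdbUrl // assumed variable: URL of the gzipped PDB entry, not given here
    ).openStream()
  )
);
crambin = reader.read(new ChemFile());
for (container in
     ChemFileManipulator.getAllAtomContainers(crambin)) {
  for (atom in container.atoms()) {
    println atom.point3d;
  }
}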
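The wrapping of one stream in another can be tried without any network access. This sketch uses only the Java standard library: it compresses a small XYZ file in memory and reads it back through a GZIPInputStream. The class GzipRoundTrip is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipRoundTrip {

    /** Compresses text into gzipped bytes, as a remote server would. */
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts the lines in a gzipped byte stream by unzipping on the fly. */
    static int countLines(InputStream gzipped) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xyz = "4\nBortrifluorid\n"
            + "B    0.0000    0.0000    0.0000\n"
            + "F    1.0000    0.0000    0.0000\n"
            + "F   -0.5000   -0.8660    0.0000\n"
            + "F   -0.5000    0.8660    0.0000\n";
        System.out.println(countLines(new ByteArrayInputStream(gzip(xyz))));
    }
}
```

Any reader that accepts an InputStream can be handed the GZIPInputStream directly, without knowing that the underlying data is compressed.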

Iterating Readers

By default, the CDK readers read structures into memory. This is fine for a relatively small model, but it no longer works for large files, such as 1 GB MDL SD files [3]. To allow processing of such large files, the CDK can take advantage of the fact that these SD files are basically a concatenation of MDL molfiles. Therefore, one can use an iterating reader to process the molecules one by one.
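The principle behind an iterating reader can be sketched in plain Java: read lines until the record terminator, hand out that record, and only then continue reading. The sketch assumes the `$$$$` line that ends each entry of an SD file; the class SdfRecordIterator is hypothetical and returns raw text, whereas the CDK iterating readers return parsed structures.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical record iterator: holds one SD file entry in memory at a time. */
final class SdfRecordIterator implements Iterator<String> {

    private final BufferedReader reader;
    private String next;

    SdfRecordIterator(Reader input) {
        this.reader = new BufferedReader(input);
        this.next = readRecord();
    }

    /** Reads up to the next $$$$ line; returns null when the input is exhausted. */
    private String readRecord() {
        try {
            StringBuilder record = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("$$$$")) {
                    return record.toString();
                }
                record.append(line).append('\n');
            }
            // Trailing content without a terminator still counts as a record.
            return record.length() > 0 ? record.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        next = readRecord();
        return current;
    }

    public static void main(String[] args) {
        Iterator<String> records = new SdfRecordIterator(
            new StringReader("entry one\n$$$$\nentry two\n$$$$\n"));
        while (records.hasNext()) {
            System.out.print(records.next());
        }
    }
}
```

Memory use is bounded by the largest single entry instead of by the size of the whole file.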

MDL SD files

MDL SD files can be processed using the IteratingSDFReader, for example, to calculate the molecular formula of each structure:

Script 11.14 code/IteratingSDFReaderDemo.groovy

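iterator = new IteratingSDFReader(
  new File("data/test6.sdf").newReader(),
  DefaultChemObjectBuilder.getInstance() // assumed builder
)
while (iterator.hasNext()) {
  IAtomContainer mol = iterator.next()
  formula = MolecularFormulaManipulator.getMolecularFormula(mol)
  println MolecularFormulaManipulator.getString(formula)
}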

Which outputs the molecular formula for the three entries in the file:


PubChem Compounds XML files

Similarly, PubChem Compounds XML files can be processed taking advantage of an XML pull library, which is nicely hidden behind the same iterator interface as used for parsing MDL SD files. Iterating over a set of compounds is fairly straightforward with the IteratingPCCompoundXMLReader class:

Script 11.15 code/PubChemCompoundsXMLDemo.groovy

iterator = new IteratingPCCompoundXMLReader(
  new File("data/aceticAcids38.xml").newReader(),
  DefaultChemObjectBuilder.getInstance() // assumed builder
)
while (iterator.hasNext()) {
  IAtomContainer mol = iterator.next()
  formula = MolecularFormulaManipulator.getMolecularFormula(mol)
  println MolecularFormulaManipulator.getString(formula)
}

Which outputs the molecular formula for the three entries in the aceticAcids38.xml file:


Customizing the Output

An interesting feature of file IO in the CDK is that it is customizable. Before I give all the details, let’s start with a simple example: creating a Gaussian input file for optimizing the structure of methane, starting from an XYZ file, that is, from

C  0.25700 -0.36300  0.00000
H  0.25700  0.72700  0.00000
H  0.77100 -0.72700  0.89000
H  0.77100 -0.72700 -0.89000
H -0.77100 -0.72700  0.00000

The output will look something like:

# b3lyp/6-31g opt

Job started on Linux cluster on 20041010.

0 1
C 0 0.257 -0.363 0.0
H 0 0.257 0.727 0.0
H 0 0.771 -0.727 0.89
H 0 0.771 -0.727 -0.89
H 0 -0.771 -0.727 0.0

The writer used the default IO options in the above example. So, the next step is to see which options the writer allows. To get a list of options for a certain IO class, one does something along these lines:

Script 11.16 code/ListIOOptions.groovy

IChemObjectWriter writer = new GaussianInputWriter();
for (IOSetting setting : writer.getIOSettings()) {
  println "[" + setting.getName() + "]"
  println "Option: " + setting.getQuestion()
  println "Current value: " + setting.getSetting()
}

which results in the following output:

Option: Should the calculation be open shell?
Current value: false
Option: What comment should be put in the file?
Current value: Created with CDK (http://cdk.sf.n...
Option: How much memory do you want to use?
Current value: unset
Option: What kind of job do you want to perform?
Current value: energy calculation
Option: How many processors should be used by Ga...
Current value: 1

Setting Properties

The IO settings system allows interactive setting of these options, but a perfectly fine alternative is to use a Java Properties object.

Consider the following source code:

Script 11.17 code/PropertiesSettings.groovy

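// the custom settings
Properties customSettings = new Properties();
customSettings.setProperty("Basis",   "6-31g*");
// the "Command" and "Comment" setting names are assumptions
customSettings.setProperty("Command",
  "geometry optimization");
customSettings.setProperty("Comment",
  "Job started on Linux cluster on 20041010.");
customSettings.setProperty("ProcessorCount", "5");
PropertiesListener listener = new PropertiesListener(
  customSettings
);
// create the writer
GaussianInputWriter writer = new GaussianInputWriter(
  new FileWriter(new File("methane.gin"))
);
writer.addChemObjectIOListener(listener);
XYZReader reader = new XYZReader(
  new FileReader(new File("data/"))
);
// convert the file
ChemFile content = (ChemFile) reader.read(new ChemFile());
IAtomContainer molecule = content.getChemSequence(0).
  getChemModel(0).getMoleculeSet().getAtomContainer(0);
writer.write(molecule);
writer.close();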

The PropertiesListener takes a Properties object as parameter in its constructor. Therefore, the properties are defined first, in the customSettings variable in the first few lines. The PropertiesListener listener is then instantiated with the customizations as constructor parameter.

The output writer, specified to write to the methane.gin file, is created, after which the ChemObjectIOListener is set. Only when this listener is set will the output be customized with the earlier defined properties. The rest of the code reads a molecule from an XYZ file and writes the content to the created Gaussian input file.
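The underlying idea, answering each setting from a Properties object and falling back to the default otherwise, can be sketched with the Java standard library alone. The class SettingsLookup is hypothetical and not how PropertiesListener is implemented; "OpenShell" is an assumed setting name, and the default values are taken from the option listing shown earlier.

```java
import java.util.Properties;

/** Hypothetical lookup of IO settings with defaults; not CDK code. */
final class SettingsLookup {

    /** Returns the custom value for a setting, or the default if none was given. */
    static String resolve(Properties custom, String name, String defaultValue) {
        return custom.getProperty(name, defaultValue);
    }

    public static void main(String[] args) {
        Properties custom = new Properties();
        custom.setProperty("ProcessorCount", "5");
        // A customized setting overrides the default of 1.
        System.out.println(resolve(custom, "ProcessorCount", "1"));
        // A setting that was not customized keeps its default.
        System.out.println(resolve(custom, "OpenShell", "false"));
    }
}
```

Because Properties objects can be loaded from a file, this approach also allows settings to be kept outside the source code.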

Example: creating unit tests for atom type perception

We saw earlier an example for reading files directly from PubChem (see Section 12.2.1). This can be conveniently used to create CDK source code, for example, for use in unit tests for the atom type perception code (see Section 13.2). But because we do not want 2D and 3D coordinates being set in the source code, we disable those options:

Script 11.18 code/AtomTypeUnitTest.groovy

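cid = 3396560
// reader: a PCCompoundXMLReader for this CID, created as in Script 11.6
mol = reader.read(new AtomContainer())
stringWriter = new StringWriter();
CDKSourceCodeWriter writer =
  new CDKSourceCodeWriter(stringWriter);
customSettings = new Properties();
customSettings.setProperty("write2DCoordinates", "false");
customSettings.setProperty("write3DCoordinates", "false");
writer.addChemObjectIOListener(
  new PropertiesListener(
    customSettings
  )
);
writer.write(mol);
writer.close();
println stringWriter.toString();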

This results in this source code:

  IChemObjectBuilder builder = DefaultChemObject...
  IAtomContainer mol = builder.newInstance(IAtom...
  IAtom a1 = builder.newInstance(IAtom.class,"P");
  IAtom a2 = builder.newInstance(IAtom.class,"O");
  IAtom a3 = builder.newInstance(IAtom.class,"O");
  IAtom a4 = builder.newInstance(IAtom.class,"C");
  IAtom a5 = builder.newInstance(IAtom.class,"H");
  IAtom a6 = builder.newInstance(IAtom.class,"H");
  IAtom a7 = builder.newInstance(IAtom.class,"H");
  IAtom a8 = builder.newInstance(IAtom.class,"H");
  IAtom a9 = builder.newInstance(IAtom.class,"H");
  IBond b1 = builder.newInstance(IBond.class,a1,...
   a2, IBond.Order.SINGLE);
  IBond b2 = builder.newInstance(IBond.class,a1,...
   a3, IBond.Order.DOUBLE);
  IBond b3 = builder.newInstance(IBond.class,a1,...
   a4, IBond.Order.SINGLE);
  IBond b4 = builder.newInstance(IBond.class,a1,...
   a5, IBond.Order.SINGLE);
  IBond b5 = builder.newInstance(IBond.class,a2,...
   a9, IBond.Order.SINGLE);
  IBond b6 = builder.newInstance(IBond.class,a4,...
   a6, IBond.Order.SINGLE);
  IBond b7 = builder.newInstance(IBond.class,a4,...
   a7, IBond.Order.SINGLE);
  IBond b8 = builder.newInstance(IBond.class,a4,...
   a8, IBond.Order.SINGLE);

Line Notations

Another common input mechanism in cheminformatics is the line notation. Several line notations have been proposed, including the Wiswesser Line Notation (WLN) [4] and the Sybyl Line Notation (SLN) [5], but the most popular is SMILES [6]. There is an open standard around this format called OpenSMILES.


The CDK can both read and write SMILES, or at least a significant subset of the line notation. You can parse a SMILES into an IAtomContainer with the SmilesParser. The constructor of the parser takes an IChemObjectBuilder (see Section 11) because it needs to know which CDK interface implementation it must use to create classes. This example uses the DefaultChemObjectBuilder:

Script 11.19 code/ReadSMILES.groovy

sp = new SmilesParser(
  DefaultChemObjectBuilder.getInstance()
)
mol = sp.parseSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
println "Aspirin has ${mol.atomCount} atoms."

Telling us the number of (non-hydrogen) atoms in aspirin:

Aspirin has 13 atoms.

Writing SMILES works in a similar way. I would like to point out, though, that by default the SmilesGenerator does not use the convention of lower case element symbols for aromatic atoms.

Script 11.20 code/WriteSMILES.groovy

mol = MoleculeFactory.makePhenylAmine()
generator = SmilesGenerator.generic()
smiles = generator.createSMILES(mol)
println "Ph-NH2 -> $smiles"
generator = SmilesGenerator.generic().aromatic()
smiles = generator.createSMILES(mol)
println "Ph-NH2 -> $smiles"

showing the different output without and with that option set:

Ph-NH2 -> C1(=CC=CC=C1)N
Ph-NH2 -> c1(ccccc1)N

The generic format does not output stereo information. For isomeric SMILES we need to use a different approach:

Script code/WriteIsomericSMILES.groovy

smiles = "F[C@@H](Cl)(Br)"
mol = smilesParser.parseSmiles(smiles)
generator = SmilesGenerator.generic()
smiles = generator.createSMILES(mol)
println "Generic SMILES: $smiles"
generator = SmilesGenerator.isomeric()
smiles = generator.createSMILES(mol)
println "Isomeric SMILES: $smiles"

showing the difference in output between .generic() and .isomeric():

Generic SMILES: FC(Cl)Br
Isomeric SMILES: F[C@@H](Cl)Br

Of course, this does require that aromaticity has been perceived, as explained in Section 17.5.


This section lists, for a few formats, a recipe for reading content from those formats, taking into account common issues with the input.

MDL molfile (V2000)

Like any file format, MDL molfiles support a limited number of features. For example, they cannot represent a quadruple bond: bond type 4 in a molfile denotes an aromatic bond instead. Other details that are not stored explicitly include implicit hydrogens and atom-based stereochemistry. Stereochemistry is wedge-bond-based, see Section ??.

An example file which uses the bond order 4, is this file:


 10 11  0  0  0  0  0  0  0  0999 V2000
  208.0000  866.5142    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  175.5651  882.1340    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  167.5544  917.2314    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  190.0000  945.3774    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  226.0000  945.3774    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  248.4456  917.2314    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  240.4349  882.1340    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  271.3391  863.6697    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  298.4496  887.3555    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  284.3007  920.4585    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  4  0  0  0  0 
  2  3  4  0  0  0  0 
  3  4  4  0  0  0  0 
  4  5  4  0  0  0  0 
  5  6  4  0  0  0  0 
  6  7  4  0  0  0  0 
  7  1  4  0  0  0  0 
  7  8  4  0  0  0  0 
  8  9  4  0  0  0  0 
  9 10  4  0  0  0  0 
 10  6  4  0  0  0  0 

More recent MDL formats have become more powerful. The V3000 format can do much more than the V2000 format, or even the pre-V2000 format.

Here’s a recipe with inline comments:

Script 11.21 code/InputMDLMolfiles.groovy

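reader = new MDLV2000Reader(
  new File("data/azulene4.mol").newReader(),
  IChemObjectReader.Mode.RELAXED
);
azulene = reader.read(new AtomContainer());
// the three CDK calls below are assumptions for the steps named in the comments
// perceive atom types
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(azulene);
// add missing hydrogens
CDKHydrogenAdder.getInstance(azulene.getBuilder()).
  addImplicitHydrogens(azulene);
// if bond order 4 was present,
// deduce bond orders
Kekulization.kekulize(azulene);
println "Atom count: " + azulene.atomCount
doubleBondCount = 0
singleBondCount = 0
for (bond in azulene.bonds()) {
  if (bond.order == Order.DOUBLE)
    doubleBondCount++
  if (bond.order == Order.SINGLE)
    singleBondCount++
}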
println "Single bonds: " + singleBondCount
println "Double bonds: " + doubleBondCount

This code will perceive CDK atom types. These types are needed to add the missing hydrogens, as well as to resolve the bond order information. The input has ten atoms and eleven bonds, all marked with bond order 4.

The result of the above post-processing is:

Atom count: 10
Single bonds: 6
Double bonds: 5


  1. Murray-Rust P, Rzepa HS. Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. JCICS. 1999 Nov;39(6):928–42. doi:10.1021/CI990052B
  2. Willighagen E. Processing CML conventions in Java. Internet Journal of Chemistry. 2001 Feb 12;4:4. doi:10.5281/ZENODO.1495470
  3. Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, et al. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. JCICS. 1992 May 1;32(3):244–55. doi:10.1021/CI00007A012
  4. Wiswesser WJ. How the WLN began in 1949 and how it might be in 1999. JCICS. 1982 May 1;22(2):88–93. doi:10.1021/CI00034A005
  5. Homer RW, Swanson J, Jilek RJ, Hurst T, Clark RD. SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries. JCIM. 2008 Dec 1;48(12):2294–307. doi:10.1021/CI7004687
  6. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. JCICS. 1988 Feb 1;28(1):31–6. doi:10.1021/CI00057A005