The CDK has functionality for extracting information from files in many different file formats. Unfortunately, the full format specification is hardly ever supported, but the chemical graph and 2D or 3D coordinates are generally extracted, often complemented with formal or partial charges.
Typically, a person looking at a file has a fair idea of its format. Very often, the file extension (which is hidden by default on many Microsoft Windows versions) gives a clear clue. Files with the .mol and .sdf extensions are very likely to be in one of the MDL formats. If the file extension is ambiguous, a trained cheminformatician can often help you out quickly, applying tacit knowledge about those formats.
Computer programs, however, cannot rely on such tacit knowledge, and have to formalize rules for inspecting the file content to determine its format. The CDK has such functionality for recognizing chemical file formats. To ensure that no false detections are made, however, it requires a fairly accurate detection method for each format. Appendix D.1 provides a list of all chemical file formats the CDK knows about.
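To make the idea of formalized content rules concrete, here is a minimal JDK-only sketch. It is not how the CDK detects formats; the class FormatSniffer and its two rules are illustrative assumptions, based only on the CML namespace used in this chapter and on the V2000 tag that ends the counts line (the fourth line) of an MDL molfile.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the CDK: guesses a format from file content.
class FormatSniffer {

    static String guess(List<String> lines) {
        // Rule 1: the CML namespace appears somewhere in the inspected lines.
        for (String line : lines) {
            if (line.contains("http://www.xml-cml.org/schema")) {
                return "CML";
            }
        }
        // Rule 2: a V2000 molfile ends its fourth line with the version tag.
        if (lines.size() >= 4 && lines.get(3).trim().endsWith("V2000")) {
            return "MDL V2000 molfile";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        List<String> cml = Arrays.asList(
            "<molecule xmlns='http://www.xml-cml.org/schema'/>");
        System.out.println("Format: " + guess(cml));
    }
}
```

Each rule is deliberately narrow: the more specific a rule is, the less likely it is to claim a file that belongs to another format.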
Programmatically, the format of a file can be detected using the FormatFactory:
Script 11.1 code/GuessFormat.groovy
Reader stringReader = new StringReader(
"<molecule xmlns='http://www.xml-cml.org/schema'/>"
);
FormatFactory factory = new FormatFactory();
IChemFormat format = factory.guessFormat(stringReader);
System.out.println("Format: " + format.getFormatName());
For example, this script recognizes that a file has the Chemical Markup Language [1,2] format:
Format: Chemical Markup Language
To learn if the CDK has an IChemObjectReader or IChemObjectWriter for a format, one can use the methods getReaderClassName() and getWriterClassName(), respectively:
Script 11.2 code/HasReaderOrWriter.groovy
Reader stringReader = new StringReader(
"<molecule xmlns='http://www.xml-cml.org/schema'/>"
);
FormatFactory factory = new FormatFactory();
IChemFormat format = factory.guessFormat(stringReader);
String readerClass = format.getReaderClassName();
String writerClass = format.getWriterClassName();
System.out.println("Reader: " + readerClass);
System.out.println("Writer: " + writerClass);
It reports:
Reader: org.openscience.cdk.io.CMLReader
Writer: org.openscience.cdk.io.CMLWriter
The SMILES format is one of the few formats which does not have a matcher. This is because there is no generally accepted file format based on this line notation.
However, we can define a custom matcher ourselves and use that. First, the matcher will look something like:
Script 11.3 code/SMILESFormatMatcher.java
public class SMILESFormatMatcher
extends SMILESFormat
implements IChemFormatMatcher {
private static IResourceFormat instance = null;
private SmilesParser parser = null;
public SMILESFormatMatcher() {
parser = new SmilesParser(
SilentChemObjectBuilder.getInstance()
);
}
public static IResourceFormat getInstance() {
if (instance == null)
instance = new SMILESFormatMatcher();
return instance;
}
public boolean matches(int lineNumber, String line) {
if (lineNumber == 1) {
String[] parts = line.split(" ");
if (parts.length == 2) {
String smiles = parts[0];
String name = parts[1]; // not used here
try {
parser.parseSmiles(smiles);
return true;
} catch (InvalidSmilesException exception) {}
}
}
return false;
}
public final MatchResult matches(final List<String> lines) {
if (lines.get(0) != null && matches(1, lines.get(0))) {
return new MatchResult(
true,
(IChemFormat)SMILESFormat.getInstance(),
Integer.valueOf(1)
);
}
return new MatchResult(false, null, Integer.MAX_VALUE);
}
}
If we then register this new matcher with the FormatFactory:
Script 11.4 code/GuessSMILES.groovy
Reader stringReader = new StringReader(
"O=CN formamide\n" +
"OCC ethanol\n"
);
FormatFactory factory = new FormatFactory();
factory.registerFormat(SMILESFormatMatcher.getInstance());
IChemFormat format = factory.guessFormat(stringReader);
System.out.println("Format: " + format.getFormatName());
And with this, we can detect a file with SMILES strings and names:
Format: SMILES
Keep in mind that the more specific your custom matcher is, the lower the chance that files in other formats are accidentally recognized by it.
Many input readers in the CDK allow reading from a Java Reader class, but all are required to also read from an InputStream. The difference between these two Java classes is that a Reader is based on a character stream, while an InputStream is based on a byte stream. For some readers this difference is crucial: XML-based formats, such as CML and the XML formats used by PubChem, should be read from an InputStream, not a Reader. For other formats, it does not matter. This makes it easy, for example, to read a file from a string with a StringReader (mind the newlines indicated by \n):
Script 11.5 code/InputFromStringReader.groovy
String bf3 = "4\n" +
"Bortrifluorid\n" +
"B 0.0000 0.0000 0.0000\n" +
"F 1.0000 0.0000 0.0000\n" +
"F -0.5000 -0.8660 0.0000\n" +
"F -0.5000 0.8660 0.0000\n";
reader = new XYZReader(
new StringReader(bf3)
)
chemfile = reader.read(new ChemFile())
mol = ChemFileManipulator.getAllAtomContainers(chemfile)
.get(0)
println "Atom count: $mol.atomCount"
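The difference between a character stream and a byte stream can be seen with the JDK alone. The sketch below is only an illustration: the class ReaderVersusStream and its two counting methods are hypothetical helpers, not CDK code. It counts what each kind of stream delivers for the same text. A plausible reason why XML should be read as bytes, which is a general property of XML rather than something stated in this chapter, is that an XML document declares its own character encoding, so the parser needs the undecoded bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: compares a character stream with a byte stream.
class ReaderVersusStream {

    // Counts the characters delivered by a Reader.
    static int countChars(Reader reader) {
        try {
            int count = 0;
            while (reader.read() != -1) count++;
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the bytes delivered by an InputStream.
    static int countBytes(InputStream stream) {
        try {
            int count = 0;
            while (stream.read() != -1) count++;
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "1.0 \u00C5\n"; // "1.0", a space, an angstrom sign, a newline
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("characters: " + countChars(new StringReader(text)));
        System.out.println("bytes: " + countBytes(new ByteArrayInputStream(utf8)));
    }
}
```

The angstrom sign is one character but two bytes in UTF-8, so the two counts differ by one.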
But besides reading XML files correctly, the support for InputStream also allows reading files directly from the internet and from gzipped files (see Section 12.4).
As an example, the small script below takes a PubChem compound identifier (CID), downloads the corresponding ASN.1 XML file, parses it, and counts the number of atoms:
Script 11.6 code/PubChemDownload.groovy
cid = 5282253
reader = new PCCompoundXMLReader(
new URL(
"https://pubchem.ncbi.nlm.nih.gov/summary/" +
"summary.cgi?cid=$cid&disopt=SaveXML"
).newInputStream()
)
mol = reader.read(new AtomContainer())
println "CID: " + mol.getProperty("PubChem CID")
println "Atom count: $mol.atomCount"
It reports:
CID: 5282253
Atom count: 43
PubChem ASN.1 files come with an extensive list of molecular properties. These are stored as properties on the molecule object and can be retrieved using the getProperties() method, or, using the Groovy bean formalism:
Script 11.7 code/PubChemDownloadProperties.groovy
mol.properties.each {
line = "" + it
println line
}
which lists the properties for the earlier downloaded domoic acid:
PubChem CID=5282253
Compound Complexity=510
Fingerprint (SubStructure Keys)=00000371E0723800000000000000000000000000000160...
000000000000000000000000000000001E00100800000D28C18004020802C00200880220D208...
000000002000000808818800080A001200812004400004D000988003BC7F020E800000000000...
00000000000000000000000000000000
IUPAC Name (Allowed)=(2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-5-carboxy-1-me...
thyl-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
IUPAC Name (CAS-like Style)=(2S,3S,4S)-4-[(2Z,4E,6R)-6-carboxyhepta-2,4-dien-2...
-yl]-3-(carboxymethyl)-2-pyrrolidinecarboxylic acid
IUPAC Name (Markup)=(2<I>S</I>,3<I>S</I>,4<I>S</I>)-4-[(2<I>Z</I>,4<I>E</I>,6<...
I>R</I>)-6-carboxyhepta-2,4-dien-2-yl]-3-(carboxymethyl)pyrrolidine-2-carbox...
ylic acid
IUPAC Name (Preferred)=(2S,3S,4S)-4-[(2Z,4E,6R)-6-carboxyhepta-2,4-dien-2-yl]-...
3-(carboxymethyl)pyrrolidine-2-carboxylic acid
IUPAC Name (Systematic)=(2S,3S,4S)-3-(2-hydroxy-2-oxoethyl)-4-[(2Z,4E,6R)-6-me...
thyl-7-oxidanyl-7-oxidanylidene-hepta-2,4-dien-2-yl]pyrrolidine-2-carboxylic...
acid
IUPAC Name (Traditional)=(2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-5-carboxy-...
1-methyl-hexa-1,3-dienyl]proline
InChI (Standard)=InChI=1S/C15H21NO6/c1-8(4-3-5-9(2)14(19)20)11-7-16-13(15(21)2...
2)10(11)6-12(17)18/h3-5,9-11,13,16H,6-7H2,1-2H3,(H,17,18)(H,19,20)(H,21,22)/...
b5-3+,8-4-/t9-,10+,11-,13+/m1/s1
InChIKey (Standard)=VZFRNCSOCOPNDB-AOKDLOFSSA-N
Log P (XLogP3-AA)=-1.3
Mass (Exact)=311.13688739
Molecular Formula=C15H21NO6
Molecular Weight=311.33
SMILES (Canonical)=CC(C=CC=C(C)C1CNC(C1CC(=O)O)C(=O)O)C(=O)O
SMILES (Isomeric)=C[C@H](/C=C/C=C(/C)\[C@H]1CN[C@@H]([C@H]1CC(=O)...
O)C(=O)O)C(=O)O
Topological (Polar Surface Area)=124
Weight (MonoIsotopic)=311.13688739
The history of the CDK project has seen many bug reports about problems that in fact turned out to be problems with the input file. The general perception seems to be that because a file could be written, its content must be consistent.
However, this is a serious misconception. Several kinds of problems are found in chemical files in the wild. A first common problem is that the file does not conform to the syntax of the specification. For example, something else is given at a place where a number is expected; not uncommonly, this is caused by incorrect use of whitespace.
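Why whitespace matters can be illustrated with the counts line of an MDL V2000 molfile, where the atom count occupies columns 1 to 3 and the bond count columns 4 to 6. The sketch below is JDK-only and is not the CDK parser; the class CountsLine and its parse method are hypothetical names used for illustration.

```java
// Hypothetical helper: reads atom and bond counts from fixed columns.
class CountsLine {

    // Returns {atomCount, bondCount} taken from columns 1-3 and 4-6.
    static int[] parse(String line) {
        int atoms = Integer.parseInt(line.substring(0, 3).trim());
        int bonds = Integer.parseInt(line.substring(3, 6).trim());
        return new int[] { atoms, bonds };
    }

    public static void main(String[] args) {
        String correct = "  3  2  0  0  0  0  0  0  0  0999 V2000";
        String collapsed = " 3 2 0 0 0 0 0 0 0 0999 V2000"; // whitespace lost
        int[] counts = parse(correct);
        System.out.println(counts[0] + " atoms, " + counts[1] + " bonds");
        try {
            parse(collapsed);
        } catch (NumberFormatException e) {
            // columns 4-6 now hold "2 0", which is not a number
            System.out.println("cannot parse: " + e.getMessage());
        }
    }
}
```

The collapsed line looks fine to a human reader, yet a fixed-column parser finds two numbers where it expects one.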
A second problem is that the file looks perfectly reasonable, but that the software that wrote the file used conventions and extensions that are not supported by the reading software. A common example is the use of the D and T symbols, for deuterium and tritium in MDL molfiles, where the specification does not allow that.
A third problem is that most chemical file formats do not disallow incorrect chemical graphs. For example, formats often allow an atom to be bonded to itself, which will cause problems when analyzing the graph. These problems are much rarer, though.
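A check for the self-bond case is simple to write down. The following JDK-only sketch assumes that the bonds are available as pairs of atom numbers, as in the bond block of a molfile; the class SelfBondCheck is a hypothetical helper and not part of the CDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds bonds that connect an atom to itself.
class SelfBondCheck {

    // Each bond is a pair {begin, end} of atom numbers.
    // Returns the zero-based indices of the offending bonds.
    static List<Integer> findSelfBonds(int[][] bonds) {
        List<Integer> offending = new ArrayList<>();
        for (int i = 0; i < bonds.length; i++) {
            if (bonds[i][0] == bonds[i][1]) {
                offending.add(i);
            }
        }
        return offending;
    }

    public static void main(String[] args) {
        int[][] bonds = { {1, 2}, {3, 3}, {1, 3} };
        System.out.println("Self-bonds at: " + findSelfBonds(bonds));
    }
}
```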
The IChemObjectReader has a feature that allows setting a validating mode, which has two values:
Script 11.8 code/ReadingModes.groovy
IChemObjectReader.Mode.each {
println it
}
returning:
RELAXED
STRICT
The STRICT mode follows the exact format specification. The RELAXED mode allows for a few common extensions, such as support for the T and D element types. For example, let’s consider this file:
CDK
3 2 0 0 0 0 0 0 0 0999 V2000
2.5369 -0.1550 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.0739 0.1550 0.0000 D 1 0 0 0 0 0 0 0 0 0 0 0
2.0000 0.1550 0.0000 T 1 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
M ISO 2 2 2 3 3
M END
If we read this file with:
Script 11.9 code/ReadStrict.groovy
reader = new MDLV2000Reader(
new File("data/t.mol").newReader(),
Mode.STRICT
);
water = reader.read(new AtomContainer());
println "atom count: $water.atomCount"
we get this exception:
invalid symbol: D
However, if we read the file in RELAXED mode with this code:
Script 11.10 code/ReadRelaxed.groovy
reader = new MDLV2000Reader(
new File("data/t.mol").newReader(),
Mode.RELAXED
);
water = reader.read(new AtomContainer());
println "atom count: $water.atomCount"
the file is read as desired:
atom count: 3
When a file is being read in RELAXED mode, it is possible to get error messages. This functionality is provided by the IChemObjectReaderErrorHandler support in IChemObjectReader. For example, we can define this custom error handler:
Script 11.11 code/CustomErrorHandler.groovy
class ErrorHandler
implements IChemObjectReaderErrorHandler {
public void handleError(String message) {
println message;
};
public void handleError(String message,
Exception exception)
{
println message + "\n -> " +
exception.getMessage();
};
public void handleError(String message,
int row, int colStart, int colEnd)
{
print "location: " + row + ", " +
colStart + "-" + colEnd + ": ";
println message;
};
public void handleError(String message,
int row, int colStart, int colEnd,
Exception exception)
{
print "location: " + row + ", " +
colStart + "-" + colEnd + ": "
println message + "\n -> " +
exception.getMessage()
};
}
and use that when reading a file:
Script 11.12 code/ReadErrorHandler.groovy
reader = new MDLV2000Reader(
new File("data/t.mol").newReader(),
Mode.RELAXED
);
reader.setErrorHandler(new ErrorHandler());
water = reader.read(new AtomContainer());
we get these warnings via the handler interface:
location: 6, 31-33: invalid symbol: D
location: 7, 31-33: invalid symbol: T
Because of an issue in older versions of the CDK, the above may not show any warnings. This has been fixed in CDK 2.3, see commit 547b028e17656f54a080a885a166377320b3a8ad.
Some remote databases gzip their data files to reduce the download size. The Protein Data Bank (PDB) is such a database. Fortunately, Java has a simple API to work with gzipped files, using the GZIPInputStream:
Script 11.13 code/PDBCoordinateExtraction.groovy
reader = new PDBReader(
new GZIPInputStream(
new URL(
"http://files.rcsb.org/download/1CRN.pdb.gz"
).openStream()
)
);
crambin = reader.read(new ChemFile());
for (container in
ChemFileManipulator.getAllAtomContainers(
crambin
)) {
for (atom in container.atoms()) {
println atom.point3d;
}
}
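The wrapping pattern used above can be tried without a network connection. The sketch below uses only the JDK; the class GzipRoundTrip and its two methods are hypothetical helpers. It compresses a string in memory and reads it back through a GZIPInputStream, which is the same stream that is handed to the PDBReader above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzips a string in memory and reads it back.
class GzipRoundTrip {

    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            // copy the decompressed bytes chunk by chunk
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String record = "HEADER    PLANT PROTEIN\n";
        System.out.print(gunzip(gzip(record)));
    }
}
```

Because GZIPInputStream is itself an InputStream, any reader that accepts an InputStream can consume gzipped data without further changes.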
By default, the CDK readers read structures into memory. This is fine for a relatively small model, but it no longer works for large files, such as 1 GB MDL SD files [3]. To allow processing of such large files, the CDK can take advantage of the fact that these SD files are basically a concatenation of MDL molfiles. Therefore, one can use an iterating reader to process the molecules one by one.
MDL SD files can be processed using the IteratingSDFReader, for example, to print the molecular formula of each structure:
Script 11.14 code/IteratingSDFReaderDemo.groovy
iterator = new IteratingSDFReader(
new File("data/test6.sdf").newReader(),
DefaultChemObjectBuilder.getInstance()
)
while (iterator.hasNext()) {
IAtomContainer mol = iterator.next()
formula = MolecularFormulaManipulator
.getMolecularFormula(mol)
println MolecularFormulaManipulator
.getString(formula)
}
Which outputs the molecular formula for the three entries in the file:
C19H24Br2N2O6
C20H24N2O5S
C17H22N2O6S
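That an SD file can be processed one record at a time follows from its layout: each record ends with a line starting with $$$$. The following JDK-only sketch shows the idea of iterating over records without loading the whole file. It is not the implementation of the IteratingSDFReader, and the class SdRecordIterator is a hypothetical name; it returns each record as raw text rather than as a parsed molecule.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper: iterates over the raw text records of an SD file.
class SdRecordIterator implements Iterator<String> {

    private final BufferedReader reader;
    private String nextRecord;

    SdRecordIterator(Reader input) {
        this.reader = new BufferedReader(input);
        this.nextRecord = readRecord();
    }

    // Reads lines up to the next "$$$$" delimiter; null at end of input.
    private String readRecord() {
        try {
            StringBuilder record = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("$$$$")) {
                    return record.toString();
                }
                record.append(line).append('\n');
            }
            // tolerate a last record that lacks the closing delimiter
            String rest = record.toString();
            return rest.trim().isEmpty() ? null : rest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextRecord != null;
    }

    @Override
    public String next() {
        if (nextRecord == null) throw new NoSuchElementException();
        String current = nextRecord;
        nextRecord = readRecord(); // only one record is buffered at a time
        return current;
    }

    public static void main(String[] args) {
        String sd = "first\nM  END\n$$$$\nsecond\nM  END\n$$$$\n";
        Iterator<String> records = new SdRecordIterator(new StringReader(sd));
        int count = 0;
        while (records.hasNext()) {
            records.next();
            count++;
        }
        System.out.println("records: " + count);
    }
}
```

Memory use is bounded by the size of the largest record, not by the size of the file.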
Similarly, PubChem Compounds XML files can be processed taking advantage of an XML pull library, which is nicely hidden behind the same iterator interface as used for parsing MDL SD files. Iterating over a set of compounds is fairly straightforward with the IteratingPCCompoundXMLReader class:
Script 11.15 code/PubChemCompoundsXMLDemo.groovy
iterator = new IteratingPCCompoundXMLReader(
new File("data/aceticAcids38.xml").newReader(),
DefaultChemObjectBuilder.getInstance()
)
while (iterator.hasNext()) {
IAtomContainer mol = iterator.next()
formula = MolecularFormulaManipulator
.getMolecularFormula(mol)
println MolecularFormulaManipulator.getString(formula)
}
Which outputs the molecular formula for the three entries in the aceticAcids38.xml file:
C2H4O2
[C2H3O2]-
[C2H3HgO2]+
An interesting feature of file IO in the CDK is that it is customizable. Before giving all the details, let’s start with a simple example: creating a Gaussian input file for optimizing the structure of methane. We start from an XYZ file, methane.xyz:
5
methane
C 0.25700 -0.36300 0.00000
H 0.25700 0.72700 0.00000
H 0.77100 -0.72700 0.89000
H 0.77100 -0.72700 -0.89000
H -0.77100 -0.72700 0.00000
The output will look something like:
%nprocl=5
# b3lyp/6-31g opt
Job started on Linux cluster on 20041010.
0 1
C 0 0.257 -0.363 0.0
H 0 0.257 0.727 0.0
H 0 0.771 -0.727 0.89
H 0 0.771 -0.727 -0.89
H 0 -0.771 -0.727 0.0
The writer used the default IO options in the above example. So, the next step is to see which options the writer allows. To get a list of options for a certain IO class, one does something along these lines:
Script 11.16 code/ListIOOptions.groovy
IChemObjectWriter writer = new GaussianInputWriter();
for (IOSetting setting : writer.getIOSettings()) {
println "[" + setting.getName() + "]"
println "Option: " + setting.getQuestion()
println "Current value: " + setting.getSetting()
}
which results in the following output:
[OpenShell]
Option: Should the calculation be open shell?
Current value: false
[Comment]
Option: What comment should be put in the file?
Current value: Created with CDK (http://cdk.sf.net/)
[Memory]
Option: How much memory do you want to use?
Current value: unset
[Command]
Option: What kind of job do you want to perform?
Current value: energy calculation
[ProcessorCount]
Option: How many processors should be used by Gaussian?
Current value: 1
The IO settings system allows interactive setting of these options, but a perfectly fine alternative is to use a Java Properties object.
Consider the following source code:
Script 11.17 code/PropertiesSettings.groovy
// the custom settings
Properties customSettings = new Properties();
customSettings.setProperty("Basis", "6-31g*");
customSettings.setProperty("Command",
"geometry optimization");
customSettings.setProperty("Comment",
"Job started on Linux cluster on 20041010.");
customSettings.setProperty("ProcessorCount", "5");
PropertiesListener listener = new PropertiesListener(
customSettings
);
// create the writer
GaussianInputWriter writer = new GaussianInputWriter(
new FileWriter(new File("methane.gin"))
);
writer.addChemObjectIOListener(listener);
XYZReader reader = new XYZReader(
new FileReader(new File("data/methane.xyz"))
);
// convert the file
ChemFile content = (ChemFile)reader.read(new ChemFile());
IAtomContainer molecule = content.getChemSequence(0).
getChemModel(0).getMoleculeSet().getAtomContainer(0);
writer.write(molecule);
writer.close();
The PropertiesListener takes a Properties object as parameter in its constructor. Therefore, the properties are defined first, in the customSettings variable in the first few lines. The PropertiesListener listener is then instantiated with the customizations as constructor parameter.
The output writer, specified to write to the methane.gin file, is created, after which the ChemObjectIOListener is set. Only by setting this listener will the output be customized with the earlier defined properties. The rest of the code reads a molecule from an XYZ file and writes the content to the created Gaussian input file.
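The principle of answering setting questions from a Properties object can be pictured with the JDK alone. The sketch below is not the PropertiesListener implementation: the class SettingsLookup and its resolve method are hypothetical, and the default values are simply copied from the option listing shown earlier.

```java
import java.util.Properties;

// Hypothetical helper: picks a custom value or falls back to a default.
class SettingsLookup {

    // Returns the custom value for the named setting if one was given,
    // otherwise the default value.
    static String resolve(Properties custom, String name, String defaultValue) {
        return custom.getProperty(name, defaultValue);
    }

    public static void main(String[] args) {
        Properties custom = new Properties();
        custom.setProperty("ProcessorCount", "5");
        // customized setting
        System.out.println(resolve(custom, "ProcessorCount", "1"));
        // untouched setting keeps its default
        System.out.println(resolve(custom, "OpenShell", "false"));
    }
}
```

Settings that are not mentioned in the Properties object keep their defaults, which is why only the customized options need to be listed.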
We saw earlier an example for reading files directly from PubChem (see Section 12.2.1). This can be conveniently used to create CDK source code, for example, for use in unit tests for the atom type perception code (see Section 13.2). But because we do not want 2D and 3D coordinates being set in the source code, we disable those options:
Script 11.18 code/AtomTypeUnitTest.groovy
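cid = 3396560
// download the compound as in the earlier PubChem example
reader = new PCCompoundXMLReader(
  new URL(
    "https://pubchem.ncbi.nlm.nih.gov/summary/" +
    "summary.cgi?cid=$cid&disopt=SaveXML"
  ).newInputStream()
)
mol = reader.read(new AtomContainer())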
stringWriter = new StringWriter();
CDKSourceCodeWriter writer =
new CDKSourceCodeWriter(stringWriter);
customSettings = new Properties();
customSettings.setProperty("write2DCoordinates", "false");
customSettings.setProperty("write3DCoordinates", "false");
writer.addChemObjectIOListener(
new PropertiesListener(
customSettings
)
)
writer.write(mol);
writer.close();
println stringWriter.toString();
This results in this source code:
{
IChemObjectBuilder builder = DefaultChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newInstance(IAtomContainer.class);
IAtom a1 = builder.newInstance(IAtom.class,"P");
a1.setFormalCharge(0);
mol.addAtom(a1);
IAtom a2 = builder.newInstance(IAtom.class,"O");
a2.setFormalCharge(0);
mol.addAtom(a2);
IAtom a3 = builder.newInstance(IAtom.class,"O");
a3.setFormalCharge(0);
mol.addAtom(a3);
IAtom a4 = builder.newInstance(IAtom.class,"C");
a4.setFormalCharge(0);
mol.addAtom(a4);
IAtom a5 = builder.newInstance(IAtom.class,"H");
a5.setFormalCharge(0);
mol.addAtom(a5);
IAtom a6 = builder.newInstance(IAtom.class,"H");
a6.setFormalCharge(0);
mol.addAtom(a6);
IAtom a7 = builder.newInstance(IAtom.class,"H");
a7.setFormalCharge(0);
mol.addAtom(a7);
IAtom a8 = builder.newInstance(IAtom.class,"H");
a8.setFormalCharge(0);
mol.addAtom(a8);
IAtom a9 = builder.newInstance(IAtom.class,"H");
a9.setFormalCharge(0);
mol.addAtom(a9);
IBond b1 = builder.newInstance(IBond.class,a1, a2, IBond.Order.SINGLE);
mol.addBond(b1);
IBond b2 = builder.newInstance(IBond.class,a1, a3, IBond.Order.DOUBLE);
mol.addBond(b2);
IBond b3 = builder.newInstance(IBond.class,a1, a4, IBond.Order.SINGLE);
mol.addBond(b3);
IBond b4 = builder.newInstance(IBond.class,a1, a5, IBond.Order.SINGLE);
mol.addBond(b4);
IBond b5 = builder.newInstance(IBond.class,a2, a9, IBond.Order.SINGLE);
mol.addBond(b5);
IBond b6 = builder.newInstance(IBond.class,a4, a6, IBond.Order.SINGLE);
mol.addBond(b6);
IBond b7 = builder.newInstance(IBond.class,a4, a7, IBond.Order.SINGLE);
mol.addBond(b7);
IBond b8 = builder.newInstance(IBond.class,a4, a8, IBond.Order.SINGLE);
mol.addBond(b8);
}
Another common input mechanism in cheminformatics is the line notation. Several line notations have been proposed, including the Wiswesser Line Notation (WLN) [4] and the Sybyl Line Notation (SLN) [5], but the most popular is SMILES [6]. There is an open standard around this format called OpenSMILES, available at http://www.opensmiles.org/.
The CDK can both read and write SMILES, or at least a significant subset of the line notation. You can parse a SMILES into an IAtomContainer with the SmilesParser. The constructor of the parser takes an IChemObjectBuilder (see Section 11) because it needs to know which CDK interface implementation it must use to create objects. This example uses the DefaultChemObjectBuilder:
Script 11.19 code/ReadSMILES.groovy
sp = new SmilesParser(
DefaultChemObjectBuilder.getInstance()
)
mol = sp.parseSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
println "Aspirin has ${mol.atomCount} atoms."
Telling us the number of (non-hydrogen) atoms in aspirin:
Aspirin has 13 atoms.
Writing SMILES works in a similar way. But I would like to point out that by default the SmilesGenerator does not use the convention of lower case element symbols for aromatic atoms.
Script 11.20 code/WriteSMILES.groovy
mol = MoleculeFactory.makePhenylAmine()
generator = SmilesGenerator.generic()
smiles = generator.createSMILES(mol)
println "Ph-NH2 -> $smiles"
generator = SmilesGenerator.generic().aromatic()
smiles = generator.createSMILES(mol)
println "Ph-NH2 -> $smiles"
showing the different output without and with that option set:
Ph-NH2 -> C1(=CC=CC=C1)N
Ph-NH2 -> c1(ccccc1)N
The generic format does not output stereo information. For isomeric SMILES we need to use a different approach:
Script code/WriteIsomericSMILES.groovy
smilesParser = new SmilesParser(
  DefaultChemObjectBuilder.getInstance()
)
smiles = "F[C@@H](Cl)(Br)"
mol = smilesParser.parseSmiles(smiles)
generator = SmilesGenerator.generic()
smiles = generator.createSMILES(mol)
println "Generic SMILES: $smiles"
generator = SmilesGenerator.isomeric()
smiles = generator.createSMILES(mol)
println "Isomeric SMILES: $smiles"
showing the difference in output between .generic() and .isomeric():
Generic SMILES: FC(Cl)Br
Isomeric SMILES: F[C@@H](Cl)Br
Of course, writing aromatic SMILES with the .aromatic() option does require that aromaticity has been perceived, as explained in Section 18.5.
This section lists recipes for reading content from a few formats, taking into account common issues with the input.
Like any file format, MDL molfiles support a limited number of features. For example, the V2000 format cannot represent a quadruple bond: bond type 4 denotes an aromatic bond instead, which leaves the actual bond order unspecified. Other details that are not explicit include hydrogens and atom-based stereochemistry. Stereochemistry is wedge-bond-based, see Section ??.
An example file that uses bond type 4 is this file:
CDK
10 11 0 0 0 0 0 0 0 0999 V2000
208.0000 866.5142 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
175.5651 882.1340 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
167.5544 917.2314 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
190.0000 945.3774 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
226.0000 945.3774 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
248.4456 917.2314 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
240.4349 882.1340 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
271.3391 863.6697 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
298.4496 887.3555 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
284.3007 920.4585 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 4 0 0 0 0
2 3 4 0 0 0 0
3 4 4 0 0 0 0
4 5 4 0 0 0 0
5 6 4 0 0 0 0
6 7 4 0 0 0 0
7 1 4 0 0 0 0
7 8 4 0 0 0 0
8 9 4 0 0 0 0
9 10 4 0 0 0 0
10 6 4 0 0 0 0
M END
More recent MDL formats have become more powerful. The V3000 format can do much more than the V2000 format, let alone the pre-V2000 format.
Here’s a recipe with inline comments:
Script 11.21 code/InputMDLMolfiles.groovy
reader = new MDLV2000Reader(
new File("data/azulene4.mol").newReader(),
Mode.RELAXED
);
azulene = reader.read(new AtomContainer());
// perceive atom types
AtomContainerManipulator
.percieveAtomTypesAndConfigureAtoms(
azulene
)
// add missing hydrogens
adder = CDKHydrogenAdder.getInstance(
  DefaultChemObjectBuilder.getInstance()
)
adder.addImplicitHydrogens(azulene);
// if bond order 4 was present,
// deduce bond orders
Kekulization.kekulize(azulene);
println "Atom count: " + azulene.atomCount
doubleBondCount = 0
singleBondCount = 0
for (bond in azulene.bonds()) {
if (bond.order == Order.DOUBLE)
doubleBondCount++
if (bond.order == Order.SINGLE)
singleBondCount++
}
println "Single bonds: " + singleBondCount
println "Double bonds: " + doubleBondCount
This code will perceive CDK atom types. These types are needed to add the missing hydrogens, as well as to resolve the bond order information. The input has ten atoms and eleven bonds, all marked with bond order 4.
The result of the above post-processing is:
Atom count: 10
Single bonds: 6
Double bonds: 5
SDF files are quite similar to MDL molfiles, but they can hold an arbitrary number of chemical structures and can carry properties (see also Section 21.3). An example file from ChEMBL [7]:
RDKit 2D
12 12 0 0 0 0 0 0 0 0999 V2000
0.0000 -4.1250 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
1.4314 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.7136 -1.2334 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7136 -0.4084 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7136 -2.8834 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -1.6500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4314 -1.6500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -2.4750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4314 -2.4750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7136 -3.7084 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1450 -0.4084 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 11 1 0
2 5 1 0
2 12 1 0
3 5 2 0
4 5 1 0
4 7 2 0
4 8 1 0
6 9 2 0
6 10 1 0
6 11 1 0
7 9 1 0
8 10 2 0
M END
> <chembl_id>
CHEMBL3183843
> <chembl_pref_name>
None
We can read this file and extract the property with the following approach:
Script code/SDFWithProperties.groovy
while (iterator.hasNext()) {
mol = iterator.next()
println mol.getProperty("chembl_id")
}
This extracts the chembl_id property:
CHEMBL3183843
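How the properties are laid out in the file can be seen in the example above: a header line such as > <chembl_id> is followed by the value and a blank line. The following JDK-only sketch parses that layout into a map. It is an illustration under that assumption and not the CDK's parser; the class SdDataItems is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses the data items that follow "M  END".
class SdDataItems {

    // A header line starts with '>' and carries the name in angle brackets.
    private static final Pattern HEADER = Pattern.compile("^>.*<([^>]+)>.*$");

    static Map<String, String> parse(String dataBlock) {
        Map<String, String> properties = new LinkedHashMap<>();
        String name = null;
        StringBuilder value = new StringBuilder();
        for (String line : dataBlock.split("\n")) {
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                // a new header also closes an item that lacks a blank line
                if (name != null) properties.put(name, value.toString());
                name = header.group(1);
                value.setLength(0);
            } else if (name != null) {
                if (line.trim().isEmpty()) {
                    // a blank line ends the current item
                    properties.put(name, value.toString());
                    name = null;
                } else {
                    if (value.length() > 0) value.append('\n');
                    value.append(line);
                }
            }
        }
        if (name != null) properties.put(name, value.toString());
        return properties;
    }

    public static void main(String[] args) {
        String block = "> <chembl_id>\nCHEMBL3183843\n\n"
                     + "> <chembl_pref_name>\nNone\n\n";
        System.out.println(parse(block).get("chembl_id"));
    }
}
```

Note that the value None of chembl_pref_name is just a string in the file; the sketch does not interpret it as a missing value.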