chempyformatics

Input/Output

Reading from Readers and InputStreams

Example: Downloading Domoic Acid from PubChem

As an example, below will follow a small script that takes a PubChem compound identifier (CID) and downloads the corresponding ASN.1 XML file, parses it and counts the number of atoms:

Script code/PubChemDownload.py

cid = 5282253
reader = PCCompoundXMLReader(
  URL(
    "https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=" + str(cid) +
    "&disopt=SaveXML"
  ).openStream()
)
mol = reader.read(SilentChemObjectBuilder.getInstance().newAtomContainer())
print("CID: {}".format(mol.getProperty("PubChem CID")))
print(f"Atom count: {mol.getAtomCount()}")

It reports:

CID: 5282253
Atom count: 43

PubChem ASN.1 files come with an extensive list of molecular properties. These are stored as properties on the molecule object and can be retrieved using the getProperties() method, or, using the Groovy bean formalism:

Script code/PubChemDownloadProperties.py

cid = 5282253
reader = PCCompoundXMLReader(
  URL(
    "https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=" + str(cid) +
    "&disopt=SaveXML"
  ).openStream()
)
mol = reader.read(SilentChemObjectBuilder.getInstance().newAtomContainer())
for key, value in mol.getProperties().items():
  print(f"{key}={value}")

which lists the properties for the earlier downloaded domoic acid:

PubChem CID=5282253
Compound Complexity=510
Fingerprint (SubStructure Keys)=00000371E0723800000000000000000000000000000160...
  000000000000000000000000000000001E00100800000D28C18004020802C00200880220D208...
  000000002000000808818800080A001200812004400004D000988003BC7F020E800000000000...
  00000000000000000000000000000000
IUPAC Name (Allowed)=(2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-5-carboxy-1-me...
  thyl-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
IUPAC Name (CAS-like Style)=(2S,3S,4S)-4-[(2Z,4E,6R)-6-carboxyhepta-2,4-dien-2...
  -yl]-3-(carboxymethyl)-2-pyrrolidinecarboxylic acid
IUPAC Name (Markup)=(2<I>S</I>,3<I>S</I>,4<I>S</I>)-4-[(2<I>Z</I>,4<I>E</I>,6<...
  I>R</I>)-6-carboxyhepta-2,4-dien-2-yl]-3-(carboxymethyl)pyrrolidine-2-carbox...
  ylic acid
IUPAC Name (Preferred)=(2S,3S,4S)-4-[(2Z,4E,6R)-6-carboxyhepta-2,4-dien-2-yl]-...
  3-(carboxymethyl)pyrrolidine-2-carboxylic acid
IUPAC Name (Systematic)=(2S,3S,4S)-3-(2-hydroxy-2-oxoethyl)-4-[(2Z,4E,6R)-6-me...
  thyl-7-oxidanyl-7-oxidanylidene-hepta-2,4-dien-2-yl]pyrrolidine-2-carboxylic...
   acid
IUPAC Name (Traditional)=(2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-5-carboxy-...
  1-methyl-hexa-1,3-dienyl]proline
InChI (Standard)=InChI=1S/C15H21NO6/c1-8(4-3-5-9(2)14(19)20)11-7-16-13(15(21)2...
  2)10(11)6-12(17)18/h3-5,9-11,13,16H,6-7H2,1-2H3,(H,17,18)(H,19,20)(H,21,22)/...
  b5-3+,8-4-/t9-,10+,11-,13+/m1/s1
InChIKey (Standard)=VZFRNCSOCOPNDB-AOKDLOFSSA-N
Log P (XLogP3-AA)=-1.3
Mass (Exact)=311.13688739
Molecular Formula=C15H21NO6
Molecular Weight=311.33
SMILES (Canonical)=CC(C=CC=C(C)C1CNC(C1CC(=O)O)C(=O)O)C(=O)O
SMILES (Isomeric)=C[C@H](/C=C/C=C(/C)\[C@H]1CN[C@@H]([C@H]1CC(=O)O)C(=O)O)C(=O)O
Topological (Polar Surface Area)=124
Weight (MonoIsotopic)=311.13688739

Line Notations

Another common input mechanism in cheminformatics is the line notation. Several line notations have been proposed, including the Wiswesser Line Notation (WLN) [1] and the Sybyl Line Notation (SLN) [2], but the most popular is SMILES [3]. There is a Open Standard around this format called OpenSMILES, available at http://www.opensmiles.org/.

SMILES

The CDK can both read and write SMILES, or at least a significant subset of the line notation. You can parse a SMILES into a IAtomContainer with the SmilesParser. The constructor of the parser takes an IChemObjectBuilder (see Section ??) because it needs to know what CDK interface implementation it must use to create classes. This example uses the DefaultChemObjectBuilder:

Script code/ReadSMILES.py

sp = SmilesParser(Builder.getInstance())
mol = sp.parseSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
print(f"Aspirin has {mol.getAtomCount()} atoms.")

This outputs:

Aspirin has 13 atoms.

References

  1. Wiswesser WJ. How the WLN began in 1949 and how it might be in 1999. JCICS. 1982 May 1;22(2):88–93. doi:10.1021/CI00034A005 (Scholia)
  2. Homer RW, Swanson J, Jilek RJ, Hurst T, Clark RD. SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries. JCIM. 2008 Dec 1;48(12):2294–307. doi:10.1021/CI7004687 (Scholia)
  3. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. JCICS [Internet]. 1988 Feb 1;28(1):31–6. Available from: http://organica1.org/seminario/weininger88.pdf doi:10.1021/CI00057A005 (Scholia)