The CDK supports only a subset of CxSMILES, and this chapter gives an overview of how far that support goes. Also refer to the discussion between Emma Schymanski and John Mayfield.
CxSMILES for lipids can use the [H]C\C=C\CC(=O)O |Sg:n:1:x:ht,Sg:n:4:y:ht| syntax.
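To make the structure of such a string more concrete, the following pure-Python sketch splits a CxSMILES into its SMILES part and its extension, and then breaks each Sg field into pieces. It does not use the CDK. The function names `split_cxsmiles` and `parse_sgroups` are made up for this illustration, and the field layout `Sg:type:atom indexes:subscript:connectivity` is an assumption based on the examples in this chapter; any other extension fields are simply skipped.

```python
def split_cxsmiles(cxsmiles):
    """Split 'SMILES |extension|' into the SMILES and the raw extension text."""
    smiles, _, rest = cxsmiles.partition(" |")
    return smiles, rest.rstrip().rstrip("|")


def parse_sgroups(extension):
    """Return the Sg fields of a CxSMILES extension as dictionaries.

    Assumed layout: Sg:type:atom indexes:subscript:connectivity.
    Other extension fields are skipped.
    """
    chunks, in_sg = [], False
    for token in extension.split(","):
        if token.startswith("Sg:"):
            chunks.append(token)
            in_sg = True
        elif in_sg and token[:1].isdigit():
            # commas also separate atom indexes, so glue these back on
            chunks[-1] += "," + token
        else:
            in_sg = False
    groups = []
    for chunk in chunks:
        fields = chunk.split(":") + ["", "", ""]  # pad missing trailing fields
        groups.append({
            "type": fields[1],
            "atoms": [int(i) for i in fields[2].split(",") if i],
            "subscript": fields[3],
            "connectivity": fields[4],
        })
    return groups


smiles, extension = split_cxsmiles("[H]C\\C=C\\CC(=O)O |Sg:n:1:x:ht,Sg:n:4:y:ht|")
for group in parse_sgroups(extension):
    print(group)
```

For the lipid example this yields two groups of type n, one on atom 1 with subscript x and one on atom 4 with subscript y, both with head-to-tail (ht) connectivity. The atom indexes are 0-based and follow the order of the atoms in the SMILES.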
We can parse this CxSMILES string into a CDK data model with the regular SmilesParser.
In Groovy, this looks like:
Script code/ParseCXSMILES.groovy
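import org.openscience.cdk.silent.SilentChemObjectBuilder
import org.openscience.cdk.smiles.SmilesParser
// the CxSMILES extension follows the SMILES, between the | characters
sp = new SmilesParser(
  SilentChemObjectBuilder.getInstance()
)
mol1 = sp.parseSmiles("[H]C\\C=C\\CC(=O)O |Sg:n:1:x:ht,Sg:n:4:y:ht|")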
Alternatively, we can parse the CxSMILES in Python with pybacting, which wraps Bacting [1] (which in turn uses the CDK):
from pybacting import cdk
from scyjava import to_python as j2p
import IPython
cxmol = cdk.fromSMILES("[H]C\\C=C\\CC(=O)O |Sg:n:1:x:ht,Sg:n:4:y:ht|")
svg = j2p(cdk.asSVG(cxmol))
display(IPython.display.SVG(svg))
We can even open this as a Jupyter Notebook in Google Colab.
Finally, the online CDKDepict can parse and process CxSMILES strings for you without any coding.
After the CxSMILES is parsed by the CDK, it is stored in memory as an IAtomContainer. This IAtomContainer is the CDK data model for bonded atoms. Most commonly it is used to store the chemical graph, in which atoms are represented as vertices and bonds as edges.
We can list the atoms and bonds with this code:
Script code/DataModel.groovy
println("The atoms:")
for (atom : mol.atoms()) {
  println("" + atom.getSymbol() + " with " + atom.getImplicitHydrogenCount() + " hydrogens")
}
println("The bonds:")
for (bond : mol.bonds()) {
  println("" + bond.getOrder() + " between " + bond.getAtomCount() + " atoms")
}
which lists:
The atoms:
H with 0 hydrogens
C with 2 hydrogens
C with 1 hydrogens
C with 1 hydrogens
C with 2 hydrogens
C with 0 hydrogens
O with 0 hydrogens
O with 1 hydrogens
The bonds:
SINGLE between 2 atoms
SINGLE between 2 atoms
DOUBLE between 2 atoms
SINGLE between 2 atoms
SINGLE between 2 atoms
DOUBLE between 2 atoms
SINGLE between 2 atoms
The other details, such as the S groups from the CxSMILES extension, are stored as a property on the atom container, and each of them refers to the atoms and bonds it applies to. As such, they can be seen as coloring the graph:
Script code/SGroups.groovy
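import org.openscience.cdk.CDKConstants
import org.openscience.cdk.sgroup.SgroupType
println("The S groups:")
// the S groups are stored as a single property on the atom container
sgroups = mol.getProperty(CDKConstants.CTAB_SGROUPS)
for (sgroup : sgroups) {
  if (sgroup.getType() == SgroupType.CtabStructureRepeatUnit) {
    println("" + sgroup.getType() + " with " + sgroup.getAtoms().size() + " atoms and " + sgroup.getBonds().size() + " bonds")
  }
}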
which lists:
The S groups:
CtabStructureRepeatUnit with 1 atoms and 2 bonds
CtabStructureRepeatUnit with 1 atoms and 2 bonds
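The idea of S groups as a coloring of the graph can be sketched without the CDK. The pure-Python example below hard-codes the atoms and bonds in the order of the two listings above, and treats each S group as a set of 0-based atom indexes taken from the CxSMILES. The helper `crossing_bonds` is a hypothetical name, and reading the two bonds per S group as the bonds that cross the repeat-unit bracket is an assumption that happens to agree with the output above; it is not taken from the CDK documentation.

```python
# Atoms and bonds of [H]C\C=C\CC(=O)O, in the order they were printed above.
atoms = ["H", "C", "C", "C", "C", "C", "O", "O"]
bonds = [
    (0, 1, "SINGLE"), (1, 2, "SINGLE"), (2, 3, "DOUBLE"), (3, 4, "SINGLE"),
    (4, 5, "SINGLE"), (5, 6, "DOUBLE"), (5, 7, "SINGLE"),
]

# The two repeat units from |Sg:n:1:x:ht,Sg:n:4:y:ht|, as sets of atom indexes.
sgroups = [{1}, {4}]


def crossing_bonds(atom_set, bonds):
    """Bonds with exactly one end inside the S group (assumed to be the bracketed bonds)."""
    return [b for b in bonds if (b[0] in atom_set) != (b[1] in atom_set)]


for atom_set in sgroups:
    crossing = crossing_bonds(atom_set, bonds)
    print("repeat unit with %d atoms and %d bonds" % (len(atom_set), len(crossing)))
```

The graph itself (the atoms and bonds) is left untouched; the S groups only point at parts of it, which is why they can be stored as a property on the atom container.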
We can also write a structure parsed from a CXSMILES to an SD file, using the following code:
Script code/WriteSDF.groovy
import org.openscience.cdk.io.SDFWriter
cxSMILES = "O=C(*)Oc1ccc(cc1)C(C)(C)c1ccc(O*)cc1 |Sg:n:0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20::ht|"
mol = sp.parseSmiles(cxSMILES)
writer = new FileWriter(new File("polybisacarb.sdf"))
SDFWriter sdfWriter = new SDFWriter(writer)
sdfWriter.write(mol)
// close the writer so that the output is flushed to the file
sdfWriter.close()
The output uses the * as pseudoatom, as in the input SMILES, and exports the CXSMILES aspects as additional annotation after the mol block:
CDK 05012421552D
21 22 0 0 0 0 0 0 0 0999 V2000
-5.2108 -1.5321 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-3.9180 -0.7714 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.9304 0.7285 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0
-2.6128 -1.5107 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.3200 -0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0300 -1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3000 -0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3000 0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0300 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.3200 0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6027 1.4936 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8980 0.7372 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5725 2.6379 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.6442 2.6474 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.1670 2.3869 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7922 3.5074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2880 4.9671 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.2616 6.1082 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.7602 7.5219 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0
1.1892 5.2276 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1623 4.0283 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
2 4 1 0 0 0 0
4 5 1 0 0 0 0
5 6 2 0 0 0 0
6 7 1 0 0 0 0
7 8 2 0 0 0 0
8 9 1 0 0 0 0
9 10 2 0 0 0 0
5 10 1 0 0 0 0
8 11 1 0 0 0 0
11 12 1 0 0 0 0
11 13 1 0 0 0 0
11 14 1 0 0 0 0
14 15 2 0 0 0 0
15 16 1 0 0 0 0
16 17 2 0 0 0 0
17 18 1 0 0 0 0
18 19 1 0 0 0 0
17 20 1 0 0 0 0
20 21 2 0 0 0 0
14 21 1 0 0 0 0
M STY 1 1 SRU
M SAL 1 15 13 21 4 6 20 5 7 14 17 8 1 9 18 11 2
M SAL 1 4 10 12 15 16
M SBL 1 2 19 2
M SMT 1 n
M SCN 1 1 HT
M SDI 1 4 -1.6471 7.0407 -0.3747 6.5894
M SDI 1 4 -4.5992 -0.0270 -3.2492 -0.0159
M END
> <PUBCHEM_SUBSTANCE_COMMENT>
O=C(*)Oc1ccc(cc1)C(C)(C)c1ccc(O*)cc1 |Sg:n:0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20::ht|
> <PUBCHEM_EXT_DATASOURCE_REGID>
NISTpolymer0006
> <PUBCHEM_SUBSTANCE_SYNONYM>
Poly(bisphenol-A-carbonate
$$$$
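As a small consistency check on this output, the following pure-Python sketch compares the atoms listed on the M SAL lines with the atom indexes in the Sg field of the input CXSMILES. The function name `sgroup_atoms` is made up for this illustration. It assumes that each M SAL line holds the S group number, an entry count, and then that many atom numbers, and that mol file atom numbers are 1-based while CXSMILES indexes are 0-based.

```python
# The two M SAL lines from the output above, and the Sg atom list from the input.
SAL_LINES = [
    "M SAL 1 15 13 21 4 6 20 5 7 14 17 8 1 9 18 11 2",
    "M SAL 1 4 10 12 15 16",
]
CX_ATOMS = "0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20"


def sgroup_atoms(lines):
    """Collect the 1-based atom numbers per S group from M SAL lines."""
    atoms = {}
    for line in lines:
        parts = line.split()
        if parts[:2] == ["M", "SAL"]:
            sgroup, count = int(parts[2]), int(parts[3])
            # an S group with many atoms is spread over several M SAL lines
            atoms.setdefault(sgroup, set()).update(int(p) for p in parts[4:4 + count])
    return atoms


from_molfile = sgroup_atoms(SAL_LINES)[1]
from_cxsmiles = {int(i) + 1 for i in CX_ATOMS.split(",")}  # shift to 1-based
print("same atoms:", from_molfile == from_cxsmiles)
print("atoms outside the repeat unit:", sorted(set(range(1, 22)) - from_molfile))
```

The two atoms that fall outside the repeat unit are the * pseudoatoms, which mark where the unit connects to its neighbours.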