Extracting RDF from Chem4Word documents
Joe has released the first Chem4Word demo file, and has written about how to extract the CML with Java and with C#.
I haven’t actually gotten around to fiddling with Java, but I did run Strigi against the demo file, with the Strigi-Chemistry plugins installed, to extract RDF. This is part of the RDF that came out:
<example-doc.docx>
  <http://freedesktop.org/standards/xesam/1.0/core#title>
    "acetic acid",
    "(8R,9S,10R,13S,14S,17S)- 17-hydroxy-10,13-dimethyl- 1,2,6,7,8,9,11,12,14,15,16,17-dodecahydrocyclopenta[a] phenanthren-3-one",
    "testosterone";
  <http://freedesktop.org/standards/xesam/1.0/core#version>
    "2",
    "2";
  <http://rdf.openmolecules.net/0.9#atomCount>
    "8",
    "49";
  <http://rdf.openmolecules.net/0.9#bondCount>
    "7",
    "52";
  <http://rdf.openmolecules.net/0.9#molecularFormula>
    "C2H4O2",
    "C19H28O2";
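As a quick sanity check on these numbers, the atom counts can be recomputed from the molecular formulas. The sketch below is only that: the helper name count_atoms is mine, it handles plain formulas without brackets or charges, and pairing each formula with an atomCount by position is an assumption, since the RDF attaches all values to the document rather than to individual molecules.

```python
import re

def count_atoms(formula):
    """Sum the element counts in a simple formula such as 'C2H4O2'."""
    total = 0
    # Each match is an element symbol followed by an optional count.
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += int(digits) if digits else 1
    return total

# Assumption: values are paired by their position in the output above.
for formula, reported in [("C2H4O2", 8), ("C19H28O2", 49)]:
    print(formula, count_atoms(formula), reported)
```

Under that pairing, both formulas agree with the reported atom counts.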
I believe there is quite some room for improvement, but it’s a start :) Thanks to Joe for posting the public domain test file, so that other projects can start playing with this exciting new technology. I should note, however, that I am running neither a Microsoft OS nor MS Word, so the source of the saved documents is the only way I have to access the CML right now.
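Getting at that source does not need Word, since a .docx file is a ZIP package of XML parts. Here is a minimal sketch in Python, assuming the CML lives in one of those parts and declares the usual CML namespace, http://www.xml-cml.org/schema. The function name find_cml_parts is mine, and I have not checked where Chem4Word actually puts the CML, so the code scans every XML part instead of guessing a path.

```python
import zipfile

# Assumption: embedded CML declares the standard CML schema namespace.
CML_NAMESPACE = b"http://www.xml-cml.org/schema"

def find_cml_parts(docx_file):
    """Return the names of XML parts in a .docx that mention the CML namespace."""
    hits = []
    with zipfile.ZipFile(docx_file) as package:
        for name in package.namelist():
            if not name.endswith(".xml"):
                continue
            # Plain byte search; this would miss parts encoded as UTF-16.
            if CML_NAMESPACE in package.read(name):
                hits.append(name)
    return hits
```

Calling find_cml_parts("example-doc.docx") should then list the candidate parts, which can be read with package.read(name) and handed to any XML or CML parser.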