Extracting RDF from Chem4Word documents

Joe has released the first Chem4Word demo file, and has written about how to extract the CML with Java and with C#.

I haven’t gotten around to trying the Java route yet, but I did run Strigi against the file, with the Strigi-Chemistry plugins installed, to extract RDF. This is part of the RDF that came out:

<example-doc.docx>
  <http://freedesktop.org/standards/xesam/1.0/core#title>
    "acetic acid",
    "(8R,9S,10R,13S,14S,17S)- 17-hydroxy-10,13-dimethyl- 1,2,6,7,8,9,11,12,14,15,16,17-dodecahydrocyclopenta[a] phenanthren-3-one",
    "testosterone";
  <http://freedesktop.org/standards/xesam/1.0/core#version>
    "2",
    "2";
  <http://rdf.openmolecules.net/0.9#atomCount>
    "8",
    "49";
  <http://rdf.openmolecules.net/0.9#bondCount>
    "7",
    "52";
  <http://rdf.openmolecules.net/0.9#molecularFormula>
    "C2H4O2",
    "C19H28O2";

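As a quick sanity check on the numbers above, the atom counts can be recomputed from the molecular formulas. This is only a sketch in Python: the helper name `atom_count` is mine, it handles only flat formulas without brackets or charges, and it assumes the first formula belongs with the first atom count and the second with the second, which the RDF itself does not actually guarantee.

```python
import re


def atom_count(formula: str) -> int:
    """Sum the element counts in a flat formula such as 'C2H4O2'.

    Assumption: no brackets, charges or isotopes; a missing count means 1.
    """
    total = 0
    for _symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += int(digits) if digits else 1
    return total


# Values copied from the Strigi output: formula -> reported atomCount.
# The pairing by position is an assumption, not something the RDF states.
reported = {"C2H4O2": 8, "C19H28O2": 49}

for formula, count in reported.items():
    status = "matches" if atom_count(formula) == count else "differs"
    print(formula, atom_count(formula), count, status)
```

Both formulas agree with the reported atom counts under that pairing.
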
I believe there is quite some room for improvement, but it’s a start :) Thanks to Joe for posting the public domain test file, so that other projects can start playing with this exciting new technology. I should note, however, that I am running neither a Microsoft OS nor MS Word, so the source of the saved documents is the only way I have access to the CML right now.
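
For anyone else without Word, getting at the saved document source could look roughly like the sketch below. A .docx file is a ZIP archive of XML parts, so the script walks the archive and collects every element in the CML namespace. The function name `extract_cml` is my own, and I am assuming that the CML sits in one of the XML parts under the usual `http://www.xml-cml.org/schema` namespace; this is not Joe's Java or C# code.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the embedded chemistry uses the standard CML namespace.
CML_NS = "http://www.xml-cml.org/schema"


def extract_cml(docx_path):
    """Return (part name, CML as string) for each <cml> element found."""
    found = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue  # skip images, relationship files, etc.
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # not well-formed XML, ignore this part
            # iter() also visits the root, so a part that is pure CML is found
            for elem in root.iter("{%s}cml" % CML_NS):
                found.append((name, ET.tostring(elem, encoding="unicode")))
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    for part, cml in extract_cml(sys.argv[1]):
        print(part)
        print(cml)
```

Run it with the path to the demo file as the only argument; it prints the name of each archive part that contains CML, followed by the CML itself.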
