cdkbook

Substructure Searching

Fingerprints

Substructure searching is a relatively slow algorithm, and the time required to compare two molecules scales with the number of atoms in each molecule. To reduce the computation time, molecular fingerprints were invented. There are two key aspects to fingerprints that make them efficient: first, they have a fixed length so that the time to compare two molecule is independent of the size of the two structures; secondly, the fingerprint of a substructure always matches the fingerprint of any molecules that has that substructure.

In this section we will see two fingerprint types available in the CDK: a substructure based fingerprint, and a path based fingerprint. Before I will explain how these fingerprints are created, we will first look at the BitSet class that is used by the CDK to represent these fingerprints. Consider this code:

Script 15.1 code/BitSetDemo.groovy

bitset = new BitSet(10);
println "Empty bit set: $bitset";
bitset.set(3);
bitset.set(7);
println "Two bits set: $bitset";

If we analyze the output, we see that all set bits are listed, and that all other bits are not:

Empty bit set: {}
Two bits set: {3, 7}

Let us now consider a simple substructure fingerprint of length four with the following bit definitions:

Let’s call this fingerprinter SimpleFingerprinter:

Script 15.2 code/SimpleFingerprinter.java

public class SimpleFingerprinter implements IFingerprinter {
  Map<String,Integer> map = new HashMap<String,Integer>() ;
  public BitSet getFingerprint(IAtomContainer molecule) {
    BitSet bitSet = new BitSet(getSize());
    for (IAtom atom : molecule.atoms()) {
      if (map.containsKey(atom.getSymbol()))
        bitSet.set(map.get(atom.getSymbol()));
    }
    return bitSet;
  }
  public Map<String,Integer> getRawFingerprint(
    IAtomContainer molecule
  ) {
    Map<String,Integer> fingerprint =
      new HashMap<String,Integer>();
    for (String key : map.keySet()) {
      fingerprint.put(key, 0);
    }
    for (IAtom atom : molecule.atoms()) {
      int count = map.get(atom.getSymbol());
      fingerprint.put(atom.getSymbol(), count+1);
    }
    return fingerprint;
  }
  public ICountFingerprint getCountFingerprint(
    IAtomContainer molecule
  ) throws CDKException {
    return new IntArrayCountFingerprint(
      getRawFingerprint(molecule)
    );
  }
  public IBitFingerprint getBitFingerprint(
    IAtomContainer molecule
  ) throws CDKException {
    return new BitSetFingerprint(
      getFingerprint(molecule)
    );
  }
  public int getSize() {
    return 4;
  }
  public String getVersionDescription() { return ""; }
}

We can then calculate the fingerprints for ethanol and benzene:

Script 15.3 code/SimpleFingerprintDemo.groovyl

fingerprinter = new SimpleFingerprinter();
println "ethanol: " + fingerprinter.getFingerprint(ethanol)
println "benzene: " + fingerprinter.getFingerprint(benzene)

and we get these bit sets:

ethanol: {1, 3}
benzene: {1}

Now, we can replace the presence of a particular atom, by the presence of a substructure, such as a phenyl or a carbonyl group. We have then defined a substructure fingerprint.

The CDK has several kinds of fingerprints, including path-based fingerprints (Fingerprinter and HybridizationFingerprinter), a MACSS fingerprint (MACSSFingerprinter) [1], and the PubChem fingerprint (PubChemFingerprinter). These fingerprints have been used for various tasks, including ligand classification [2], and databases like BRENDA [3] and TIN [4].

MACCS Fingerprints

One substructure-based fingerprinter is the MACCSFingerprinter which partly implements the MACSS fingerprint specification [1]. The substructures are defined as SMARTS substructure specifications, inherited from RDKit (http://rdkit.org/). For this fingerprint it is required the implicit hydrogen counts are first set:

Script 15.4 code/MACCSFingerprint.groovy

fingerprinter = new MACCSFingerprinter();
println "ethanol: " +
  fingerprinter.getBitFingerprint(ethanol).
    asBitSet()

The object returned by the getBitFingerprint method is the IBitFingerprint which we can convert into a Java BitSet with the asBitSet method:

ethanol: {81, 108, 113, 138, 152, 154, 156, 159, 163}

ECFP and FCFP Fingerprints

The CDK also has an implementation for the circular ECFP and FCFP fingerprints [5]. These are developed by Alex M. Clark at Collaborative Drug Discovery, Inc in the CircularFingerprinter [6]. It implements both in four variants: ECFP0, ECFP2, ECFP4, ECFP6, FCFP0, FCFP2, FCFP4, and FCFP6. The code is quite similar as for other fingerprints, but we do have to indicate what variant we want:

Script 15.5 code/ECFPFingerprint.groovy

fingerprinter = new CircularFingerprinter(
  CircularFingerprinter.CLASS_ECFP6
);
println "ethanol: " +
  fingerprinter.getBitFingerprint(ethanol).
    asBitSet()

Again we get an IBitFingerprint resulting in a BitSet of bits:

ethanol: {152, 325, 625, 740, 947, 993}

References

  1. Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. JCICS. 2002 Nov 1;42(6):1273–80. doi:10.1021/CI010132R (Scholia)
  2. Chao, Wang L, Xie X-Q. Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS) and its application on modeling ligand functionality for 5HT-subtype GPCR families. JCIM. 2011 Mar 7;51(3):521–31. doi:10.1021/CI100399J (Scholia)
  3. Schomburg I, Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, et al. BRENDA, the enzyme database: updates and major new developments. NAR. 2004 Jan 1;32(Database issue):D431-3. doi:10.1093/NAR/GKH081 (Scholia)
  4. Dorschner KV, Toomey D, Brennan MP, Heinemann T, Duffy FJ, Nolan KB, et al. TIN-a combinatorial compound collection of synthetically feasible multicomponent synthesis products. JCIM. 2011 Apr 15;51(5):986–95. doi:10.1021/CI100443X (Scholia)
  5. D R, M H. Extended-connectivity fingerprints. JCIM. 2010 May 24;50(5):742–54. doi:10.1021/CI100050T (Scholia)
  6. Clark AM, Sarker M, Ekins S. New target prediction and visualization tools incorporating open source molecular fingerprints for TB Mobile 2.0. J Cheminform. 2014;6(1):38. doi:10.1186/S13321-014-0038-2 (Scholia)