NoSQL Zone is brought to you in partnership with:

Davy Suvee is the founder of Datablend. He is currently working as an IT Lead/Software Architect in the Research and Development division of a large pharmaceutical company. Required to work with big and unstructured scientific data sets, Davy gathered hands-on expertise and insights in the best practices on Big Data and NoSql. Through Datablend, Davy aims at sharing his practical experience within a broader IT environment. Davy is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone. You can read more from them at their website. View Full User Profile

The joy of algorithms and NoSQL: a MongoDB example

08.16.2011
| 15736 views |
  • submit to reddit

In one of my previous blog posts, I debated the superficial idea that you should own billions of data records before you are eligible to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically MongoDB, to solve a specific Chemoinformatics problem in a truly elegant and efficient way. The complete source code can be found on the Datablend public GitHub repository.

 

1. Molecular similarity theory

Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that similar compounds generally exhibit similar biological activities.) Unfortunately, finding substructures in chemical compounds is a NP-complete problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of structural keys and fingerprints.

In case of structural keys, we precompute the answers on a couple of specific questions that try to capture the essential characteristics of a compound. Each answer is assigned a fixed location within a bitstring. At query time, a lot of time is saved by only executing substructure searches for compounds that have compatible structural keys. (Compatibility being computed by making use of efficient bit operators.)
When employing fingerprints, all linear substructure patterns of a certain length are calculated. As the number of potential patterns is huge, it is not possible to assign an individual bit position to each possible pattern (as is done with structural keys). Instead, the fingerprints patterns are used in a hash. The downside of this approach is that, depending of the size of the hash, multiple fingerprint patterns share the same bit position, giving lead to potential false positives.

In this article, we will demonstrate the use of non-hashed fingerprints to calculate compound similarities (i.e. using the raw fingerprints). This approach has two advantages:

  1. We eliminate the chance of false positives
  2. The raw fingerprints can be used in other types of structural compound mining

 

 

2. Molecular similarity practice

Let’s start by showing how to calculate the fingerprints of a chemical compound. Various fingerprinting algorithms are available today. Luckily, we don’t need to implement these algorithms ourselves. The excellent open-source jCompoundMapper library provides us with all the required functionality. It uses MDL SD formatted compounds as input and is able to output a variety of fingerprints. We start by creating a reader for MDL SD files. Afterwards, we use the 2DMol fingerprinter to encode the first molecule in the compounds.sdf file.

RandomAccessMDLReader reader = new RandomAccessMDLReader(new File("compounds.sdf"));
EncodingFingerprint fingerprinter = new Encoding2DMolprint();

Molecule molecule = reader.getMol(0);
FeatureMap fingerprints = new FeatureMap(fingerprinter.getFingerprint(molecule));

for (IFeature fingerprint : fingerprints.getKeySet()) {
    System.out.println(fingerprint.featureToString());
}

The first molecule of the input file has the following chemical structure: C52H87N3O13. Its 2DMol fingerprint contains 120 unique fingerprint patterns, a selection of them shown below:

0[N]1[C O]2[C C C]
0[N]1[C O]2[C C C]3[C C C C C]
0[C]1[C C C]2[C C N O]3[C C C C O O]
0[C]1[C C]2[C C C C O]3[C C N O]
0[O]1[C]2[C O]3[C C C]
0[C]1[C O O]2[C C C O]
0[C]1[C C]2[C C]
0[C]1[C]2[C]3[C O]
0[C]1[C C N]2[C C C C O]3[C C C O]
...

Now the question that remains is how to measure the similarity between the fingerprints of compounds A and B. Several methods are again available, but in our case we will use the so-called Tanimoto association coefficient, which is defined as:

 

 

NA refers to the number of unique fingerprint patterns found in compound A, while NB refers to the number of unique fingerprint patterns found in compound B. NAB specifies the number of unique fingerprint patterns found in both compound A and B. As can be observed from the equation, two identical compounds would have a Tanimoto coefficient of 1.0.

 

3. MongoDB datamodel

MongoDB is a so-called document-oriented database. When using document-oriented databases, you basically group together all related data in a single document instead of normalizing your data in various RDBMS tables and using joins to retrieve the required information later on. In case of MongoDB, documents are stored using BSON (binary JSON) and related documents are stored in collections. Let’s start by describing the compounds collection that stores a separate document for each compound. The JSON-document of our C52H87N3O13 compound looks as follows:

{ 
    "compound_cid" : "46200001" , 
    "smiles" : "CCC1C(C(C(C(=NOCC=CCN2CCCCC2)C(CC(C(C(C(C(C(=O)O1)C)OC3CC(C(C(O3)C)O)(C)OC)C)OC4C(C(CC(O4)C)N(C)CC5=CC=CC=C5)O)(C)O)C)C)O)(C)O" ,
    "fingerprint_count" : 120 , 
    "fingerprints" : [ 
        "0[N]1[C O]2[C C C]" ,
        "0[N]1[C O]2[C C C]3[C C C C C]" ,
        "0[C]1[C C C]2[C C N O]3[C C C C O O]" ,
        "0[C]1[C C]2[C C C C O]3[C C N O]" ,
        "0[O]1[C]2[C O]3[C C C]" , 
        "0[C]1[C O O]2[C C C O]" , 
        "0[C]1[C C]2[C C]" , 
        "0[C]1[C]2[C]3[C O]" , 
        "0[C]1[C C N]2[C C C C O]3[C C C O]" ,
        ... ] , 
}

For each compound, we store its unique PubChem Id and SMILES representation. Additionally, we store the full set of unique fingerprint patterns as an array of strings. As an efficiency shortcut, we also store the total amount of fingerprint patterns as a separate property. Hence, this number does not need to be computed at run-time while querying. By storing all compound information in a single document, queries can be performed more efficiently. This in comparison with the more traditional RDBMS model where a join would be required to retrieve the information stored in a separate compound and fingerprint table. The Java Driver for MongoDB makes it really easy to create this JSON document for each compound. Once we retrieved a molecule through the MDL file reader, it is just a matter of creating the necessary document objects and inserting them in the compounds collection.

// Retrieve the molecule
Molecule molecule = ...;
FeatureMap fingerprints = new FeatureMap(fingerprinter.getFingerprint(molecule));

// Create the new document that will hold the compound
BasicDBObject compound = new BasicDBObject();

// Add the simple properties one by one to this compound object
compound.put(COMPOUNDCID_PROPERTY, molecule.getProperty("PUBCHEM_COMPOUND_CID"));
compound.put(SMILES_PROPERTY, molecule.getProperties().get("PUBCHEM_OPENEYE_CAN_SMILES"));
compound.put(FINGERPRINTCOUNT_PROPERTY, fingerprints.getSize());

// Iterate the fingerprints
ArrayList fingerprintstosave = new ArrayList();
for (IFeature fingerprint : fingerprints.getKeySet()) {
    fingerprintstosave.add(fingerprint.featureToString());
}

// Add the full fingerprint list to the compound document
compound.put(FINGERPRINTS_PROPERTY, fingerprintstosave.toArray());

// Compound document is created, insert it.
compoundsCollection.insert(compound);

To complete our MongoDB data model, we will add the fingerprintscount collection. The rationale for this collection will be explained in the next section of this article. For now, just consider it to be a collection that stores the number of times a particular fingerprint pattern was encountered when importing the compound data. The listing below shows an extract of the fingerprintscount collection.

{ 
    "fingerprint" : "0[N]1[C O]2[C C C]",
    "count" : 472
}
{ 
    "fingerprint" : "0[N]1[C O]2[C C C]3[C C C C C]",
    "count" : 41
}
{
    "fingerprint" : "0[O]1[C]2[C O]3[C C C]",
    "count" : 1343
} 

To update this collection, we make use of MongoDB’s increment operator in combination with an upsert. This way, whenever a fingerprint pattern is encountered for the first time, a document is automatically created for the fingerprint and its count is put on 1. If document already exists for this fingerprint pattern, its associated count is incremented by 1. Pretty easy!

// Retrieve the molecule
Molecule molecule = ...;
FeatureMap fingerprints = new FeatureMap(fingerprinter.getFingerprint(molecule));

// Iterate the fingerprints
for (IFeature fingerprint : fingerprints.getKeySet()) {

    // Create a +1 increment count for this fingerprint
    BasicDBObject countplusone = new BasicDBObject();
    countplusone.put(COUNT_PROPERTY,1);
    BasicDBObject increment = new BasicDBObject();
    increment.put("$inc", countplusone);

    // Create the fingerprint document itself
    BasicDBObject the_fingerprint = new BasicDBObject();
    the_fingerprint.put(FINGERPRINT_PROPERTY, fingerprint.featureToString());

    // Perform an upsert
    fingerprintCountsCollection.update(the_fingerprint,increment,true,false);
}

For enhancing query performance, we need to define indexes for the appropriate document properties. By doing so, the created B-Tree indexes are used by MongoDB ‘s query optimizer to quickly find back the required documents instead of needing to scan through all of them. Creating these indexes is very easy as illustrated below.

compoundsCollection.ensureIndex(new BasicDBObject().append(FINGERPRINTS_PROPERTY, 1));
compoundsCollection.ensureIndex(new BasicDBObject().append(FINGERPRINTCOUNT_PROPERTY, 1));

fingerprintCountsCollection.ensureIndex(new BasicDBObject().append(FINGERPRINT_PROPERTY, 1));

 

4. MongoDB molecular similarity query

It’s time to bring it all together. Imagine we want to find all compounds that have a Tanimoto coefficient of 0.8 for a particular input compound. As MongoDB’s querying speed is quite fast we could try to brute-force it and basically compute the Tanimoto coefficient for each compound document that is stored in the compounds collection. But let’s try to do it a bit smarter. By looking at the Tanimoto coefficient equation, we can already narrow the search space significantly. Imagine our input compound (A) has 40 unique fingerprint patterns. Let’s fill in some of the parameters in the Tanimoto equation:

 

 

From this equation, we can deduct the minimum and maximum number of unique fingerprint patterns another compound should have in order to (in the best case) satisfy the equation:

 

 

Hence, the compound we are comparing with should only be considered if it has between 32 and 50 unique fingerprint patterns. By taking this into account in our search query, we can narrow the search space significantly. But we can optimize our query a bit further. Imagine we would query for all documents that share at least 1 of 9 unique fingerprints patterns with the input compound. All documents that are not part of the resultset of this query will never be able to reach the Tanimoto coefficient of 0.8, as the maximum of possibly remaining shared fingerprint patterns would be 31, and

 

 

were we replaced NB by 31 to maximize the resulting Tanimoto coefficient. This 9-number can be obtained by solving the following equation:

 

 

Which 9 fingerprint patterns of the input compound should we consider in this query? Common sense tells us the pick the 9 fingerprint patterns for which the occurrence in the total compound population is the lowest, as this would narrow our search space even further. (Hence the need for the fingerprintscount collection to be able to quickly retrieve the counts of the individual fingerprint patterns.) The code listing below illustrates how to retrieve the counts for the individual patterns. Using the MongoDB QueryBuilder we create the query object that, using the in-operator, specifies that a document from the fingerprintscount collection should only be returned if its fingerprint pattern is part of the fingerprint patterns we are interested in. Using the additional sort-operator, we retrieve the individual fingerprint patterns in ascending count-order.

// Find all fingerprint count documents that have a fingerprint in the list
DBObject fingerprintcountquery =   
    QueryBuilder.start(FINGERPRINT_PROPERTY).in(fingerprintsToFind.toArray()).get();

// Sort the result on count
DBObject fingerprintcountsort = 
    QueryBuilder.start(COUNT_PROPERTY).is(1).get();

// Execute the query on the fingerprint counts collection
DBCursor fingerprintcounts = 
    fingerprintCountsCollection.find(fingerprintcountquery).sort(fingerprintcountsort);

It’s time for the final step, namely retrieving all compound documents that have the potential to satisfy a Tanimoto coefficient of for instance 0.8. For doing this, we build a query that retrieves all compounds that have at least one pattern out of the sublist of fingerprint patterns we calculated in the previous step. Additionally, only compounds that have a total count of fingerprint patterns that is within the minimum and maximum range should be returned. Although this optimized query drastically narrows the search space, we still need to compute the Tanimoto coefficient afterwards to ensure that it has a value of 0.8 or above.

// Calculate the essential numbers
int maxnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() / 0.8);
int minnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() * 0.8);
int numberoffingerprintstoconsider = fingerprintsToFind.size() - minnumberofcompoundfingerprints + 1;

List fingerprintsToConsider = fingerprintsToFind.subList(0,numberoffingerprintstoconsider);

// Find all compounds that satisfy the specified conditions
DBObject compoundquery =   
    QueryBuilder.start(FINGERPRINTS_PROPERTY).in(fingerprintsToConsider)
                .and(FINGERPRINTCOUNT_PROPERTY).lessThanEquals(maxnumberofcompoundfingerprints)
                .and(FINGERPRINTCOUNT_PROPERTY).greaterThanEquals(minnumberofcompoundfingerprints)
                .get();

// Execute the query
DBCursor compounds = compoundsCollection.find(compoundquery);

// Let's process the found compounds locally
while(compounds.hasNext()) {

    DBObject compound = compounds.next();
    // Retrieve all fingerprints of the compound
    BasicDBList fingerprints = ((BasicDBList) compound.get(FINGERPRINTS_PROPERTY));
    // Calculate the intersection on the total list of fingerprints
    fingerprints.retainAll(fingerprintsToFind);

    int sharecount = fingerprints.size();
    int totalcount = (Integer)compound.get(FINGERPRINTCOUNT_PROPERTY);
    // Calculate the Tanimoto coefficient
    double tanimoto = (double) sharedcount / (totalcount + fingerprintsToFind.size() - sharedcount);
    
    // We still need to check whether the tanimoto is really >= the required similarity
    if (tanimoto >= 0.8) {
        System.out.println(compound.get(COMPOUNDCID_PROPERTY) + " " + (int)(tanimoto * 100) + "%");
    }

}

 

5. Conclusion

The MongoDB implementation is able to easily screen a database of a million compounds in subsecond time. The required calculation time however, largely depends on the target Tanimoto: a target Tanimoto of 0.6 will result in significantly more results compared to a target Tanimoto of 0.8. When using the MongoDB explain functionality, one can observe that the query time is rather short, compared to the time that is required to transfer the data to the Java application and execute the Tanimoto calculations. In my next article, I will discuss the use of MongoDB’s build-in map-reduce functionality to keep the Tanimoto calculation local to the compound data and hence improve overall performance.

References
Published at DZone with permission of Davy Suvee, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)