NoSQL Zone is brought to you in partnership with:

Davy Suvee is the founder of Datablend. He is currently working as an IT Lead/Software Architect in the Research and Development division of a large pharmaceutical company. Required to work with big and unstructured scientific data sets, Davy gathered hands-on expertise and insights in the best practices on Big Data and NoSql. Through Datablend, Davy aims at sharing his practical experience within a broader IT environment. Davy is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone. You can read more from them at their website. View Full User Profile

Redis and Lua: A NoSQL Power-Horse

01.31.2013
| 3855 views |
  • submit to reddit

Recently, I’ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be ‘structured’ through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various ‘Big Data’-related problems. To illustrate this point, I reimplemented my MongoDB-based molecular similarity search through Redis and its integrated Lua support. As always, the complete source code can be found on the Datablend public GitHub repository.

1. Redis ‘fingerprint’ data model

Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called fingerprints (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the Tanimoto coefficient. This computation involves the calculation of intersections between sets of fingerprints, an operation that is natively supported by Redis.

Our Redis-based data model for storing fingerprints requires three different data-structures:

  1. For each compound, identified by an unique key, we store its set of fingerprints (where each fingerprint is again identified by an unique key).
  2. For each fingerprint, identified by an unique key, we store the set of compounds containing this fingerprint. These fingerprint sets can be conceived as the inverted indexes of the compound sets mentioned above.
  3. For each fingerprint, we store its number of occurrences through a dedicated weight-key.

Fingerprints are calculated by using the jCompoundMapper library. The Jedis (Java Redis Client) library is employed for persisting the data-structures mentioned above. Three simple atomic operations (lines 30, 33 and 35) are sufficient to create both the inverted indexes (compound->fingerprints and fingerprint->compounds) and incrementing the accompanying counters.

RandomAccessMDLReader reader = new RandomAccessMDLReader(new File(...));
EncodingFingerprint fingerprinter = new Encoding2DMolprint();
 
// We will use a pipeline in order to speedup the persisting process
Pipeline p = jedis.pipelined();
 
// Iterate the compounds one by one
for (int i = 0; i < reader.getSize(); i++) {
 
// Retrieve the molecule and the fingerprints for this molecule
Molecule molecule = reader.getMol(i);
FeatureMap fingerprints = new FeatureMap(fingerprinter.getFingerprint(molecule));
 
// Retrieve some of the compound properties we want to use later on
String compound_cid = (String)molecule.getProperty("PUBCHEM_COMPOUND_CID");
 
// Iterate the fingerprints
for (IFeature fingerprint : fingerprints.getKeySet()) {
 
// Check whether we already encountered the feature and create accordingly (
String thefeaturestring = fingerprint.featureToString();
if (!fingerprintlist.contains(thefeaturestring)) {
fingerprintlist.add(thefeaturestring);
}
 
// Get the index of the fingerprint
int fingerprintindex = fingerprintlist.indexOf(thefeaturestring);
 
// Increment the weight of this fingerprint (number of occurences)
p.incr(fingerprintindex + ":w");
// Create the inverted indexes
// Add the fingerprint to the set of fingerprints of this compound
p.sadd((compound_cid + ":f"), fingerprintindex + "");
// Add the compound to the set of compounds of this fingerprint
p.sadd(fingerprintindex + ":c", compound_cid + "");
 
}
// Sync the changes
p.sync();
}

2. Finding similar chemical compounds

For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my original MongoDB article. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the build-in Lua scripting support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we calculate the fingerprints of interest (i.e. the min-set of fingerprints that lead us to compounds that are able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single sort-command; it takes the compound-key as input and sorts the contained fingerprints by employing the value of the external fingerprint weight keys. Out of this sorted set of fingerprints, we sub-select the top x fingerprints of interest. What a powerful and elegant command!

We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis union-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis intersect-command. Only compounds that satisfy the Tanimoto restriction are returned.

-- Retrieve the input paramters
local inputCompound = ARGV[1];
local similarity = ARGV[2];
 
-- Get the number of fingerprints of the input compound
local countToFind = redis.call('scard', inputCompound .. ':f');
 
-- Calculate the max, min and number of fingerprints to consider
local maxFingerprints = math.floor(countToFind / similarity);
local minFingerprints = math.floor(countToFind * similarity);
local numberOfFingerprintsToconsider = math.floor(countToFind - minFingerprints);
 
-- Retrieve the fingerprints of interest by subselecting out of the sorted list of fingerprints based upon the number of occurences
local fingerprintsOfInterest = redis.call('sort', inputCompound .. ':f', 'by', '*:w', 'limit', 0, numberOfFingerprintsToconsider + 1);
 
-- Create the set of all fingerprints we are interested in (creating the redis keys along the way)
local fingerprintKeys = {};
for index = 1, #fingerprintsOfInterest do
table.insert(fingerprintKeys, fingerprintsOfInterest[index] .. ':c');
end
 
-- Retreive the set of possible matching compounds by taking the union of the fingerprint -> compound indexes
local possibleCompounds = redis.call('sunion',unpack(fingerprintKeys));
 
-- Calculate the tanimoto coefficient for the list of possible matching compounds with the inputcompound
local results = {};
for index = 1, #possibleCompounds do
-- Check whether the count of fingerprints is within the predefined range
local count = redis.call('scard', possibleCompounds[index] .. ':f');
if count >= minFingerprints and count <= maxFingerprints then
-- Calculate the matching fingerprints by calculating the intersection
local intersectCount = redis.call('sinter', inputCompound .. ':f', possibleCompounds[index] .. ':f');
local tanimoto = #intersectCount / (count + countToFind - #intersectCount);
-- If sufficient similar, add it the set of results
if tanimoto >= tonumber(similarity) then
table.insert(results,possibleCompounds[index]); table.insert(results, '' .. tanimoto);
end
end
end
 
-- Return the results
return results;

3. Conclusion

With 25.000 stored compounds, Redis requires less then 20ms to retrieve compounds that are 70% similar to a particular input compound. Snappier compared to my original MongoDB implementation. In addition, Redis requires less then 1GB of RAM to maintain a live index of the 460.000 PubChem compounds that have at least one associated assay. This allows scientist to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.



 

Published at DZone with permission of Davy Suvee, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)