
Davy Suvee is the founder of Datablend. He currently works as an IT Lead/Software Architect in the Research and Development division of a large pharmaceutical company. Working with big, unstructured scientific data sets, Davy has gathered hands-on expertise and insight into Big Data and NoSQL best practices. Through Datablend, Davy aims to share his practical experience with a broader IT audience. Davy is a DZone MVB.

Counting Triangles Smarter: or, How to Beat Big Data Vendors at Their Own Game

02.25.2013

A few months ago, I discovered Vertica’s “Counting Triangles” article through Prismatic. The blog post describes a number of benchmarks on counting triangles in large networks. A triangle is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network: if two of your friends are also friends with each other, the three of you define a friendship triangle. Counting all triangles within a large network is a rather compute-intensive task. In its most naive form, an algorithm iterates through all vertices in the network, retrieving the adjacent vertices of their adjacent vertices. If one of the vertices adjacent to these second-level neighbours is identical to the origin vertex, we have identified a triangle.
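As a sketch, the naive approach boils down to three nested loops over an adjacency map. (This is only a minimal illustration of the idea; the `countTriangles` method, adjacency representation and example below are my own, not code from the Vertica benchmark.)

```java
import java.util.*;

public class NaiveTriangles {

    // adjacency: vertex id -> ids of all adjacent vertices (bi-directional)
    static long countTriangles(Map<Integer, Set<Integer>> adjacency) {
        long walks = 0;
        for (int origin : adjacency.keySet()) {
            for (int neighbour : adjacency.get(origin)) {
                for (int candidate : adjacency.get(neighbour)) {
                    // is a neighbour of our neighbour adjacent to the origin?
                    if (adjacency.get(candidate).contains(origin)) {
                        walks++;
                    }
                }
            }
        }
        // each triangle is walked 6 times: 3 starting vertices x 2 directions
        return walks / 6;
    }
}
```

Note that every triangle is discovered six times, once per starting vertex and direction, which is one of the inefficiencies a smarter algorithm can remove.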

The Vertica article illustrates how to execute an optimised implementation of the above algorithm through Hadoop and through their own Massively Parallel Processing (MPP) database product (both run on a 4-node cluster). The dataset is the LiveJournal social network graph, containing around 86 million relationships, in which around 285 million triangles are identified. As can be expected, the Vertica solution shines in all respects, counting all triangles in 97 seconds and beating the Hadoop solution by a factor of 40. A few weeks later, the Oracle guys published a similar blog post, using their Exadata platform to beat Vertica’s result by a factor of 7, clocking in at 14 seconds.

Although Vertica’s and Oracle’s results are impressive, they require a significant hardware setup: 4 nodes, each with 96GB of RAM and 12 cores. My challenge: beat the Big Data vendors at their own game by counting triangles with a smarter algorithm that delivers similar performance on commodity hardware (i.e. my MacBook Pro Retina).

1. Doing it the smart way

The LiveJournal social network graph, about 1.3GB in raw size, contains around 86 million relationships. Each line in this file declares a relationship between a source and a target vertex (where each vertex is identified by a unique id). Relationships are assumed to be bi-directional: if person 1 knows person 2, person 2 also knows person 1.

0 1
0 7
0 8
0 9
1 2
1 5
2 0
2 5
2 7489

Let’s start by creating a row-like data structure for storing these relationships. The key of each row is the id of the source vertex. The row values are the ids of all target vertices associated with that source vertex.

0 - 1 7 8 9
1 - 2 5
2 - 0 5 7489

With this structure in place, one can execute the naive algorithm as described above. Unfortunately, iterating four levels deep will result in mediocre performance. Let’s improve our data structure by indexing each relationship through its lowest key. So, even though the LiveJournal file declares the relationship as being “2 0“, we persist the relationship by assigning the 2-value to the 0-row. (Order doesn’t matter as relationships are bi-directional anyway.)

0 - 1 2 7 8 9
1 - 2 5
2 - 5 7489
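Building these lowest-key rows from the raw edge list can be sketched as follows (an illustrative snippet; the method and names are mine, not the actual Datablend implementation):

```java
import java.util.*;

public class LowestKeyRows {

    // each edge is an int[2] {source, target}; every relationship is stored
    // exactly once, in the row of its smallest vertex id
    static Map<Integer, SortedSet<Integer>> buildRows(List<int[]> edges) {
        Map<Integer, SortedSet<Integer>> rows = new HashMap<>();
        for (int[] edge : edges) {
            int low = Math.min(edge[0], edge[1]);
            int high = Math.max(edge[0], edge[1]);
            // e.g. "2 0" in the input file ends up as value 2 in row 0
            rows.computeIfAbsent(low, k -> new TreeSet<>()).add(high);
        }
        return rows;
    }
}
```

Because relationships are bi-directional, flipping an edge to its lowest-key form loses no information, and a set-valued row automatically de-duplicates edges that appear in both directions.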

Calculating triangles becomes a lot easier (and faster) now. If the key of a row is part of a triangle, its two adjacent vertices must be in its list of values (as, by definition, the row key is the smallest vertex id of the three). Hence, we need to check whether we can find edges amongst the vertices contained within each row. So, for each row, we iterate through its list of values. For each of these values, we retrieve the associated row and verify whether one of its values is part of the original source values. By doing so, we get rid of one expensive for-loop. Nevertheless, the number of checks that need to be executed is still close to 2 billion!

sourceValues = sourceRow.getValues();
for (sourceValue : sourceValues) {
  // only vertices with a higher id are stored in each row,
  // so every triangle is counted exactly once
  targetValues = rows[sourceValue].getValues();
  for (targetValue : targetValues) {
    if (sourceValues.contains(targetValue)) {
      triangles++;
    }
  }
}

2. Persisting the relationships

The data structure described above is persisted in a custom datastore that we developed at Datablend to power the similr engine (a chemical structure search engine). The datastore is fully persistent and optimised for quickly performing set-based operations (intersections, differences, unions, …). Parsing the 86 million relationships and creating the appropriate in-memory data structure takes around 20 seconds on my MacBook Pro. An additional 4 seconds is required for persisting the entire data structure to the datastore itself: around 25 seconds in total for effectively storing all 86 million relationships. Neither Vertica nor Oracle mentions the time it takes to load the LiveJournal dataset into their respective databases, but I assume it also takes them at least a few seconds.
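This is also why a datastore optimised for set operations fits the problem so well: the two inner loops of the algorithm are exactly a set intersection, since every value w contained in both row(k) and row(v) closes a triangle (k, v, w). A minimal sketch of this formulation (again with illustrative names, using plain Java sets rather than the actual datastore):

```java
import java.util.*;

public class IntersectionTriangles {

    // rows: smallest vertex id -> ids of the larger endpoints of its edges
    static long countTriangles(Map<Integer, Set<Integer>> rows) {
        long triangles = 0;
        for (Set<Integer> sourceValues : rows.values()) {
            for (int v : sourceValues) {
                Set<Integer> targetValues =
                        rows.getOrDefault(v, Collections.emptySet());
                // |sourceValues ∩ targetValues| triangles close through v
                for (int w : targetValues) {
                    if (sourceValues.contains(w)) {
                        triangles++;
                    }
                }
            }
        }
        return triangles;
    }
}
```

Because each row only stores higher vertex ids, every triangle is counted exactly once, at the row of its smallest vertex; no division by 6 (or any other de-duplication) is needed.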

What about disk usage? The custom Datablend datastore takes second place, requiring only 34 MB more than Oracle’s Hybrid Columnar Compression (Query High) version.

1. Oracle HCC Query High:   262 MB
2. Datablend datastore:     296 MB (+13 %)
3. Oracle HCC Query Low:    361 MB (+38 %)
4. Vertica:                 560 MB (+114 %)

3. Calculating the triangles

The Oracle setup (a cluster of 4 nodes, each with 96GB of RAM and 12 cores) calculates the 285 million triangles in 14 seconds. The optimised algorithm described above, running on the custom Datablend datastore, takes first place, clocking in at 9 seconds! The calculation runs fully parallelized on my MacBook Pro Retina, with a peak usage of only 2.11 GB of RAM!

1. Datablend datastore:     9 sec                 (MacBook Pro Retina)
2. Oracle:                  14 sec (+55 %)        (4-node cluster)
3. Vertica:                 97 sec (+978 %)       (4-node cluster)

4. Conclusion

Datablend’s custom datastore is a very specific solution that targets a particular range of Big Data computations. It is by no means as generic and versatile as the MPP database solutions offered by Vertica and Oracle. Nevertheless, this article tries to illustrate that one does not need a large computing cluster to execute particular Big Data computations: just use the most appropriate, smartest solution to solve the problem in an elegant and fast way. Don’t hesitate to contact us if you have any questions related to similr and/or Datablend.

Published at DZone with permission of Davy Suvee, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)