
Davy Suvee is the founder of Datablend. He is currently working as an IT Lead/Software Architect in the Research and Development division of a large pharmaceutical company. Required to work with big and unstructured scientific data sets, Davy gathered hands-on expertise and insights in the best practices on Big Data and NoSQL. Through Datablend, Davy aims at sharing his practical experience within a broader IT environment. Davy is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone.

Circos: An Amazing Tool for Visualizing Big Data

03.13.2012

Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in a way that lets you easily gain deeper insights is where things really start to get interesting. Lately, I've been exploring various options for visualizing (directed) graphs, including Circos. Circos is an amazing software package that visualizes your data through a circular layout. Although it was originally designed for displaying genomic data, it lets you create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go. The figure below illustrates the core concept behind Circos. The table's columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments. The width of each ribbon is proportional to the value in its cell.

 

circos

 

When visualizing a directed graph, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table. In my case, I want to better understand the flow of visitors to and within the Datablend site and blog: where visitors come from (direct, referral, search, ...) and how they navigate between pages. The rest of this article details how to 1) retrieve the raw visit information through the Google Analytics API, 2) persist this information as a graph in Neo4J and 3) query and preprocess this data for visualization through Circos. As always, the complete source code can be found on the Datablend public GitHub repository.

 

1. Retrieving your Google Analytics data

Let's start by retrieving the raw Google Analytics data. The Google Analytics data API provides access to all dimensions and metrics that can be queried through the web application. In my case, I'm interested in retrieving the previous page path property for each page view. If a visitor enters through a page outside of the Datablend website, the previous page path is marked as (entrance). Otherwise, it contains the internal path. We will use Google's Java Data API to connect and retrieve this information. We are particularly interested in the pagePath, pageTitle, previousPagePath and medium dimensions, while our metric of choice is the number of pageviews. After setting the date range, the feed of entries that satisfy these criteria can be retrieved. For ease of use, we transform this data to a domain entity and filter/clean the data accordingly. If a visit originates from outside the Datablend website, we store the specific medium (direct, referral, search, ...) as the previous path.

// Authenticate
analyticsService = new AnalyticsService(Configuration.SERVICE);
analyticsService.setUserCredentials(Configuration.CLIENT_USERNAME, Configuration.CLIENT_PASS);

// Create query (datestring holds the day of interest, formatted as yyyy-MM-dd)
DataQuery query = new DataQuery(new URL(Configuration.DATA_URL));
query.setIds(Configuration.TABLE_ID);
query.setDimensions("ga:medium,ga:previousPagePath,ga:pagePath,ga:pageTitle");
query.setMetrics("ga:pageviews");
query.setStartDate(datestring);
query.setEndDate(datestring);

// Execute the query that was just built
DataFeed feed = analyticsService.getFeed(query.getUrl(), DataFeed.class);

// Iterate and clean
for (DataEntry entry : feed.getEntries()) {
    String pagepath = entry.stringValueOf("ga:pagePath");
    String pagetitle = entry.stringValueOf("ga:pageTitle");
    String previouspagepath = entry.stringValueOf("ga:previousPagePath");
    String medium = entry.stringValueOf("ga:medium");
    long views = entry.longValueOf("ga:pageviews");
    // Filter the data
    if (filter(pagepath) && filter(previouspagepath) && (!clean(previouspagepath).equals(clean(pagepath)))) {
        // Check criteria are satisfied
        Navigation navigation = new Navigation(clean(previouspagepath), clean(pagepath), pagetitle, date, views);
        if (navigation.getSource().equals("(entrance)")) {
            // In case of an entrance, save its medium instead
            navigation.setSource(medium);
        }
        navigations.add(navigation);
    }
}
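
The snippet above relies on a Navigation entity and on the filter and clean helpers, none of which are shown in this article. The following is a minimal sketch of what they could look like, not the code from the Datablend repository. The filtering rule (skip missing or "(not set)" paths), the cleaning rule (drop a trailing slash) and the toNavigation helper are assumptions made purely for illustration.

```java
import java.util.Date;
import java.util.Optional;

// Hypothetical stand-ins: the real Navigation, filter and clean are in the Datablend repository.
class NavigationSketch {

    /** Minimal version of the Navigation entity: one source -> target step with a view count. */
    static class Navigation {
        private String source;
        private final String target;
        private final String targetTitle;
        private final Date date;
        private final long amount;

        Navigation(String source, String target, String targetTitle, Date date, long amount) {
            this.source = source;
            this.target = target;
            this.targetTitle = targetTitle;
            this.date = date;
            this.amount = amount;
        }

        String getSource() { return source; }
        void setSource(String source) { this.source = source; }
        String getTarget() { return target; }
        String getTargetTitle() { return targetTitle; }
        Date getDate() { return date; }
        long getAmount() { return amount; }
    }

    // Assumed rule: keep a path unless it is missing or reported as "(not set)".
    static boolean filter(String path) {
        return path != null && !path.isEmpty() && !path.equals("(not set)");
    }

    // Assumed rule: drop a trailing slash so "/blog/" and "/blog" end up as the same path.
    static String clean(String path) {
        boolean trailingSlash = path.length() > 1 && path.endsWith("/");
        return trailingSlash ? path.substring(0, path.length() - 1) : path;
    }

    // Mirrors the loop body above: empty when the entry is filtered out or is a self-navigation.
    static Optional<Navigation> toNavigation(String medium, String previousPagePath, String pagePath,
                                             String pageTitle, Date date, long views) {
        if (!filter(pagePath) || !filter(previousPagePath)
                || clean(previousPagePath).equals(clean(pagePath))) {
            return Optional.empty();
        }
        Navigation navigation = new Navigation(clean(previousPagePath), clean(pagePath), pageTitle, date, views);
        if (navigation.getSource().equals("(entrance)")) {
            // An entrance has no internal previous page, so the medium becomes the source
            navigation.setSource(medium);
        }
        return Optional.of(navigation);
    }

    public static void main(String[] args) {
        Navigation n = toNavigation("referral", "(entrance)", "/?p=1146/", "Some title", new Date(), 3).get();
        System.out.println(n.getSource() + " -> " + n.getTarget() + " x" + n.getAmount());
    }
}
```

Returning an Optional keeps the "skip this entry" decision in one place, so the calling loop only has to add the navigations that are present.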

 

2. Storing navigational data as a directed graph in Neo4J

The set of site navigations can easily be stored as a directed graph in the Neo4J Graph Database. Nodes are site paths (or mediums), while relationships are the navigations themselves. We start by retrieving the navigations for a particular date range and retrieve (or lazily create) the nodes representing the source and target paths (or mediums). Next, we de-normalize the pageviews metric (for instance, 6 individual relationships are created for 6 page views). Although this de-normalization step is not strictly required, I did so to make sure that the degree of my nodes is correct should I perform other types of calculations. For each individual navigation relationship, we also store the date of visit.

// Retrieve navigations for a particular date
List<Navigation> navigations = retrieval.getNavigations(date);

// Save them in the graph database
Transaction tx = graphDb.beginTx();
try {
    // Iterate and create
    for (Navigation nav : navigations) {
        Node source = getPath(nav.getSource());
        Node target = getPath(nav.getTarget());
        if (!target.hasProperty("title")) {
            target.setProperty("title", nav.getTargetTitle());
        }
        for (long i = 0; i < nav.getAmount(); i++) {
            // Duplicate relationships: one per page view
            Relationship transition = source.createRelationshipTo(target, Relationships.NAVIGATION);
            transition.setProperty("date", date.getTime()); // Save time as long
        }
    }
    // Mark the transaction as successful
    tx.success();
} finally {
    // Commit (or roll back if success() was never reached)
    tx.finish();
}
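
To make the effect of the de-normalization step concrete, here is a small in-memory sketch that does not use the Neo4J API at all. It mimics the lazy getPath lookup with a plain map and creates one edge per page view, so that a node's degree equals its traffic. The class and method names (InMemoryGraphSketch, addNavigations, outDegree, degree) are illustrative assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the graph database; this is not the Neo4J API.
class InMemoryGraphSketch {

    static class Node {
        final String path;
        final List<Edge> outgoing = new ArrayList<>();
        final List<Edge> incoming = new ArrayList<>();

        Node(String path) { this.path = path; }
    }

    static class Edge {
        final Node source;
        final Node target;
        final long date;

        Edge(Node source, Node target, long date) {
            this.source = source;
            this.target = target;
            this.date = date;
        }
    }

    // Plays the role of the path index: one node per distinct path or medium
    private final Map<String, Node> index = new HashMap<>();

    // Retrieve the node for a path, creating it on first use
    Node getPath(String path) {
        return index.computeIfAbsent(path, Node::new);
    }

    // De-normalize: a navigation seen 'amount' times becomes 'amount' separate edges
    void addNavigations(String source, String target, long amount, long date) {
        Node s = getPath(source);
        Node t = getPath(target);
        for (long i = 0; i < amount; i++) {
            Edge edge = new Edge(s, t, date);
            s.outgoing.add(edge);
            t.incoming.add(edge);
        }
    }

    int outDegree(String path) {
        Node node = index.get(path);
        return node == null ? 0 : node.outgoing.size();
    }

    int degree(String path) {
        Node node = index.get(path);
        return node == null ? 0 : node.outgoing.size() + node.incoming.size();
    }

    public static void main(String[] args) {
        InMemoryGraphSketch graph = new InMemoryGraphSketch();
        graph.addNavigations("referral", "/?p=1146", 6, 0L);
        graph.addNavigations("/?p=1146", "/?p=164", 2, 0L);
        System.out.println("referral out-degree: " + graph.outDegree("referral"));
        System.out.println("/?p=1146 degree: " + graph.degree("/?p=1146"));
    }
}
```

With a single weighted edge per navigation, the same numbers would have to be computed by summing a weight property instead of simply counting relationships.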

 

3. Creating the Circos tabular data format

The Circos tabular data format is quite easy to construct. It's basically a tab-delimited file with row and column headers. A cell is interpreted as a value that flows from the row entity to the column entity. We will use the Neo4J Cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period. Doing so allows us to create historical visualizations of our navigations and observe how visit flow behaviors are changing over time.

// Access the graph database
graphDb = new EmbeddedGraphDatabase("var/analytics");
engine = new ExecutionEngine(graphDb);

// Prepare the parameters for the date range cypher query
Map<String, Object> params = new HashMap<String, Object>();
params.put("fromdate", from.getTime());
params.put("todate", to.getTime());
// Execute the query
ExecutionResult result = engine.execute("START sourcepath=node:index(\"path:*\") " +
                                        "MATCH sourcepath-[r]->targetpath " +
                                        "WHERE r.date >= {fromdate} AND r.date <= {todate} " +
                                        "RETURN sourcepath,targetpath",
                                        params);

 

Next, we create the tab-delimited file itself. We iterate through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we build the two-dimensional array by aggregating (i.e. summing) the number of navigations between each pair of source and target paths. At the end, we filter this occurrence matrix on the minimal number of required navigations: cells below the threshold are set to zero. This ensures that we will only create segments for paths that are relevant in the total population. As a final step, we print the occurrence matrix as a tab-delimited file. For each path, we use a shorthand, as the Circos renderer seems to have problems with long string identifiers.

// Retrieve the results
Iterator<Map<String, Object>> it = result.javaIterator();
List<Navigation> navigations = new ArrayList<Navigation>();
Map<String, String> titles = new HashMap<String, String>();
Set<String> paths = new HashSet<String>();
        
// Iterate the results
while (it.hasNext()) {
    Map<String, Object> record = it.next();
    String source = (String) ((Node) record.get("sourcepath")).getProperty("path");
    String target = (String) ((Node) record.get("targetpath")).getProperty("path");
    String targettitle = (String) ((Node) record.get("targetpath")).getProperty("title");
    // Reuse the Navigation object as a temporary holder
    navigations.add(new Navigation(source, target, targettitle, new Date(), 1));
    paths.add(source);
    paths.add(target);
    if (!titles.containsKey(target)) {
        titles.put(target, targettitle);
    }
}

// Retrieve the various paths; the list position of a path is its matrix index
List<String> pathids = new ArrayList<String>(paths);
// Create the matrix that holds the info
int[][] occurences = new int[pathids.size()][pathids.size()];

// Iterate through all the navigations and update accordingly
for (Navigation navigation : navigations) {
    int sourceindex = pathids.indexOf(navigation.getSource());
    int targetindex = pathids.indexOf(navigation.getTarget());
    occurences[sourceindex][targetindex] = occurences[sourceindex][targetindex] + 1;
}

// Matrix built, filter on threshold
for (int i = 0; i < occurences.length; i++) {
    for (int j = 0; j < occurences[i].length; j++) {
        if (occurences[i][j] < threshold) {
            occurences[i][j] = 0;
        }
    }
}

// Print
printCircosData(pathids, titles, occurences);

 

The text below is a sample of the output generated by the printCircosData method. It first prints the legend (matching shorthands with actual paths). Next it prints the tab-delimited Circos table.

link0 - /?p=411/wp-admin - Storing and querying RDF data in Neo4J through Sail - Datablend
link1 - /?p=1146 - Visualizing RDF Schema inferencing through Neo4J, Tinkerpop, Sail and Gephi - Datablend
link2 - /?p=164 - Big Data / Concise Articles - Datablend
link3 - referral - null
link4 - /?p=1400 - The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework - Datablend
...

data	l0	l1	l2	l3	l4	...
l0	0	0	0	0	0	
l1	0	0	0	0	0	
l2	0	0	0	0	0	
l3	0	594	0	0	197	
l4	0	0	0	0	0 
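
The printCircosData method itself is not listed in this article. Below is a minimal sketch of how output in the format shown above could be produced. It is an assumption based only on that sample (a "linkN - path - title" legend followed by a tab-delimited table with "lN" headers), not the method from the Datablend repository, and the formatCircosData helper name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical re-creation of the output format shown in the sample above.
class CircosTableSketch {

    // Builds the legend and the tab-delimited table as a single string
    static String formatCircosData(List<String> pathids, Map<String, String> titles, int[][] occurences) {
        StringBuilder out = new StringBuilder();

        // Legend: shorthand, actual path and page title (prints "null" when no title is known)
        for (int i = 0; i < pathids.size(); i++) {
            String path = pathids.get(i);
            out.append("link").append(i).append(" - ").append(path)
               .append(" - ").append(titles.get(path)).append('\n');
        }
        out.append('\n');

        // Header row: one shorthand per column
        out.append("data");
        for (int j = 0; j < pathids.size(); j++) {
            out.append('\t').append('l').append(j);
        }
        out.append('\n');

        // One row per source path; each cell is the flow from the row to the column
        for (int i = 0; i < occurences.length; i++) {
            out.append('l').append(i);
            for (int j = 0; j < occurences[i].length; j++) {
                out.append('\t').append(occurences[i][j]);
            }
            out.append('\n');
        }
        return out.toString();
    }

    static void printCircosData(List<String> pathids, Map<String, String> titles, int[][] occurences) {
        System.out.print(formatCircosData(pathids, titles, occurences));
    }

    public static void main(String[] args) {
        List<String> pathids = Arrays.asList("referral", "/?p=1146");
        Map<String, String> titles = new HashMap<>();
        titles.put("/?p=1146", "Some post title");
        int[][] occurences = { {0, 594}, {0, 0} };
        printCircosData(pathids, titles, occurences);
    }
}
```

Building the text in a StringBuilder first makes it easy to write the same content to a file for upload instead of printing it to the console.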

 

4. Using the power of Circos

Although Circos can be installed on your local computer, we will use its online version to create the visualization of our data. Upload your tab-delimited file and just wait a few seconds before enjoying the beautiful rendering of your site's navigation information.

circos

At a glance, we can already see that the l3-segment (i.e. the referrals) is significantly larger (almost 6000 navigations) than the other segments. The three outer rings visualize the total number of navigations leaving and entering a particular path. In the case of referrals, no navigations have this path as target (indicated by the empty middle ring). Its total segment count (inner ring) is built up entirely out of navigations that have a referral as source. The l6-segment seems to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my "The joy of algorithms and NoSQL: a MongoDB example"-article. Most of its traffic is received through referrals, while a decent amount is also generated through direct (l17-segment) and search (l27-segment) traffic. The l15-segment (my blog's main page) is the only path that receives an almost equal amount of incoming and outgoing traffic.
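
The incoming and outgoing totals per segment can be sanity-checked directly from the table. The sketch below assumes, following the row-to-column interpretation described earlier, that a row sum is the outgoing traffic of a path and a column sum its incoming traffic. The matrix holds only the 5x5 excerpt of the sample output shown earlier, so the totals are partial and do not add up to the full figures quoted above. The class and method names are illustrative.

```java
// Row sums = navigations leaving a path, column sums = navigations entering it.
class SegmentTotalsSketch {

    static int outgoing(int[][] matrix, int segment) {
        int sum = 0;
        for (int value : matrix[segment]) {
            sum += value;
        }
        return sum;
    }

    static int incoming(int[][] matrix, int segment) {
        int sum = 0;
        for (int[] row : matrix) {
            sum += row[segment];
        }
        return sum;
    }

    public static void main(String[] args) {
        // The 5x5 excerpt of the sample table (l0..l4); the full table is larger
        int[][] excerpt = {
            {0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0},
            {0, 594, 0, 0, 197},
            {0, 0, 0, 0, 0}
        };
        for (int i = 0; i < excerpt.length; i++) {
            System.out.println("l" + i + "\tout=" + outgoing(excerpt, i) + "\tin=" + incoming(excerpt, i));
        }
    }
}
```

For the excerpt, the referral row (l3) has only outgoing traffic and an incoming total of zero, which matches the empty ring described for that segment.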

With just a few tweaks to the Circos input data, we can easily focus on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.

circos

5. Conclusions

In the era of Big Data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.

Published at DZone with permission of Davy Suvee, author and DZone MVB. (source)
