DevOps Zone is brought to you in partnership with:

Gary Sieling is a software developer interested in dev-ops, database technologies, and machine learning. He has a computer science degree from the Rochester Institute of Technology. He has worked on many products in the legal and regulatory industries, having worked on and supported several data warehousing applications. Gary is a DZone MVB and is not an employee of DZone and has posted 62 posts at DZone. You can read more from them at their website. View Full User Profile

Building a full-text index of git commits using lunr.js and Github APIs

05.20.2013
| 5321 views |
  • submit to reddit

Github has a nice API for inspecting repositories – it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full text and faceted search, as there is a mix of free text fields (commit messages, code) and enumerable fields (committers, dates, committer employers). Github APIs return JSON, which has the nice property of resembling a tree structure – results can be recursed over without fear of infinite loops. Note that to download the entire commit history for a repository, you need to page through it by sha hash. The API I use here lacks diffs, which must be retrieved elsewhere.

To test this, access a URL like so. The configurable arguments are the repository owner and name fields.
https://api.github.com/repos/torvalds/linux/commits

This is what a commit looks like:

{
  "sha": "7638417db6d59f3c431d3e1f261cc637155684cd",
  "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7638417db6d59f3c431d3e1f261cc637155684cd",
  "author": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "schacon@gmail.com"
  },
  "committer": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "schacon@gmail.com"
  },
  "message": "my commit message",
  "tree": {
    "url": "https://api.github.com/repos/octocat/Hello-World/git/trees/827efc6d56897b048c772eb4087f854f46256132",
    "sha": "827efc6d56897b048c772eb4087f854f46256132"
  },
  "parents": [
    {
      "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7d1b31e74ee336d15cbd21741bc88a537ed063a0",
      "sha": "7d1b31e74ee336d15cbd21741bc88a537ed063a0"
    }
  ]
}

To make the test simple, I download these as JSON locally, then start a python webserver. Were I to make many such calls on a public site, I’d set up a proxy to the github APIs.

python -m SimpleHTTPServer

This data has a number of nested objects and must be flattened to fit into the lunr.jsfull-text index. This example uses the commit number (0, 1, 2..N) as the location in the index, but a real environment should use the commit hash to allow partitioning the ingestion process. Nested objects are flattened by joining subsequent keys with underscores in between. A production-worthy solution needs to escape these to prevent collisions.

var documents = [];
 
function recurse(doc_num, base, obj, value) {
  if ($.isPlainObject(value)) {
    $.each(value, function (k, v) {
      recurse(doc_num, base + obj + "_", k, v);
    });
  } else {
    process(doc_num, base + obj, value);
  }
}
 
function process(doc_num, key, value) {
  if (documents.length <= doc_num)
    documents[doc_num] = {};
 
  if (value !== null)
    documents[doc_num][key] = value + '';
}
 
$.each(data, function(doc_num, commit) {
  $.each(commit, function(k, v) {
    recurse(doc_num, '', k, v)
  });
});

Normally, one sets up a lunr full-text index by specifying all the fields, much like Solr’s numerous XML config files. Lunr doesn’t have nearly as many configuration options, since you only specify the ‘boost’ parameter to increase the value of certain fields in ranking. I imagine this will change as the project grows, at the very least to include type hints.

Given the simplicity of field objects, you can infer infer the field list from JSON payloads. The code below provides two modes, one where you inspect the entire JSON payload, or one where you limit how many commits you check, a good option when JSON data is consistent.

The function accepts configuration objects resembling ExtJS config objects, which lets you override as desired. If fields derived from existing data are required, they can be inserted after any documents are inserted.

function inferIndex(documents, config) {
  return lunr(function() {
    this.ref('id');
    var found = {};
    var idx = this;
 
    $.each(documents,
      function(doc_num, doc) {
 
        if (config &&
            config.limit &&
            config.limit < doc_num)
          return;
 
        $.each(doc, function(k, v) {
          if (!found[k]) {
            if (config && config[k]) {
              idx.field(k, config[k]);
            } else {
              idx.field(k);
            }
            found[k] = true;
          }
        });
    });
  });
}
 
var index =
  inferIndex(documents,
    {limit: 1,
     'commit_author_name':{boost:10}});

Inserting flattened documents into the index becomes simple. The method below provides a callback, should you desire to add calculated fields fields.

$.each(documents,
  function(doc_num, attrs, doc_cb) {
    var doc =
      $.extend(
        {id: doc_num}, attrs);
 
    if (doc_cb) {
      doc = doc_cb(doc);
    }
 
    index.add(doc);
});

At this point we’ve indexed the entire commit history from a git repository, which lets us search for commits by topic. While this is useful, it’d be really nice to be able to facet on fields, which would return the number of documents in a category, like a SQL group by. I’ve found it particularly convenient to facet on author, date, or author’s company.

If you have access to the original documents, you can easily construct facets based on the results of a lunr search:

function facet(index, query, data, field) {
  var results = index.search(query);
 
  var facets = {};
  $.each(results, function(index, searchResult) {
    var doc = data[searchResult.ref];
 
    facets[doc[field]] =
      (facets[doc[field]] === undefined ? 0 :
      facets[doc[field]]) + 1; } );
 
  return facets;
}

Commit messages in repositories where I work often contain names of clients who requested a feature or bug fix. Consequently doing a search faceted by author provides a list of who worked with each client the most – this can also tell you who has worked with various pieces of technology.

The following query demonstrates this technique:

var facets =
   facet(index,
        'driver',
        documents,
        'commit_author_name');
{"Wolfram Sang":24,"Linus Torvalds":3}

The approach shown here works well, but requires retrieving results requires access to the original document data. If we want to filter the results to a category, we need a richer search API than lunr currently provides, as well as callback options within the search API. In Solr there are also options to skip lower-casing data, as that may be inappropriate for category titles. Mitigating these issues will be explored further in future essays.

Published at DZone with permission of Gary Sieling, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)