
Gary Sieling is a software developer interested in dev-ops, database technologies, and machine learning. He has a computer science degree from the Rochester Institute of Technology and has worked on and supported several data warehousing applications in the legal and regulatory industries. Gary is a DZone MVB.

Identifying Important Keywords with Lunr.js and the Blekko API

06.24.2013

Lunr.js is a simple full-text search engine written in JavaScript. Full-text search ranks the documents returned by a query by how closely they resemble the query, based on word frequency – frequently occurring words have minimal effect, whereas a rare word occurring several times in a document boosts its ranking significantly. This hearkens back to the days of ’90s search engines, when keyword stuffing was a valuable SEO tactic.

The ranking formula used in full-text search is called tf-idf, which stands for term frequency / inverse document frequency – an indication of how the relevance is computed. This requires the indexing software to measure the frequencies of words within a document, within a query, and across the entire corpus. Lunr.js has a series of internal functions and objects to track word frequency, and is easy to customize:

// Override lunr's idf calculation to pull document frequencies
// from an external source rather than the local token store.
lunr.Index.prototype.idf = function (term) {
  if (this._idfCache[term]) return this._idfCache[term];

  // Original: var documentFrequency = this.tokenStore.count(term)
  var documentFrequency = blekko(term),
      idf = 1;

  if (term === "") documentFrequency = 1;

  if (documentFrequency > 0) {
    idf = 1 + Math.log(this.tokenStore.length / documentFrequency);
  }

  return this._idfCache[term] = idf;
};
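To make the formula concrete, here is a minimal, self-contained sketch of the same idf calculation and how it feeds a tf-idf score – the corpus sizes and counts below are made up for illustration:

```javascript
// idf = 1 + ln(totalDocs / documentFrequency), matching the override above
function idf(totalDocs, documentFrequency) {
  if (documentFrequency <= 0) return 1;
  return 1 + Math.log(totalDocs / documentFrequency);
}

// tf-idf score for a term in one document: in-document count times idf
function tfidf(termCountInDoc, totalDocs, documentFrequency) {
  return termCountInDoc * idf(totalDocs, documentFrequency);
}

// A rare word (in 1 of 100 docs) outweighs a common one (in 90 of
// 100 docs), even at the same in-document count.
var rare = tfidf(3, 100, 1);
var common = tfidf(3, 100, 90);
```

Running this, `rare` comes out far larger than `common`, which is exactly the property the keyword extraction below relies on.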

I thought it’d be interesting to extract word frequencies from a search engine – with only a small number of documents, it’s hard to get good numbers locally. The aim is to show “relevant keywords” for website content – this technique has the nice property of tending to ignore very common words and phrases that have been spammed to death. The code below retrieves the numbers from Blekko’s API – to avoid cross-domain AJAX issues, I run the queries through a proxy.

// Look up how many documents on the web contain a term, via Blekko's
// API (called through a same-origin PHP proxy to avoid cross-domain
// AJAX restrictions). Results are memoized in blekko_cache.
function blekko(query) {
  var result = blekko_cache[query];
  if (result !== undefined) return result;

  $.ajax({
    url: 'http://www.garysieling.com/poc/lunrkw/proxy.php?query=' + encodeURIComponent(query),
    async: false // synchronous, so the idf override can return a value
  }).done(function (data) {
    if (data === "") {
      // Huge sentinel: treat unknown terms as extremely common so they rank low
      result = 1000000000000000000000;
    } else {
      var json = JSON.parse(data);

      result = json.universal_total_results;
      if (result) {
        // Blekko abbreviates counts, e.g. "5M" or "12K"
        result = result.replace('M', '000000');
        result = result.replace('K', '000');
        result = parseInt(result, 10);
      } else {
        result = 1000000000000000000000;
      }
    }
  });

  blekko_cache[query] = result;

  return result;
}
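The suffix expansion above can be factored out into a small pure function – the names here are my own, not from the original code:

```javascript
// Sentinel for missing counts, mirroring the huge fallback value above.
var UNKNOWN_FREQUENCY = Number.MAX_SAFE_INTEGER;

// Expand Blekko-style abbreviated counts ("5M", "12K") into integers.
function parseResultCount(raw) {
  if (!raw) return UNKNOWN_FREQUENCY;
  var expanded = String(raw)
    .replace('M', '000000') // "5M"  -> "5000000"
    .replace('K', '000');   // "12K" -> "12000"
  var n = parseInt(expanded, 10);
  return isNaN(n) ? UNKNOWN_FREQUENCY : n;
}
```

Note that this shares a limitation with the original snippet: a fractional count like “5.2M” would not expand correctly.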

To populate the index, go through these steps:

  • Generate a list of unique words.
  • Collect all uses of each word into one ‘document’.
  • Add each batch of words to the index.

Note also that I removed the stemmer; otherwise the stems of words are sent to Blekko during the ranking process, which skews the results. This technique also has no concept of context – for instance, “D3” is a Cadillac model, a vitamin, a Nikon SLR model, and a JavaScript library.

// Build an index with one 'document' per unique word.
var index = lunr(function () {
  this.field('word');
  this.ref('id');
});

// Remove the stemmer so whole words, not stems, reach Blekko.
index.pipeline.remove(lunr.stemmer);

// Split the text on whitespace, punctuation, and digits.
var items = text.split(/[ ()'{}"\[\].:;+$,0-9-]/);
var words = {};
$.each(items, function (index, word) {
  // Skip empty tokens and very short words.
  if (word.length < 4) {
    return;
  }
  var lword = word.toLowerCase();
  words[lword] = (words[lword] || 0) + 1;
});

var docs = [];
var id = 0;
$.each(words, function (k, v) {
  // Repeat each word count^2 times so frequent words score higher.
  var wordlist = '';
  for (var i = 0; i < v * v; i++) {
    wordlist = wordlist + ' ' + k;
  }

  docs[id] = k;
  index.add({
    id: id++,
    word: wordlist
  });
});
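The boost applied in that loop can be expressed as a standalone helper – a sketch with my own naming, not part of the original code:

```javascript
// Repeat a word count*count times, so that lunr's term-frequency
// component reflects the squared frequency of the word in the source text.
function boostedDocument(word, count) {
  var wordlist = '';
  for (var i = 0; i < count * count; i++) {
    wordlist = wordlist + ' ' + word;
  }
  return wordlist;
}
```

Squaring the count exaggerates the gap between words that appear often and words that appear once, sharpening the final ranking.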

To retrieve results, search the lunr index for everything – normally a blank query returns nothing, so I modified lunr to return all results for an empty query.

// Print the top-ranked keywords, skipping duplicates, words with no
// cached frequency, and adverbs ending in 'ly'.
var printed = {};
var topcnt = 250;
$.each(index.search(""),
  function (i, d) {
    var ref = parseInt(d.ref, 10);
    var word = docs[ref];
    if (printed[word]) return;
    if (blekko_cache[word] === undefined) return;
    if (word.substr(word.length - 2) === 'ly') return;
    if (topcnt < 0) return;

    topcnt--;
    console.log(word + " (" + blekko_cache[word] + ")");
    printed[word] = true;
  }
);

Here’s what the results look like:

hooks (980)
stumbled (968)
doc_num (963)
splits (957)
paints (948)
parameters (939)
indexed (927)
realm (916)
minifies (915)
python (912)
underscores (907)
unrelated (905)
replacements (903)
irrelevant (900)
closures (888)
unfinished (878)
summaries (877)
algorithms (873)
metrics (870)
painters (869)
manipulation (864)
facet (852)
clone (849)
occurrence (843)
defects (840)
brennan (837)
stains (830)
risen (824)
catenate (823)
richer (821)
packets (815)
commits (804)
mock (802)
sorting (775)
documenting (771)
visualization (768)
twitter (762)
recursed (761)
clicked (760)
lends (757)
hacked (755)
listens (747)
folders (745)
variables (742)
encrypted (736)
differs (736)
litigation (733)
tighter (729)
naive (725)
whipped (718)
smoother (714)
numpages (709)
loser (707)
override (703)
bins (693)
protections (691)
exposes (689)
ceramic (677)
programmer (676)
buttongroup (674)
wrapper (661)
facets (652)
oracle (644)
Published at DZone with permission of Gary Sieling, author and DZone MVB. (source)
