
Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member, for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 49 posts at DZone.

Fun with Lucene's Faceted Search Module

12.12.2012

These days faceted search and navigation is common and users have come to expect and rely upon it. 

Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I'm sure there are more...

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search. 
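To make the complements idea concrete, here is a minimal sketch in plain Python (not the Lucene API). It assumes one facet value per document and a precomputed table of index-wide counts; the names `facet_counts`, `doc_facets` and `total_counts` are hypothetical. When the hits cover more than half of the index, it visits the fewer non-matching documents and subtracts them from the totals.

```python
from collections import Counter

def facet_counts(doc_facets, hit_ids, total_counts=None):
    """Count facet values for the hits, using complements when that is cheaper."""
    num_docs = len(doc_facets)
    hits = set(hit_ids)
    if total_counts is not None and len(hits) > num_docs / 2:
        # Complement path: start from index-wide totals and subtract
        # the facet value of every document that did NOT match.
        counts = Counter(total_counts)
        for doc_id in range(num_docs):
            if doc_id not in hits:
                counts[doc_facets[doc_id]] -= 1
        return +counts  # unary plus drops values whose count fell to zero
    # Direct path: visit each hit once.
    return Counter(doc_facets[doc_id] for doc_id in hits)

# Toy index: one facet value (a year) per document.
doc_facets = ["2012", "2012", "2011", "2010", "2012", "2011"]
total_counts = Counter(doc_facets)
hits = [0, 1, 2, 4, 5]  # 5 of 6 documents match

direct = facet_counts(doc_facets, hits)
via_complement = facet_counts(doc_facets, hits, total_counts)
print(direct == via_complement)
```

Both paths give the same counts; the complement path simply touches one document instead of five in this toy case.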

Lucene's nightly performance benchmarks

I was curious about the performance of faceted search, so I added date facets, indexed as a year/month/day hierarchy, to the nightly Lucene benchmarks. Specifically I added faceting to all TermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page.

Simple performance tests

I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):

     Date (3994646)
       2012 (1990192)
       2011 (752327)
       2010 (380977)
       2009 (275152)
       2008 (271543)
       2007 (211688)
       2006 (98809)
       2005 (12846)
       2004 (1105)
       2003 (7)

It's interesting that 2012 has such a high count, even though this export only includes the first five months and two days of 2012. Wikipedia's pages are very actively edited!
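As a rough illustration of how hierarchical counts like the ones listed above can be produced, here is a small plain-Python sketch (not the facet module's implementation). It assumes each hit carries one Date/year/month/day path; `count_paths` and `children` are hypothetical helper names. Every hit increments its own path and each ancestor, and the default view is just the direct children of the root.

```python
from collections import Counter

def count_paths(paths):
    """Count each facet path together with all of its ancestors."""
    counts = Counter()
    for path in paths:
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts

def children(counts, parent):
    """Return (label, count) for the direct children of parent, largest first."""
    depth = len(parent) + 1
    kids = [(path[-1], count) for path, count in counts.items()
            if len(path) == depth and path[:-1] == parent]
    return sorted(kids, key=lambda item: -item[1])

# Three made-up hits, each with a Date/year/month/day path.
hit_paths = [
    ("Date", "2012", "05", "01"),
    ("Date", "2012", "04", "17"),
    ("Date", "2011", "12", "25"),
]
counts = count_paths(hit_paths)
print(counts[("Date",)])            # total under the root
print(children(counts, ("Date",)))  # per-year counts
```

Drilling down works the same way: asking for the children of `("Date", "2012")` returns the per-month counts.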

The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be. 
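The idea of mapping facets to fixed integer codes can be sketched in a few lines of plain Python. This is only a toy model under my own assumptions, not Lucene's taxonomy writer; the `Taxonomy` class and its fields are hypothetical. Each distinct path gets the next free ordinal the first time it is seen, and a parallel list records each ordinal's parent.

```python
class Taxonomy:
    """Toy taxonomy: assigns every facet path a fixed integer ordinal."""

    def __init__(self):
        self.ordinals = {(): 0}  # the empty path is the root, ordinal 0
        self.parents = [-1]      # parents[ordinal] is the parent's ordinal

    def add(self, path):
        """Return the ordinal for path, adding it and any missing ancestors."""
        path = tuple(path)
        if path in self.ordinals:
            return self.ordinals[path]
        parent = self.add(path[:-1])  # ancestors get ordinals first
        ordinal = len(self.parents)
        self.ordinals[path] = ordinal
        self.parents.append(parent)
        return ordinal

taxo = Taxonomy()
may = taxo.add(("Date", "2012", "05"))
april = taxo.add(("Date", "2012", "04"))
print(may, april, taxo.parents)
```

Because only distinct paths are stored, the structure grows with the number of unique facet values rather than with the number of documents, which is consistent with the tiny size reported above for a date hierarchy.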

Next I compared search performance with and without faceting. A simple TermQuery (party), matching just over a million hits, ran at 51.2 queries per second (QPS) without facets and 3.4 QPS with facets. While this is a somewhat scary slowdown, it's the worst-case scenario: TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!).

I also tested a harder PhraseQuery ("the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because PhraseQuery takes more work to match each hit and generally matches fewer hits. 
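For a back-of-the-envelope feel for these numbers, the QPS figures can be turned into an approximate faceting cost per hit. This is my own rough arithmetic, not part of the benchmark: it assumes exactly 1,000,000 hits for the TermQuery (the text only says "just over a million") and 194,000 for the PhraseQuery, and the function name is hypothetical.

```python
def per_hit_overhead_us(qps_plain, qps_faceted, hits):
    """Extra time per hit, in microseconds, implied by two QPS measurements."""
    extra_seconds = 1.0 / qps_faceted - 1.0 / qps_plain
    return extra_seconds / hits * 1e6

# Hit counts are approximations taken from the text above.
term = per_hit_overhead_us(51.2, 3.4, 1_000_000)
phrase = per_hit_overhead_us(3.8, 2.8, 194_000)
print(f"TermQuery:   ~{term:.2f} us of faceting per hit")
print(f"PhraseQuery: ~{phrase:.2f} us of faceting per hit")
```

Under these assumptions both queries land at a fraction of a microsecond per hit, which fits the observation that faceting cost scales with the number of hits rather than with the cost of the query itself.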

Loading facet data in RAM

For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the CategoryListCache class. I tested this, and it gave nice speedups: the TermQuery was 73% faster (to 6.0 QPS) and the PhraseQuery was 19% faster. 

However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's RamUsageEstimator. We should be able to reduce this RAM usage by switching to Lucene's fast packed ints implementation from the current int[][] it uses today, or by using DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore DocValues and initial results look very promising.
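To show why packed ints can save RAM, here is a toy fixed-width bit-packing sketch in plain Python. It is an illustration of the general technique only, not Lucene's packed ints implementation: the `PackedInts` class here is hypothetical, and a single Python integer stands in for the backing array of longs. Each value is stored in just enough bits to hold the largest value, instead of a full 32-bit int.

```python
class PackedInts:
    """Toy fixed-width bit packing of non-negative integers."""

    def __init__(self, values):
        if any(v < 0 for v in values):
            raise ValueError("only non-negative values can be packed")
        self.size = len(values)
        # Bits needed for the largest value (at least 1).
        self.bits = max(1, max(values, default=0).bit_length())
        self.mask = (1 << self.bits) - 1
        self.packed = 0
        for i, value in enumerate(values):
            self.packed |= value << (i * self.bits)

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return (self.packed >> (i * self.bits)) & self.mask

ordinals = [3, 4, 3, 1, 2]  # made-up facet ordinals
packed = PackedInts(ordinals)
print(packed.bits, [packed[i] for i in range(len(packed))])
```

With small ordinals like these, each entry needs 3 bits rather than 32, which is where the savings over a plain int array would come from.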

Sampling

Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits). 

You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the TermQuery was 248% (2.5X) faster (to 12.2 QPS) and the PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc. 
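The two sampling modes described above can be sketched in plain Python as follows. This is a simplified model under my own assumptions (uniform random sampling with a fixed seed, one facet value per document), not the facet module's sampler; all function names are hypothetical.

```python
import random
from collections import Counter

def sample_hits(hits, ratio=0.01, min_size=1, seed=0):
    """Pick a random subset of the hits (1% by default)."""
    size = min(len(hits), max(min_size, int(len(hits) * ratio)))
    return random.Random(seed).sample(hits, size)

def top_n_two_pass(doc_facets, hits, n, ratio=0.01, seed=0):
    """Use the sample to pick the top n values, then count those exactly."""
    sample = sample_hits(hits, ratio, seed=seed)
    sampled = Counter(doc_facets[d] for d in sample)
    candidates = {label for label, _ in sampled.most_common(n)}
    exact = Counter(doc_facets[d] for d in hits if doc_facets[d] in candidates)
    return exact.most_common(n)

def top_n_estimated(doc_facets, hits, n, ratio=0.01, seed=0):
    """Skip the second pass and scale the sampled counts up instead."""
    sample = sample_hits(hits, ratio, seed=seed)
    if not sample:
        return []
    scale = len(hits) / len(sample)
    sampled = Counter(doc_facets[d] for d in sample)
    return [(label, round(count * scale)) for label, count in sampled.most_common(n)]

# Made-up, lopsided data: 10,000 hits across three years.
doc_facets = ["2012"] * 6000 + ["2011"] * 3000 + ["2010"] * 1000
hits = list(range(len(doc_facets)))

exact_top = top_n_two_pass(doc_facets, hits, n=2)
approx_all = top_n_estimated(doc_facets, hits, n=3)
print(exact_top)
print(approx_all)
```

The two-pass version still walks every hit in its second pass, which is one plausible reason it does not help on a narrow, lopsided taxonomy like dates; the estimated version only ever touches the sample.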

Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.

Published at DZone with permission of Michael Mccandless, author and DZone MVB. (source)
