Doug has been engrossed in programming since his parents first bought him an Apple IIe computer in 4th grade. Throughout his early career, Doug proved his flexibility and ingenuity in crafting solutions in a variety of environments. Doug’s most recent work has been in the telecom industry developing tools to analyze large amounts of network traffic using C++ and Python. Doug loves learning and synthesizing this knowledge into code and blog articles. Doug is a DZone MVB and is not an employee of DZone and has posted 36 posts at DZone.

Search Is Eating The World

05.07.2013

Much of the crew just got back from Lucene Revolution. It was an incredible experience to hang out with the cream-of-the-crop of the Lucene/Solr community. It continues to be clear that modern applications of all stripes are increasingly driven by search as the primary UI component. Users of these applications expect rich interactivity. And because search is becoming smarter and smarter, search is becoming the centerpiece for interacting with big data and complex applications.

Search As The Primary UI

Google and Siri have trained us all to expect and work with smart search as a primary user interface. One fascinating example of this at Lucene Revolution was ADP’s HR system that forgoes a user interface in its entirety in favor of using search to understand verb/noun pairs of actions, presenting users with search results that reflect actions that can be taken in an HR system. “Hire John Smith” comes back with “Onboard John Smith” as the most relevant action, with perhaps actions for the unfortunately named “John Hire” being less relevant for the search result.

As with Google and Siri, this product reflects our changing expectations when interacting with even everyday applications. We no longer see computing resources as strict executors of specific commands. Rather, we expect fuzzy understanding and inference of what we mean. In other words, we like search interfaces, but expect search to be more than just text search. We want search to be backed by intelligence: machine learning and natural language processing. We want it to be user-centric and focused on our needs.

Once you commit to focusing on search as your user-interface component, you commit yourself to enriching search with other systems. An obvious example of this paradigm is in Big Data. As we believe at OpenSource Connections, search is the most accessible mechanism for working with Big Data. However, once you commit to search as your primary UI component, you must find ways to deal with the results of whatever data science you might be applying to your data set. The “smarts” circle back to the friendly, user-facing search box, creating the richest possible experience for exploring this data.

Figure: LucidWorks’ Big Data system, a reference architecture in which search is the central UI component for a Big Data set enriched by machine learning and OpenNLP.

Grant Ingersoll and Ted Dunning presented a reference architecture that captures many of the pieces of this idea. One can use Solr as the primary method of exploring data. Feedback from machine learning/batch processing of data can enrich search results by simply adding/modifying a field. Once in place, features of full-text search can take over.
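
As a minimal sketch of what "adding/modifying a field" could look like, the snippet below builds a payload in Solr's atomic-update style (the "set" modifier) from the output of a batch job. The field name `cluster_id`, the document ids, and the helper name are assumptions made for illustration; they do not come from the talk. The snippet only builds the JSON body and does not contact a Solr server.

```python
import json


def build_atomic_updates(cluster_by_doc_id, field="cluster_id"):
    """Build one atomic-update entry per document, setting a single field.

    `field` is a hypothetical field name; the schema must define it.
    """
    return [
        {"id": doc_id, field: {"set": cluster}}
        for doc_id, cluster in sorted(cluster_by_doc_id.items())
    ]


# Hypothetical output of a clustering job: document id -> cluster label.
cluster_assignments = {"doc-2": 7, "doc-1": 3}

payload = build_atomic_updates(cluster_assignments)
body = json.dumps(payload)  # JSON body that would be sent to the update handler
```

Once the field is indexed, the ordinary search features described here (filtering, faceting) apply to it like any other field.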

Solr As The Ideal Data Structure

Why is a search engine like Solr the ideal means of exploring all this data science? Solr places few constraints on what can be indexed; the default is to index everything. Databases (NoSQL or otherwise), however, require us to think carefully about the narrow subset of columns we will index for later lookup. If you later look up on a non-indexed column, you’ll unknowingly create a performance mess, bogging down the system as the entire column is linearly scanned for the data you’re looking for.

Solr is different. The inverted index data structure is written from an index-first point of view. Fields are frequently even indexed without being stored, the actual storage being unimportant or done elsewhere. The ability to look up anything clearly allows us to perform full-text search on any field, but in the context of Big Data, faceted navigation is where search really shines. As demonstrated in Trey Grainger’s talk, facets offer a very broad way of breaking down counts of values in a field. This simple tool often provides surprising analytic capabilities. Nothing special is required other than that the field be indexed: no sweat in Solr.
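
To make the index-first idea concrete, here is a toy sketch of an inverted index and a facet count in plain Python. It is not how Lucene or Solr implement either feature; the sample documents, field names, and function names are all made up for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs, field):
    """Map each lowercased token in `field` to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in str(doc.get(field, "")).lower().split():
            index[token].add(doc_id)
    return index


def facet_counts(docs, doc_ids, field):
    """Count how often each value of `field` occurs among the given documents."""
    return Counter(docs[d][field] for d in doc_ids if field in docs[d])


# Hypothetical documents.
DOCS = {
    "a": {"text": "Solr facets are fast", "language": "English"},
    "b": {"text": "facets count values", "language": "English"},
    "c": {"text": "搜索 facets", "language": "Chinese"},
}

index = build_inverted_index(DOCS, "text")
matching = index["facets"]  # lookup by term, with no scan over the documents
by_language = facet_counts(DOCS, matching, "language")
```

The lookup goes from term to documents, so any indexed field can be searched or counted without touching documents that do not match.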

A simple example of this capability is enriching Solr documents with the output of clustering algorithms. Once each document has an indexed field indicating its cluster, one can explore the nature of these clusters extremely easily in Solr. First, users view a breakdown of clusters in facets, with their respective counts. Users can then filter by cluster value, viewing how other facets are broken down after the filter. For example, we may discover in our document set that the strongest clusters form around specific natural languages. As we filter on a facet, suddenly only “Chinese” remains in the natural language facet, with other natural languages returning zero documents. Doing a similar thing with a traditional database (SQL or NoSQL) would require many columns to be indexed, something seen as onerous by most data modelers.
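
The exploration described above can be sketched as a pair of Solr requests: one that facets on everything, and one that adds a filter on a cluster. The parameters `q`, `rows`, `facet`, `facet.field`, and `fq` are standard Solr request parameters; the field names `cluster_id` and `language` and the helper function are assumptions for illustration. The code only builds query strings, and it does not escape special characters in filter values.

```python
from urllib.parse import urlencode


def facet_query(facet_fields, filters=None, q="*:*"):
    """Build a query string that asks for facet counts and no document rows."""
    params = [("q", q), ("rows", 0), ("facet", "true")]
    params += [("facet.field", name) for name in facet_fields]
    # Each filter narrows the result set before the facets are counted.
    params += [("fq", f"{name}:{value}") for name, value in (filters or {}).items()]
    return urlencode(params)


# Step 1: see how documents break down by cluster and by language.
overview = facet_query(["cluster_id", "language"])

# Step 2: filter on one (hypothetical) cluster and look at the language facet again.
drilled = facet_query(["cluster_id", "language"], filters={"cluster_id": 3})
```

Each string would be appended to the select handler URL of a collection; comparing the language counts between the two responses shows what the cluster formed around.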

The Future Of Solr

In his keynote, Yonik pointed out future goals of Solr. With SolrCloud, Solr looks more like a NoSQL solution with search baked into its bones. As Mark Miller said in his talk, “Solr started with search and backed into the storage problem; other solutions started with a storage problem and are trying to back into search.” More and more, folks are seeing Solr as a primary data store. It’s friendlier for analysis than databases for most users and is increasingly doing a better job of being a true distributed storage engine.

Enriching the database features of Solr also includes adding more and different types of join functionality. Data can’t always be denormalized without many annoying side-effects and limitations. Solr is well suited to provide very sophisticated joining capabilities across documents, including potentially adding fuzzier/natural-language joins. Perhaps relevancy from the original query could be brought to bear on the boosting of documents from the second query. This is something I covered in my talk: searching databases of legal jargon with layman’s terms (car -> motor-vehicle). Once we’ve arrived at the top 5 most relevant pieces of jargon from a database, we can then search laws with the technical terms that actually exist in the jargony corpus of law.
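
The two-query idea can be sketched with a toy word-overlap ranker standing in for real relevancy scoring. Nothing here reflects the implementation from the talk; the sample jargon entries, laws, and function names are invented for illustration.

```python
def top_matches(query, corpus, limit):
    """Return the keys of the `limit` corpus entries sharing the most words with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.lower().split())), key) for key, text in corpus.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [key for _, key in ranked[:limit]]


def two_stage_search(query, jargon, laws, top_n=5):
    """First map layman's words to jargon terms, then search the laws with those terms."""
    terms = top_matches(query, jargon, top_n)
    second_query = " ".join(terms)
    return terms, top_matches(second_query, laws, limit=10)


# Hypothetical data: technical term -> layman's words, and law id -> law text.
JARGON = {
    "motor vehicle": "car truck automobile",
    "domicile": "home house residence",
}
LAWS = {
    "law-1": "operation of a motor vehicle while impaired",
    "law-2": "establishing domicile for tax purposes",
}

terms, hits = two_stage_search("car accident", JARGON, LAWS)
```

In this sketch the first-stage scores are discarded; carrying them into the second query as boosts is the refinement suggested above.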

More and more, search is the centerpiece rather than an add-on. And people want more than just search. The future is very exciting as adjacent technologies like machine learning and natural-language processing become regular components of average applications. We’re truly living in a smarter world where search is the ear and the mouth of the computer!

Published at DZone with permission of Doug Turnbull, author and DZone MVB. (source)