Big Data/Analytics Zone is brought to you in partnership with:

Coming from a background of Aerospace Engineering, John soon discovered that his true interest lay at the intersections of information technology and entrepreneurship (and when applicable - math). In early 2011, John stepped away from his day job to take up software consulting. Finally John found permanent employment at Opensource Connections where he currently consults large enterprises about full-text search and Big Data applications. Highlights to this point have included prototyping the future of search with the US Patent and Trademark Office, implementing the search syntax used by patent examiners, and building a Solr search relevancy tuning framework called SolrPanl. John is a DZone MVB and is not an employee of DZone and has posted 23 posts at DZone. You can read more from them at their website. View Full User Profile

The Anatomy Of A DisMax Query

03.09.2013
| 2381 views |
  • submit to reddit

While debugging the functionality of a new query parser, I had the poor fortune rare opportunity to dig deep into Solr’s search scoring code. I ended up learning a ton about how queries are composed and scored, so I though I’d pass the learning on to you so that perhaps you can avoid my fate benefit from my new-found understanding.

Our discussion here will focus upon the composition of queries made through Solr’s DisMax query parser (DisMaxQParser). This was the code that I was largely working with, but it serves as an excellent example because it touches upon all of the most important query types that you might regularly deal with: Boolean queries, DisMax queries, Phrase queries, and Term queries.

The Query

So without further adieu let’s take a look our sample query:

localhost:8983/solr/select
?q=captain james kirk
&defType=edismax
&qf=BodyTitle^1.5
&pf=BodyTitle
&mm=2
&debugQuery=true

(Those who have been following our blog recently may realize that here we are using our favorite demo data set – the SciFi StackExchange data. It’s easy to understand and fun to work with. If you’d like to give it a try, our Indexing StackOverflow in Solr post will get you started in 10 minutes.) Here we’re looking for posts containing captain james kirk (q), we’re using the edismax parser (defType), we’re looking for pure term matches in the Body and Title fields (note the boost on the Title, qf), and we’re looking for phrases matches on the same two fields (pf). We insist that at least 2 of the terms must match (mm), and finally, since we’re interested in understanding how Solr thinks about our query and about scoring, we turn on debug output (debugQuery).

The first thing that is interesting to look at is the parsedquery section of the response. This shows us how Solr breaks down that query we sent it:

+(
  ( DisjunctionMaxQuery((Body:captain | Title:captain^1.5))
    DisjunctionMaxQuery((Body:james | Title:james^1.5))
    DisjunctionMaxQuery((Body:kirk | Title:kirk^1.5)) )~2
)
DisjunctionMaxQuery((Body:"captain james kirk"))
DisjunctionMaxQuery((Title:"captain james kirk"))

Boolean Queries

Here we actually have examples of all the commonly used queries, Boolean, DisjunctionMax, Term, and Phrase. Let’s break it down. At the outer-most level we have a Boolean query with three clauses:

  1. +(( DisjunctionMaxQuery((Body:captain | Title:captain^1.5)) DisjunctionMaxQuery((Body:james | Title:james^1.5)) DisjunctionMaxQuery((Body:kirk | Title:kirk^1.5)) )~2)
  2. DisjunctionMaxQuery((Body:"captain james kirk"))
  3. DisjunctionMaxQuery((Title:"captain james kirk"))

Now you might be asking yourself “Where are the ANDs and ORs?” Well, that’s not quite how Lucene Boolean queries work. Instead, a Lucene Boolean query is composed of a set of clauses that are modified by either MUST match,MUST_NOT match, or SHOULD match. SHOULD match is the default, MUST match is indicated in the above query by a plus at the front of the clause, and MUST_NOT match, though not present in this example, is indicated by a minus at the front of the clause. So for this query, we are insisting that the first clause MUST match in every document returned from our search, the other two SHOULD match, but if they don’t it’s ok (the document just gets a lower score). And as far as scoring goes, each matching clause of a Boolean query is scored and the final score of the Boolean query is the sum of the score of its clauses.

Let’s dig one level deeper on that first clause of the Boolean query. Here we actually find yet another Boolean query, and it has three SHOULD match clauses (these clauses are all DisjunctionMaxQueries… hold your horses, we’re getting there). One interesting thing to note about this Boolean query is the ~2 at the end of that query. It’s there because you earlier specified mm=2 in the original query. This ~2 indicates that of the three optional SHOULD match subqueries, we are insisting that at least two of these queries actually match, otherwise we disregard that document.

DisMax Queries

Let’s now take a closer look at the first actual DisMax query:DisjunctionMaxQuery((Body:captain | Title:captain^1.5)). DisjunctionMax is a funny name… but no one’s laughing. It means that we should search for the term in question across the disjunction of the possible fields, and return themax score from all these sub-queries. So in this case, if captain matched in the Title and matched in Body, we would find the score in both cases and return the maximum of these two as the score for the DisMax clause

Now why would you want to return only the max of the sub-query scores? Consider an example: Let’s say you want to search for harry potter over the Title and Body fields. If we take the sum of the subqueries instead of the max, then a document that has “Harry Potter” in the Title but not in the Body will have the same score as a document that happens to have “Harry” in both Title and Body, but “Potter” nowhere. This is not what you want! If maximization still seems too harsh for your use case, then look up the “tie” parameter which allows you to do something between maximization and summing of clause scores.

Term Queries

Going one level deeper now, let’s look at the first clause: Body:captain. Looks pretty simple, right? Were just looking for captain in the Body field. Actually, this is one of the most complex portions of the query. The score of this query is dependent upon several thing, among them:

  • How common the term captain is in this document (Term Frequency)
  • How common the term captain is across all documents (Document Frequency)
  • How large the field is in this document (Field Norm)
  • How much this query is boosted

This is actually a very interesting topic, and possibly worth a future post by itself. However, for the time being, I will just refer you to the very well documented TFIDFSimilarity class that covers these details.

Phrase Queries

This leaves one final piece, the Phrase query. Traversing back up the query parse tree, the next top level Boolean sub-clause is a DisMax query with a single clause: DisjunctionMaxQuery((Body:"captain james kirk")). As a DisMax query, this is a somewhat degenerate case because, as you’ll recall, DisMax takes the maximum score of all its sub-clauses; in this case there is only one clause. Regardless, it’s the sub-clause that we’re interested in right now:Body:"captain james kirk"; This is a Phrase query. Basically it seeks all the documents that contain all of the specified terms and then throws away the documents in which these terms are not adjacent to one another and in this order. Scoring proceeds in a manner analogous to Term queries in that the score is proportional to the number of occurrence of the phrase in this document and inversely proportional to the number of occurrences of the these terms across the whole corpus.

Holistic Understanding Of The DisMax Query

So now since we’ve broken down the DisMax query and taken a look at all the little pieces, let’s build it back up and look once more at the DisMax query as a whole.

+(
  ( DisjunctionMaxQuery((Body:captain | Title:captain^1.5))
    DisjunctionMaxQuery((Body:james | Title:james^1.5))
    DisjunctionMaxQuery((Body:kirk | Title:kirk^1.5)) )~2
)
DisjunctionMaxQuery((Body:"captain james kirk"))
DisjunctionMaxQuery((Title:"captain james kirk"))

In English:

  • We’re looking first and foremost for documents that contain at least two of the three terms: captainjames, and kirk. This has to match!
    • We are searching only in the Body and Title fields, and documents don’t get a higher score for matching any one of these terms in both fields.
      • Oh… and Title is really important, so its score is multiplied by 1.5.
  • Next if any of the documents from the last bullet also have the phrase “captain james kirk” in the Body, then we’ll give them a higher score.
  • Again, if any of the documents from the last bullet also have the phrase “captain james kirk” in the Title, then we’ll give them a higher score.

And that’s it! It’s a little confusing at first, but once you understand what’s going on, you’ll be better able to control the relevancy of your search results. Next up, take a look at a related post where I talk about finally killing Sea Biscuit!

Published at DZone with permission of John Berryman, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)