Doug has been engrossed in programming since his parents first bought him an Apple IIe computer in 4th grade. Throughout his early career, Doug proved his flexibility and ingenuity in crafting solutions in a variety of environments. Doug’s most recent work has been in the telecom industry developing tools to analyze large amounts of network traffic using C++ and Python. Doug loves learning and synthesizing this knowledge into code and blog articles. Doug is a DZone MVB and is not an employee of DZone and has posted 33 posts at DZone.

Querying More Fields != More Results

04.18.2013

Let’s recall from Anatomy of a Dismax Query some key components to the dismax query parser:

  • qf – the fields we will search over (we’ll take the highest score out of all the fields that match)
  • mm – the minimum number of query clauses (roughly, one per search term) that MUST match

OK, now we’ve had plenty of time to study John’s post (and hey, you should even be able to debug Solr). Let’s take our new knowledge for a test drive with this puzzler: why would adding a field to qf cause our result set to actually shrink? Consider these two Solr queries:

(A) http://localhost:8983/solr/select?
    q=captain+of+enterprise&qf=body&mm=3&defType=dismax

(B) http://localhost:8983/solr/select?
    q=captain+of+enterprise&qf=title+body&mm=3&defType=dismax

The only difference between A and B is qf. Query B adds “title” to qf.
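If you want to reproduce the comparison from a script, the two request URLs can be assembled as in the sketch below. It assumes the same local Solr endpoint shown above; the helper name `dismax_url` and its `debug` flag are made up for illustration, and the `debugQuery=true` parameter it can append is the one used later in this post to inspect query parsing.

```python
from urllib.parse import urlencode

# Assumed endpoint, copied from the example URLs in this post.
BASE = "http://localhost:8983/solr/select"


def dismax_url(q, qf, mm, debug=False):
    """Build a dismax request URL (hypothetical helper, not part of Solr)."""
    params = {
        "q": q,
        "qf": " ".join(qf),   # qf takes a space-separated list of fields
        "mm": mm,
        "defType": "dismax",
    }
    if debug:
        params["debugQuery"] = "true"
    # urlencode escapes spaces as '+', matching the URLs shown above.
    return BASE + "?" + urlencode(params)


query_a = dismax_url("captain of enterprise", ["body"], 3)
query_b = dismax_url("captain of enterprise", ["title", "body"], 3)
print(query_a)
print(query_b)
```

The only thing that differs between the two calls is the list passed as `qf`.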

Why would query A return more results than query B? In query B we added a field, so shouldn’t there be more fields to match on and therefore more documents? Not necessarily, as it turns out. Here is a clue: body is stop worded at query time; title is not.

Let’s dig a little deeper and set debugQuery=true to take a gander at what’s happening under the hood with query parsing and analysis. When we dig into query parsing, A and B turn into the following two dismax queries:

(A) +((body:captain body:enterprise)~2)

(B) +(((title:captain | body:captain) (title:of) (title:enterprise | body:enterprise))~3)

Notice how in both cases body’s stop wording has removed our search for “of” in the body field. In query A this reduces mm to 2, as dismax nicely figures out that after stop wording, we only have 2 clauses in our query – “body:captain” and “body:enterprise”.

What has the addition of an extra field done in B? Notice that it has introduced a third clause, sitting between the “captain” and “enterprise” clauses. Query-time stop wording has removed “body:of”. However, title is not stop worded. Therefore, Solr can still potentially match on “title:of”, so this component of the middle clause stays in place.

The result of query parsing is that query B has three clauses and mm=3, so every clause is mandatory, including the one that requires title to contain “of”. Therefore, the result set for query B is limited to documents whose title contains “of”. If no titles contain “of”, we get no results.
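To make the mechanics concrete, here is a toy model of the behavior described above. It is not Solr code: the stop word list, the `analyze` function, and the three sample documents are all invented for illustration, and scoring is ignored entirely. It only mimics two things: stop wording is applied per field at query time, and mm is capped at the number of clauses that survive.

```python
# Toy model only: hypothetical stop words, analyzers, and documents.
STOPWORDS = {"of", "the", "in", "is"}


def analyze(field, term):
    """Query-time analysis: body drops stop words, title keeps every term."""
    if field == "body" and term in STOPWORDS:
        return None
    return term


def parse_dismax(query, qf, mm):
    """Return (clauses, required); each clause lists its (field, term) options."""
    clauses = []
    for term in query.split():
        options = [(f, term) for f in qf if analyze(f, term) is not None]
        if options:  # a term removed from every field yields no clause
            clauses.append(options)
    return clauses, min(mm, len(clauses))


def search(docs, query, qf, mm):
    clauses, required = parse_dismax(query, qf, mm)
    hits = []
    for doc in docs:
        matched = sum(
            any(term in doc[field] for field, term in clause) for clause in clauses
        )
        if matched >= required:
            hits.append(doc["id"])
    return hits


DOCS = [
    {"id": 1, "title": {"star", "trek"}, "body": {"captain", "enterprise", "kirk"}},
    {"id": 2, "title": {"captain", "of", "industry"}, "body": {"captain", "enterprise"}},
    {"id": 3, "title": {"picard"}, "body": {"captain", "enterprise", "borg"}},
]

results_a = search(DOCS, "captain of enterprise", ["body"], 3)
results_b = search(DOCS, "captain of enterprise", ["title", "body"], 3)
print(results_a)  # all three documents match on body alone
print(results_b)  # only the document with "of" in its title survives
```

In this model, query A ends up with two clauses and needs both, while query B keeps a third clause that only title can satisfy, so adding a field shrinks the result set.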

This sounds like an unlikely scenario, but suppose that instead of “title” you have another field with a very tightly controlled vocabulary, something like titles of laws. Then you could hit this problem very easily.

Solutions?

It’s a bit hard to figure out what’s expected of Solr in this case. Should the “title:of” query be mandatory? Or should it be coupled with a “body:?” clause that matches any term in body, effectively letting body off the hook?

As a user, it doesn’t seem to make sense to give up stop wording entirely just to avoid this behavior. It’s a useful tool. More importantly, I feel that we probably want dismax to continue to be able to search over heterogeneous fields with their own analysis chains. Why should the behavior of dismax constrain how we decide to slice up individual fields?

Nevertheless, one takeaway is clear: don’t get aggressive with mm. Think carefully about mm in terms of the share of stop words you’ll likely encounter, keeping in mind that a stop word which survives in only one field can turn that clause into a de facto requirement on that field. For a long query such as q=Where in the world is Carmen Sandiego? that could be quite a few stop words. For a short query such as q=Carmen Sandiego you’re likely to encounter few or none. Luckily, Solr lets us control mm as a function of the number of clauses in the query.
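Solr’s mm parameter accepts more than a plain integer: it also takes negative numbers, percentages, and conditional expressions such as 2<-25% 9<-3 (with up to 2 clauses all are required; from 3 to 9, all but 25%; above 9, all but 3). The sketch below is a simplified re-implementation of those documented rules, written for illustration rather than taken from Solr, so treat the function name `min_should_match` and its edge-case handling as assumptions.

```python
def min_should_match(clause_count, spec):
    """Approximate how an mm spec resolves for a given number of clauses."""
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "bound<value" pairs, checked left to right.
        result = clause_count
        for part in spec.split():
            bound, value = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # Percentages are truncated toward zero.
        magnitude = clause_count * abs(percent) // 100
        result = clause_count - magnitude if percent < 0 else magnitude
    else:
        number = int(spec)
        # A negative number means "all but this many".
        result = clause_count + number if number < 0 else number
    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(result, clause_count))


for n in (2, 4, 7, 10):
    print(n, min_should_match(n, "2<-25% 9<-3"))
```

A spec like this keeps short queries strict while giving longer, stop-word-heavy queries some slack, which softens the effect of a stray clause that only one field can satisfy.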

I’d love to get your thoughts though. Have you encountered this issue before in the wild? How have you solved it?

Published at DZone with permission of Doug Turnbull, author and DZone MVB. (source)