Chris Hostetter is Senior Staff Engineer at Lucid Imagination, a member of the Apache Software Foundation, and serves as a committer on the Apache Lucene/Solr Projects. Prior to joining Lucid Imagination in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about searching “structured data” that was never as structured as it should have been. Chris has posted 15 posts at DZone. You can read more from them at their website. View Full User Profile

Use Prefix Operators instead of Boolean Operators

01.24.2012
| 6229 views |
  • submit to reddit

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Background: Boolean Logic Makes For Terrible Scores

Boolean Algebra is (as my father would put it) “pretty neat stuff” and the world as we know it most certainly wouldn’t exist with out it. But when it comes to building a search engine, boolean logic tends to not be very helpful. Depending on how you look at it, boolean logic is all about truth values and/or set intersections. In either case, there is no concept of “relevancy” — either something is true or it’s false; either it is in a set, or it is not in the set.

When a user is looking for “all documents that contain the word ‘Alligator’” they aren’t going to very be happy if a search system applied simple boolean logic to just identify the unordered set of all matching documents. Instead algorithms like TF/IDF are used to try and identify the ordered list of matching documents, such that the “best” matches come first. Likewise, if a user is looking for “all documents that contain the words ‘Alligator’ or ‘Crocodile’”, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into the TF/IDF scores of the documents for the individual queries, as well as considering which documents matches both queries. (The user is probably more interested in a document that discusses the similarities and differences between Alligators to Crocodiles then in documents that only mention one or the other a great many times).

This brings us to the crux of why I think it’s a bad idea to use the “Boolean Operators” in query strings: because it’s not how the underlying query structures actually work, and it’s not as expressive as the alternative for describing what you want.

BooleanQuery: Great Class, Bad Name

To really understand why the boolean operators are inferior to the prefix operators, you have to start by considering the underlying implementation. The BooleanQuery class is probably one of the most misleading class names in the entire Lucene code base because it doesn’t model simple boolean logic query operations at all. The basic function of a BooleanQuery is:

  1. A BooleanQuery consists of one or more BooleanClauses, each of which contains two pieces of information:


    • A nested Query
    • An Occur flag, which has one of three values


      • MUST – indicating that documents must match this nested Query in order for the document to match the BooleanQuery, and the score from this subquery should contribute to the score for the BooleanQuery
      • MUST_NOT – indicating that documents which match this nested Query are prohibited from matching the BooleanQuery
      • SHOULD – indicating that documents which match this nested Query should have their score from the nested query contribute to the score from the BooleanQuery, but documents can be a match for the BooleanQuery even if they do not match the nested query

  2. If a BooleanQuery contains no MUST BooleanClauses, then a document is only considered a match against the BooleanQuery if one or more of the SHOULD BooleanClauses is a match.
  3. The final score of a document which matches a BooleanQuery is based on the sum of the scores from all the matching MUST and SHOULD BooleanClauses, multiplied by a “coord factor” based on the ratio of the number of matching BooleanClauses to the total number of BooleanClauses in the BooleanQuery.

These rules are not exactly simple to understand. They are certainly more complex then boolean logic truth tables, but that’s because they are more powerful. The examples below show how easy it is to implement “pure” boolean logic with BooleanQuery objects, but they only scratch the surface of what is possible with the BooleanQuery class:

  • Conjunction: (X ∧ Y)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.MUST);
    q.add(Y, Occur.MUST);
    
  • Disjunction: (X ∨ Y)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.SHOULD);
    q.add(Y, Occur.SHOULD);
    
  • Negation: (X ¬ Z)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.SHOULD);
    q.add(Z, Occur.MUST_NOT);
    

Query Parser: Prefix Operators

In the Lucene QueryParser (and all of the other parsers that are based on it, like DisMax and EDisMax) the “prefix” operators “+” and “-” map directly to the Occur.MUST and Occur.MUST_NOT flags, while the absence of a prefix maps to the Occur.SHOULD flag by default. (If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I’ve been trying to think of a good one for years.) So using the prefix syntax, you can express all of the permutations that the BooleanQuery class supports — not just simple boolean logic:

  • (+X +Y) … Conjunction, ie: (X ∧ Y)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.MUST);
    q.add(Y, Occur.MUST);
    
  • (X Y) … Disjunction, ie: (X ∨ Y)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.SHOULD);
    q.add(Y, Occur.SHOULD);
    
  • (+X -Z) … Negation, ie: (X ¬ Z)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.MUST);
    q.add(Z, Occur.MUST_NOT);
    
  • ((X Y) -Z) … ((X ∨ Y) ¬ Z)

    BooleanQuery q = new BooleanQuery();
    BooleanQuery inner = new BooleanQuery();
    inner.add(X, Occur.SHOULD);
    inner.add(Y, Occur.SHOULD);
    q.add(inner, Occur.SHOULD);
    q.add(Z, Occur.MUST_NOT);
    
  • (X Y -Z) … Not expressible in simple boolean logic

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.SHOULD);
    q.add(Y, Occur.SHOULD);
    q.add(Z, Occur.MUST_NOT);
    

Note in particular the differences between the last two examples. (X Y -Z) is a single BooleanQuery object containing three clauses, while ((X Y) -Z) is a BooleanQuery containing two clauses, one of which is a nested BooleanQuery containing two clauses. In both cases a document must match either “X” or “Y” and it can not match against “Z” (so the set of documents matched by each query will be the same) and in both cases the score of a document will be higher if it matches both the “X” and “Y” clauses; but because of the difference in their structures, the scores for these queries will be different for every document. In particular, the coord factor will cause documents matching only one of “X” or “Y” (but not both) to have extremely different scores from each of these queries. (This assumes that the DefaultSimilarity is being used; it would be possible to write a custom Similarity to force the scores to be equivalent)

Query Parser: “Boolean Operators”

The query parser also supports the so called “boolean operators” which can also be used to express boolean logic, as demonstrated in these examples:

  • (X AND Y) … Conjunction, ie: (X ∧ Y)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.MUST);
    q.add(Y, Occur.MUST);
    
  • (X OR Y) … Disjunction, ie: (X ∨ Y)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.SHOULD);
    q.add(Y, Occur.SHOULD);
    
  • (X NOT Z) … Negation, ie: (X ¬ Z)

    BooleanQuery q = new BooleanQuery();
    q.add(X, Occur.SHOULD);
    q.add(Z, Occur.MUST_NOT);
    
  • ((X AND Y) OR Z) … ((X ∧ Y) ∨ Z)

    BooleanQuery q = new BooleanQuery();
    BooleanQuery inner = new BooleanQuery();
    inner.add(X, Occur.MUST);
    inner.add(Y, Occur.MUST);
    q.add(inner, Occur.SHOULD);
    q.add(Z, Occur.SHOULD);
    
  • ((X OR Y) AND Z) … ((X ∨ Y) ∧ Z)

    BooleanQuery q = new BooleanQuery();
    BooleanQuery inner = new BooleanQuery();
    inner.add(X, Occur.SHOULD);
    inner.add(Y, Occur.SHOULD);
    q.add(inner, Occur.MUST);
    q.add(Z, Occur.MUST);
    
  • (X AND (Y NOT Z)) … (X ∧ (Y ¬ Z))

    BooleanQuery q = new BooleanQuery();
    BooleanQuery inner = new BooleanQuery();
    inner.add(Y, Occur.MUST);
    inner.add(Z, Occur.MUST_NOT);
    q.add(X, Occur.MUST);
    q.add(inner, Occur.MUST);
    

Please note how import it is to use parentheses to combine multiple operators in order in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports an arbitrary number of clauses, so (X OR Y OR Z) is a single BooleanQuery with three clauses — it is not equivalent to either ((X OR Y) OR Z) or (X OR (Y OR Z)) because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery. As mentioned above when discussing the prefix operators, the scores from each of those queries will all be different depending on which clauses are matched by each document.

Things definitely get very confusing when these “boolean operators” are used in ways other then those described above. In some cases this is because the query parser is trying to be forgiving about “natural language” style usage of operators that many boolean logic systems would consider a parse error. In other cases, the behavior is bizarrely esoteric:

  • Queries are parsed left to right
  • NOT sets the Occurs flag of the clause to it’s right to MUST_NOT
  • AND will change the Occurs flag of the clause to it’s left to MUST unless it has already been set to MUST_NOT
  • AND sets the Occurs flag of the clause to it’s right to MUST
  • If the default operator of the query parser has been set to “And”: OR will change the Occurs flag of the clause to it’s left to SHOULD unless it has already been set to MUST_NOT
  • OR sets the Occurs flag of the clause to it’s right to SHOULD

Practically speaking this means that NOT takes precedence over AND which takes precedence over OR — but only if the default operator for the query parser has not been changed from the default (“Or”). If the default operator is set to “And” then the behavior is just plain weird.

In Conclusion

I won’t try to defend or justify the way the query parser behaves when it encounters these “boolean operators”, because in many cases I don’t understand or agree with the behavior myself — but that’s not really the point of this article. My goal isn’t to convince you that the behavior of these operators makes sense, quite the contrary my goal today is to point out that regardless of how these operators are parsed, they aren’t a good representation of the underlying functionality available in the BooleanQuery class.

Do yourself a favor and start thinking about BooleanQuery as a container of arbitrary nested queries annotated with MUST, MUST_NOT, or SHOULD and discover the power that is available to you beyond simple boolean logic.

Many thanks to Bill Dueber whose recent related blog post reminded me that I had some draft notes on this subject floating around my laptop waiting to finished up and posted online.

Source:  http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/


Published at DZone with permission of its author, Chris Hostetter.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Afandi Merathi replied on Fri, 2012/03/16 - 11:57am

Very provocative article, and I’m experiencing the problems you describe with AND and OR. I’m using solrj in my application, though, and don’t understand how to develop and integrate a set of BooleanQuery objects into the SolrQuery. I CAN make changes to surround sets of fields with parentheses – is this the “answer” for users of solrj?

Thank you for your insights and guidance.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.