
Doug has been engrossed in programming since his parents first bought him an Apple IIe computer in 4th grade. Throughout his early career, Doug proved his flexibility and ingenuity in crafting solutions in a variety of environments. Doug’s most recent work has been in the telecom industry, developing tools to analyze large amounts of network traffic using C++ and Python. Doug loves learning and synthesizing this knowledge into code and blog articles. Doug is a DZone MVB, is not an employee of DZone, and has posted 36 posts at DZone.

Improve Search Relevance by Telling Solr Exactly What You Want

07.24.2013

To be successful, (e)dismax relies on avoiding a tricky problem with its scoring strategy. As we’ve discussed, dismax scores documents by taking the maximum score of all the fields that match a query. This is problematic because one field’s scores can’t easily be compared with another’s. A good “text” match might score 2, while a bad “title” match might score 10. Dismax has no notion that 10 is bad for title; it only knows that 10 > 2, so title matches dominate the final search results.
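The effect is easy to reproduce with a toy model. The sketch below is not Solr code; it only mimics the "take the best field" rule, with an optional tie parameter that adds a fraction of the other fields' scores. The document names and scores are invented for illustration.

```python
# Toy model of dismax-style scoring (not Solr itself): a document's score is
# the best of its per-field scores, plus tie * the sum of the other fields.
def dismax_score(field_scores, tie=0.0):
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others

# Hypothetical per-field scores: 2 is a good score for "text",
# while 10 is a poor score for "title".
docs = {
    "good_text_match": {"text": 2.0},
    "bad_title_match": {"title": 10.0},
}

# Rank by dismax score, highest first.
ranked = sorted(docs, key=lambda name: dismax_score(docs[name]), reverse=True)
print(ranked)  # ['bad_title_match', 'good_text_match']
```

The weak title match wins simply because its raw number is larger; nothing in the max step knows what counts as a good score for each field.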


Please find my piece of hay!

The best case for dismax is that there’s only one field that matches a query, so the resulting scoring reflects the consistency within that field. In short, dismax thrives with needle-in-a-haystack problems and does poorly with hay-in-a-haystack problems.

We need a different strategy for documents whose fields overlap heavily. We’re trying to tell the difference between very similar pieces of hay. The task is similar to finding a good candidate for a job. If we queried a search index of job candidates for “Solr Java Developer”, we would clearly match many different sections of our candidates’ resumes. Because dismax takes only the best-scoring field, we may end up with search results sorted heavily on the “objective” field. Our top scoring result might have something like:

Goal: Work with Solr some day!

Clearly not what we want! We need the hardcore experienced folks!

I’ve switched to a different strategy for search relevancy in these kinds of cases. Start with simple, rudimentary scoring that avoids the wild swings of dismax. Once this is in place, give Solr a list of additive queries (via bq/bf) that describe the ideal document. Tune the multiplier on each qualification through testing and experimentation.
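As a rough mental model of this strategy, here is a sketch under simplifying assumptions: the final score is treated as a base score plus each boost times its multiplier. Real Solr scoring involves more than this sum, and the criterion names, weights, and values below are all hypothetical.

```python
# Simplified model of additive boosting: base score plus weighted boosts.
# This only illustrates the idea of layering criteria; it is not Solr's formula.
def additive_score(base, boosts, weights):
    return base + sum(weights[name] * value for name, value in boosts.items())

# Hypothetical multipliers, the kind of numbers you would tune by testing.
weights = {"skills": 2.0, "goal": 0.5, "reputation": 1.0}

# Two invented candidates with the same base match on the query text.
hobbyist = additive_score(1.0, {"skills": 0.0, "goal": 1.0, "reputation": 0.0}, weights)
veteran = additive_score(1.0, {"skills": 1.0, "goal": 0.0, "reputation": 0.8}, weights)

print(veteran > hobbyist)  # True
```

Raising or lowering a single weight changes how much that one qualification matters without disturbing the others, which is what makes this style of tuning manageable.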

Simple Base Scoring

Instead of relying on qf/pf to search and take the best of multiple fields, I’ll create a grab-bag field. I’ll use Solr’s copyField directives to copy all text I want to match on into this field in the schema:

<copyField source="resume_goals" dest="text_all"/>
<copyField source="resume_experience" dest="text_all"/>
<copyField source="resume_skills" dest="text_all"/>

The field “text_all” becomes what Solr initially searches. The assumption here is that it’s appropriate to tokenize everything that goes into text_all the same way. In this kind of setup, you might also want to consider omitTermFreqAndPositions for text_all; otherwise your scoring will be heavily biased toward the field that contributes the most tokens to text_all.
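To see why term frequency can skew a grab-bag field, consider this small sketch. It imitates copyField by concatenating whitespace-split tokens, which is a simplification of a real analysis chain, and the resume text is invented.

```python
# Imitate copyField: gather the tokens of several source fields into one list.
# Lowercasing and whitespace splitting stand in for a real Solr analyzer.
def copy_fields(doc, sources):
    tokens = []
    for name in sources:
        tokens.extend(doc.get(name, "").lower().split())
    return tokens

# Invented resume; the experience section repeats the query term.
resume = {
    "resume_goals": "work with solr",
    "resume_experience": "solr search solr indexing solr tuning",
    "resume_skills": "solr java",
}

text_all = copy_fields(resume, ["resume_goals", "resume_experience", "resume_skills"])

raw_tf = text_all.count("solr")          # counts every occurrence
binary_match = int("solr" in text_all)   # what remains once frequencies are omitted

print(raw_tf, binary_match)  # 5 1
```

With frequencies kept, the wordiest field drives the count; with frequencies omitted, a term either matches or it doesn't, leaving the boosts described later to do the ranking.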

Now we can set

qf=text_all

and start searching!
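For reference, a baseline request might be assembled like this. This is a sketch: defType=edismax and the example query are assumptions about how the handler is invoked, and nothing is sent to a server.

```python
from urllib.parse import urlencode

# Build the query string for a baseline search against the grab-bag field.
# Assumption: edismax is selected per request via defType.
def base_params(user_query):
    return {"defType": "edismax", "q": user_query, "qf": "text_all"}

query_string = urlencode(base_params("Solr Java Developer"))
print(query_string)  # defType=edismax&q=Solr+Java+Developer&qf=text_all
```

The resulting string would be appended to your own select handler URL.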

Describe Job Qualifications to Solr

Once there’s baseline, predictable scoring in place, let’s describe our ideal candidate by passing Solr multiple boost queries that help bubble up the best documents for the problem we’re trying to solve:

  1. The candidate has at least 75% of the required skills

    bq={!edismax qf=resume_skills mm=75% v=$q bq=}

  2. The candidate wants to work with the technology

    bq={!edismax qf=resume_goals v=$q bq=}

  3. The candidate has a high StackOverflow reputation

    bf=log(resume_stackoverflow_reputation)

Each of these queries lets Solr layer an extra factor into the sorting. Notice how in the bq we set v=$q. We’re using Solr’s local param syntax to reprocess the original query against a new set of criteria. We’re also assuming in the first bq that resume_skills uses an analysis chain that filters out tokens that aren’t job skills, through a combination of synonyms and filtering. It’s also important to note that this isn’t the finished product. Each boost needs to be carefully tuned through testing, tweaking its impact with the ^(multiplier) syntax.
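Putting the pieces together, the full set of request parameters could be assembled as follows. This is a sketch, not a tuned configuration: the parameter values are copied from the list above, defType=edismax is an assumption, and the per-boost multipliers are deliberately left out because they come from experimentation.

```python
from urllib.parse import urlencode, parse_qs

# Return (name, value) pairs so that bq can appear more than once.
def candidate_search_params(user_query):
    return [
        ("defType", "edismax"),  # assumption: parser chosen per request
        ("q", user_query),
        ("qf", "text_all"),      # baseline search on the grab-bag field
        ("bq", "{!edismax qf=resume_skills mm=75% v=$q bq=}"),
        ("bq", "{!edismax qf=resume_goals v=$q bq=}"),
        ("bf", "log(resume_stackoverflow_reputation)"),
    ]

params = candidate_search_params("Solr Java Developer")
query_string = urlencode(params)  # percent-encodes {, !, $, % and =

# Decode again to confirm that both boost queries survived encoding.
decoded = parse_qs(query_string)
print(len(decoded["bq"]), "boost queries,", len(decoded["bf"]), "boost function")
```

Using a list of pairs rather than a dict matters here, since a dict would keep only one of the two bq entries.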


Which one of you is the perfect document for this query?

One nice thing about this strategy is that we’re telling Solr directly what we want in an awesome candidate. It’s a bit like using Solr as a fuzzy sorter: we explicitly feed it criteria we think are “good”, tune those criteria, then use it to find the answers that match as many of them as possible. It’s also easy to decide later that we want to layer on additional criteria (does the candidate have code on GitHub that uses skills in the query? How much code? How recent is it?). We could even apply additional queries based on criteria like salary requirements. It’s a pretty exciting strategy. John Berryman and I have even been wondering whether this might help get at his multiple objective scoring ideas. In any case, I hope to be using it more!

Let us know what you think of this strategy! If you’ve got a tough relevancy problem, let us know. We’ve got this and plenty of other relevancy tricks up our sleeves, and we’d love to talk with you!

- See more at: http://www.opensourceconnections.com/2013/07/21/improve-search-relevancy-by-telling-solr-exactly-what-you-want/#sthash.7138KIE7.dpuf

Published at DZone with permission of Doug Turnbull, author and DZone MVB. (source)
