
Erick is a Solr/Lucene committer who has worked in the IT field for 30 years or so. He holds a MS from the University of Michigan and works for Lucid Imagination. He's envious of the kids these days whose workday isn't spent chasing down wild pointers in "C" like he had to when he started programming, and yearns for the days when the entire programming language was contained in a slim, trim, little book (K&R, 274 pages, small form-factor) rather than the shelf-and-a-half of 8 1/2" x 11" tomes that he has to deal with now.

Solr: Experimenting with Join Performance

06.25.2012

We recently had a client who wanted some up-front sense of how Solr joins performed. Naturally, the client wanted to use joins in the most painful way, so I set out to make a prototype. Of course I ran into some issues, but one of the delights of working for Lucid is that I have ready access to many of the people who wrote the code, something to treasure! Being able to access these folks makes me look waaaay smarter than I am….

Anyway, on my 2009 MacBook Pro I ran some rather unscientific experiments, but enough to give me a sense of how joins perform in one particular case. I’ll outline what I did and what the results were.

The setup

For this experiment, I created an index consisting of 26M documents, divided into groups of one text document and 5 metadata documents each. The text document contained 1K of semi-random English words (just chosen from “some list I got from the internet”). Semi-random because I weighted them a bit to have more common words than rare words, but it turns out that the searching part of the process isn’t where the time is spent, so we can pretty much ignore that.

There are 5 metadata documents related to the text doc by Solr’s <uniqueKey>. Think of this as the metadata documents having a foreign key into the text doc <uniqueKey>. The metadata documents also had an integer field in the range 0-10,000. The whole purpose of this setup was to form queries that returned the text docs for which a metadata doc existed granting access. The complexity of granting access is…er…low, I just did a range query. “Not realistic” you say. You’re right. I didn’t want any complex processing to get in the way of looking at joins, so I kept all this simple.
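As a rough illustration of that layout, here is a minimal Python sketch of one group of documents. The field names `id`, `text_all`, `join_id` and `access` are taken from the query shown later in the article; everything else (the tiny vocabulary, the weights, the id format, the words-per-document count) is a made-up placeholder, not what the original index used.

```python
import random

# Hypothetical vocabulary; the original used a word list from the internet.
WORDS = ["solr", "join", "index", "query", "field", "term", "shard", "cache"]
# Assumed weighting: earlier words are drawn more often than later ones.
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 1]


def make_group(group_num, rng, words_per_doc=150, metadata_per_group=5):
    """Build one text document plus the metadata documents that point at it."""
    text_id = "text-%d" % group_num  # assumed id format
    text_doc = {
        "id": text_id,
        "text_all": " ".join(rng.choices(WORDS, weights=WEIGHTS, k=words_per_doc)),
    }
    metadata_docs = [
        {
            "id": "%s-meta-%d" % (text_id, n),  # metadata docs need their own key
            "join_id": text_id,                 # "foreign key" to the text doc
            "access": rng.randint(0, 10000),    # integer used by the range filter
        }
        for n in range(metadata_per_group)
    ]
    return [text_doc] + metadata_docs
```

Calling `make_group(0, random.Random(42))` returns six dictionaries that could be posted to Solr in whatever update format you prefer.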

The form of the join query was:

 q=text_all:(1 to 3 random words)&fl=id,score&sort=score desc&fq={!join from=join_id to=id}access:[7434 TO 7514]

See: http://wiki.apache.org/solr/Join. I wrote a little harness to fire off HTTP queries at the instance of Solr (4.x from a couple of months back). I could configure the number of simultaneous threads firing off queries. Note that I was testing this form because it applied to the customer, but I suspect that the other forms have the same issue.
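The harness itself isn’t shown here, but something along these lines would do the same job. Treat it as a sketch under assumptions: the Solr URL and core name are placeholders, the helper names are invented, and the timing is client-side wall-clock time per request rather than the QTime Solr reports in its response header.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; adjust host, port and core name for your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_params(words, low, high):
    """Query parameters in the same form as the join query above."""
    return {
        "q": "text_all:(%s)" % " ".join(words),
        "fl": "id,score",
        "sort": "score desc",
        "fq": "{!join from=join_id to=id}access:[%d TO %d]" % (low, high),
    }


def http_fetch(params):
    """Send one query to Solr and read the whole response body."""
    with urlopen(SOLR_SELECT_URL + "?" + urlencode(params)) as resp:
        return resp.read()


def timed_query(fetch, params):
    """Return the client-side wall-clock seconds one request took."""
    start = time.perf_counter()
    fetch(params)
    return time.perf_counter() - start


def run_load(fetch, param_list, threads):
    """Fire all queries using `threads` concurrent workers and summarize."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(lambda p: timed_query(fetch, p), param_list))
    elapsed = time.perf_counter() - start
    return {
        "queries": len(latencies),
        "avg_seconds": sum(latencies) / len(latencies),
        "elapsed_seconds": elapsed,
        "qps": len(latencies) / elapsed,
    }

# Example use against a running Solr (not executed here):
#   queries = [build_params(["apple", "river"], 7434, 7514)] * 20
#   print(run_load(http_fetch, queries, threads=2))
```

Passing the fetch function in as an argument keeps the timing logic testable without a live Solr instance.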

Small disclaimer

As I mentioned, one of the pleasures of working for Lucid is having access to people who deeply understand the code. So I chatted with the join author (Yonik Seeley) and discovered, of course, that the scenario I was testing was the worst case performance-wise. Joins are O(num_terms_in_fields), and using the <uniqueKey> as my join field guarantees that there are lots and lots and lots of terms. So these results are worst-case. Unfortunately, this is also one of the most common ways people want to use joins.
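To get a feel for why a cost proportional to the number of terms hurts so much with a <uniqueKey>, here is a toy model in Python. It is emphatically not Solr’s join code; the function and the data shapes are invented for illustration. It only shows that a loop over every term in the join field does the same amount of work whether the filter matches one document or a million.

```python
def toy_join(from_postings, to_postings, matching_from_docs):
    """Toy term-at-a-time join (an illustration, not Solr's implementation).

    from_postings: term -> set of doc ids carrying that term in the "from" field
    to_postings:   term -> set of doc ids carrying that term in the "to" field
    matching_from_docs: doc ids that satisfied the filter on the "from" side
    """
    joined = set()
    terms_visited = 0
    for term, from_docs in from_postings.items():
        terms_visited += 1  # paid once per unique term, match or no match
        if from_docs & matching_from_docs:
            joined |= to_postings.get(term, set())
    return joined, terms_visited
```

With a join field that has one unique term per text document, `terms_visited` equals the number of text documents even when only a handful of metadata documents match the range query.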

Threads  Queries (total)  Avg QTime (seconds)  Elapsed clock time (seconds)  Queries/second
1        20               4.9                  98                            0.2
2        40               5.9                  123                           0.3
5        100              15.3                 310                           0.3
10       200              31.5                 649                           0.3

A note about these rather counter-intuitive numbers. Once the CPU maxes out, the QTime starts to increase, but the QPS rate stays rather constant. On a dual-core machine, we see that with 2 threads. The 5 and 10 thread (client) rows simply show that each individual request takes longer, end-to-end, but there are more queries being served by Solr simultaneously.
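A quick way to sanity-check those numbers is to compare two estimates of throughput: total queries divided by elapsed time, and threads divided by average query time. The second one assumes each client thread sends its next query as soon as the previous one returns, which is my reading of the setup rather than something stated above. The sketch below uses only the figures from the results table.

```python
# Rows from the results table: (threads, total queries, avg QTime s, elapsed s).
ROWS = [
    (1, 20, 4.9, 98),
    (2, 40, 5.9, 123),
    (5, 100, 15.3, 310),
    (10, 200, 31.5, 649),
]


def measured_qps(queries, elapsed_seconds):
    """Throughput as actually observed over the whole run."""
    return queries / elapsed_seconds


def closed_loop_qps(threads, avg_seconds):
    """Throughput if every thread always has exactly one request in flight."""
    return threads / avg_seconds


def summarize(rows):
    """Return (threads, measured qps, estimated qps) rounded to 2 places."""
    return [
        (t, round(measured_qps(q, e), 2), round(closed_loop_qps(t, a), 2))
        for t, q, a, e in rows
    ]
```

Both estimates agree closely on every row: about 0.2 queries/second with one thread and a little over 0.3 for the rest. That is what you would expect once both cores are busy: adding client threads raises the time each request waits, not the rate at which Solr finishes them.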

When I took the join part out, performance went up about 15x. I was monitoring the CPU, and it was pegged with 2 threads, which makes sense. I had JConsole running and didn’t see anything odd with memory/garbage collection, but it was just a cursory examination.

But what does it mean?

Well, the take-away is that you really, really should experiment with the join performance in your situation before deciding on it as a solution for all your problems. I’d expect the numbers to be much better for fields with fewer unique values. But Solr makes a lousy RDBMS, and every time you think of using it as one, you should make an effort to re-think your problem in a way that doesn’t try to make Solr behave as one. These numbers, assuming that they are representative of your particular situation, could well be killers. On the other hand, they may be fine if your particular situation is serving a small community of users for whom the time spent waiting for a query to return is well-spent. It Depends ™.

It might also mean that the case that Solr join functionality was meant to solve takes an unnecessarily restrictive approach for this particular problem. I suspect it’s quite possible that specializing the join code for the case where the “to” field is the <uniqueKey> might change the performance radically. One of the characteristics of open-source code is that solutions for the immediate problem get implemented and then refined for other cases as necessary.

“Rethinking” often involves four phases.

  1. Think hard about the problem. Can it be solved by clever indexing? DB folks really don’t like to flatten data, but that’s often a viable approach.
  2. Ask if the functionality is really and truly something that’ll help the user experience. Often faceting and filtering will be “good enough”. There’s nothing particularly “natural” about RDBMS concepts as far as your users are concerned, so ask your UI design experts what would really help the user.
  3. Prototype in your situation and talk to your product managers before irrevocably deciding to go down this route. The eXtreme Programming people emphasize over and over that making your PMs aware of the costs of a feature they’re insisting on will help them make better decisions. Asking “What you want will require 5 times as much hardware and take an additional month to implement, will XXX be good enough?” gives them some information.
  4. Ask “Is Solr the right solution?”. I love Solr/Lucene. Working with this ecosystem pays my bills. I admire the work that people do in the nitty-gritty parts of the code. But Solr and Lucene aren’t suited for some tasks. It may be that the problem you’re trying to solve would be better served by an RDBMS. It may be that some kind of hybrid between Solr and an <insert your favorite solution here> works better. It may be that Solr shouldn’t be part of the solution to this problem at all. Not all nails should be driven with Solr.
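To make the “clever indexing” point from step 1 concrete for this experiment: one option is to flatten each group at index time by copying the metadata documents’ access values onto the text document, so the filter becomes a plain range query with no join. The sketch below assumes `access` can be declared as a multi-valued integer field in your schema and that “any metadata document in range grants access” is the rule you want; the function names are hypothetical.

```python
def flatten_group(text_doc, metadata_docs):
    """Fold the access values of the related metadata docs into the text doc."""
    flat = dict(text_doc)
    flat["access"] = sorted(
        m["access"] for m in metadata_docs if m["join_id"] == text_doc["id"]
    )
    return flat


def join_free_params(words, low, high):
    """Same query as the join form, but filtering directly on the text docs."""
    return {
        "q": "text_all:(%s)" % " ".join(words),
        "fl": "id,score",
        "sort": "score desc",
        "fq": "access:[%d TO %d]" % (low, high),
    }
```

The trade-off is the usual one with denormalized data: changing one access value means re-indexing the whole text document, which may or may not be acceptable depending on how often permissions change.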

As I understand it, this behavior is inherent in how the join code is implemented and the number of matching documents isn’t the limiting factor (and this was borne out in my experiments). I wonder if one could make use of the fact that the join field is a <uniqueKey> to implement a specialization. Hmmm, I’ll have to talk to Yonik, but I suspect it’s one of those things that seems simple but quickly gets untenable. And, here we go again trying to make Solr behave like a DB…..

You can still use joins!

Don’t interpret this as saying “Don’t use joins”. Rather, you should be aware that they were implemented to solve a specific problem, not the general many-to-many relationship. The algorithm does what it needs to do to solve that problem, but when applied to a different problem it may not perform well enough for your situation. Test, test, test!!!

Published at DZone with permission of its author, Erick Erickson. (source)
