
Doug has been engrossed in programming since his parents first bought him an Apple IIe computer in 4th grade. Throughout his early career, Doug proved his flexibility and ingenuity in crafting solutions in a variety of environments. Doug’s most recent work has been in the telecom industry developing tools to analyze large amounts of network traffic using C++ and Python. Doug loves learning and synthesizing this knowledge into code and blog articles.

Async Solr Queries in Python

11.04.2013

I frequently hit the wall of needing to work asynchronously with Solr requests in Python. I’ll have some code that blocks on a Solr HTTP request, waits for it to complete, then executes a second request. Something like this code:

import requests

#Search 1
solrResp = requests.get('http://mysolr.com/solr/statedecoded/search?q=law')

for doc in solrResp.json()['response']['docs']:
    print(doc['catch_line'])

#Search 2
solrResp = requests.get('http://mysolr.com/solr/statedecoded/search?q=shoplifting')

for doc in solrResp.json()['response']['docs']:
    print(doc['catch_line'])

(We’re using the Requests library to make the HTTP requests.)

Being able to parallelize work is especially helpful with scripts that index documents into Solr. I need to scale my work up so that Solr, not network access, is the indexing bottleneck.
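
To make that concrete, here’s a minimal sketch of the kind of blocking indexing loop I mean. The /update endpoint and the docBatches list are hypothetical, and posting JSON with requests.post(json=...) assumes a reasonably recent Requests and Solr’s JSON update handler:

import requests

# Hypothetical document batches; a real indexing script would pull these
# from a file, a database, or a crawler.
docBatches = [
    [{'id': '1', 'catch_line': 'Larceny'}],
    [{'id': '2', 'catch_line': 'Shoplifting'}],
]

# Each POST blocks until Solr answers, so batches queue up behind one
# network round trip after another.
for batch in docBatches:
    resp = requests.post('http://mysolr.com/solr/statedecoded/update',
                         json=batch)
    resp.raise_for_status()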

Unfortunately, Python isn’t exactly JavaScript or Go when it comes to asynchronous programming. But the gevent coroutine library can help us a bit with that. Under the hood, gevent uses the libevent library. Built on top of the operating system’s native async calls (select, poll, and friends, the original async), libevent nicely leverages a lot of low-level async functionality.

Working with gevent is fairly straightforward. One slight sticking point is gevent.monkey.patch_all(), which patches much of the standard library to cooperate better with gevent’s asynchrony. It sounds scary, but I have yet to have a problem with the monkey-patched implementations.
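
If you want to convince yourself the patching actually happened, a quick check like the one below does the trick. This is just an illustrative sketch; the exact repr will vary by gevent version and Python version:

from gevent import monkey
monkey.patch_all()

import socket

# After patch_all(), socket.socket points at gevent's cooperative socket
# class instead of the stock blocking one, which is why libraries like
# Requests can yield to other greenlets while they wait on I/O.
print(socket.socket)   # something like <class 'gevent._socket3.socket'>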

Without further ado, here’s how you use gevent to do parallel Solr requests:

from gevent import monkey
monkey.patch_all()  # patch the standard library before anything imports socket

import gevent
import requests


class Searcher(object):
    """ Simple wrapper for doing a search and collecting the
        results """
    def __init__(self, searchUrl):
        self.searchUrl = searchUrl

    def search(self):
        solrResp = requests.get(self.searchUrl)
        self.docs = solrResp.json()['response']['docs']


def searchMultiple(urls):
    """ Use gevent to execute the passed in urls;
        dump the results"""
    searchers = [Searcher(url) for url in urls]

    # Gather a handle for each task
    handles = []
    for searcher in searchers:
        handles.append(gevent.spawn(searcher.search))

    # Block until all work is done
    gevent.joinall(handles)

    # Dump the results
    for searcher in searchers:
        print "Search Results for %s" % searcher.searchUrl
        for doc in searcher.docs:
            print doc['catch_line']

searchUrls = ['http://mysolr.com/solr/statedecoded/search?q=law', 
              'http://mysolr.com/solr/statedecoded/search?q=shoplifting']

searchMultiple(searchUrls)

Lots more code, and not nearly as pretty as the equivalent JavaScript, but it gets the job done. The meat of the code is these lines:

# Gather a handle for each task
handles = []
for searcher in searchers:
    handles.append(gevent.spawn(searcher.search))

# Block until all work is done
gevent.joinall(handles)

We tell gevent to spawn searcher.search. This gives us a handle to each spawned task. We then wait for all the spawned tasks to complete before dumping the results.
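
The same spawn/joinall pattern is what lets the indexing scripts mentioned earlier scale: spawn one greenlet per batch of documents and join them all. The indexBatch helper, the /update endpoint, and the docBatches list below are a hypothetical sketch, not a drop-in client:

from gevent import monkey
monkey.patch_all()  # patch before anything imports socket

import gevent
import requests


def indexBatch(batch):
    """ POST one batch of documents to Solr's JSON update handler
        (hypothetical endpoint; adjust for your setup). """
    resp = requests.post('http://mysolr.com/solr/statedecoded/update',
                         json=batch)
    resp.raise_for_status()


docBatches = [
    [{'id': '1', 'catch_line': 'Larceny'}],
    [{'id': '2', 'catch_line': 'Shoplifting'}],
]

# One greenlet per batch; the HTTP round trips overlap instead of queueing.
gevent.joinall([gevent.spawn(indexBatch, batch) for batch in docBatches])

As with any Solr indexing script, you still need a commit (or autoCommit) before the documents become searchable.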

That’s about it! As always, comment if you have any thoughts or pointers. And let us know how we can help with any part of your Solr search application!


Published at DZone with permission of Doug Turnbull, author and DZone MVB.
