Coming from a background of Aerospace Engineering, John soon discovered that his true interest lay at the intersections of information technology and entrepreneurship (and when applicable - math). In early 2011, John stepped away from his day job to take up software consulting. Finally John found permanent employment at Opensource Connections where he currently consults large enterprises about full-text search and Big Data applications. Highlights to this point have included prototyping the future of search with the US Patent and Trademark Office, implementing the search syntax used by patent examiners, and building a Solr search relevancy tuning framework called SolrPanl. John is a DZone MVB and is not an employee of DZone and has posted 23 posts at DZone.

Solr Finds the Best Time to Post Questions on StackOverflow

03.24.2013

So let’s say that you have an important tech question that simply must be answered:

“What’s the difference between JavaScript and Java?”

Normally you would post it on StackOverflow and add a hefty bounty to get it answered fast. But, you’ve posted a bounty on the past 10 questions and now your Stack Overflow reputation is 4.

Don’t fret, perhaps if you just time your question correctly you can catch all those Java/JavaScript programmers right when they’re answering important questions like yours. And how do you figure out just when that magic time is? Simple, you index the entire StackOverflow data dump into Solr and treat Solr as a StackOverflow analytics engine. (Hey, you may not know the difference between Java and JavaScript, but you’re nobody’s fool when it comes to Solr!)

So here’s what this looks like. The posts.xml file in the StackOverflow data dump contains all the questions and answers on the site. Posts contain the following fields:

  • Id – Unique id for a question or answer.
  • ParentId – If this post is an answer, ParentId refers to the corresponding question.
  • PostTypeId – 1 for a question, 2 for an answer.
  • CreationDate – In Greenwich Mean Time.
  • Body – The contents of the post.
  • Title – You guessed it.
  • Tags – A list of topics for this question.

In order to slice and dice the data to find the best time of year, day of week, or time of day to ask a question, it’s a good idea to break up the CreationDate into a set of related fields:

  • CreationMonth – 1 through 12.
  • CreationHour – 0 through 23.
  • CreationMinute – 0 through 59.
  • CreationDayOfWeek – 0 (Monday) through 6.
  • CreationDayOfYear – 1 through 365 (366 in leap years).
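
A minimal sketch of how these fields might be derived at indexing time, in Python. It assumes CreationDate arrives as an ISO-8601 string such as 2013-03-24T15:30:00.000; the function name derive_date_fields is made up for illustration and is not part of the data dump or of Solr.

```python
from datetime import datetime


def derive_date_fields(creation_date: str) -> dict:
    """Split an ISO-8601 timestamp into facet-friendly integer fields.

    Assumption: the timestamp looks like '2013-03-24T15:30:00.000'.
    """
    dt = datetime.fromisoformat(creation_date)
    return {
        "CreationMonth": dt.month,                    # 1 through 12
        "CreationHour": dt.hour,                      # 0 through 23
        "CreationMinute": dt.minute,                  # 0 through 59
        "CreationDayOfWeek": dt.weekday(),            # 0 (Monday) through 6 (Sunday)
        "CreationDayOfYear": dt.timetuple().tm_yday,  # 1 through 365, or 366 in leap years
    }


fields = derive_date_fields("2013-03-24T15:30:00.000")
print(fields)
```

Python's weekday() happens to use the same convention as the list above (Monday is 0), so no remapping is needed.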

Now all you have to do to find out that golden time for asking a question is to find the times when most people are answering questions about Java AND Javascript.

http://localhost:8983/solr/collection1/select?q=Tags:(java AND javascript)&fq=PostTypeId:2&facet=on
&facet.field=CreationDayOfYear&f.CreationDayOfYear.facet.limit=365&facet.field=CreationDayOfWeek&facet.field=CreationHour&facet.sort=index

In words, the query q asks for all posts tagged with both java and javascript. These results are filtered by fq so that only answers are returned. The remaining parameters turn on index-sorted facet lists for the day of the year, the day of the week, and the hour of the day. So, as soon as you query Solr, you’ll know the best times of the year, week, and day to ask your questions. You press enter and SNAP, no results! What gives?!

After a little research it turns out that only questions (PostTypeId=1) have the Tags field, so obviously you cannot get a count of the answers tagged with Java AND JavaScript. So are you sunk? Is there no way to find out when the Java/JavaScript questions are getting all the attention? Are you going to have to do some crazy MapReduce indexing job to associate answers with their corresponding tags? It turns out no!

Solr Join To The Rescue

That’s right, Solr’s Join functionality is a perfect fit for this particular problem. Let’s take a look at how this would work:

http://localhost:8983/solr/collection1/select?q={!join from=Id to=ParentId}Tags:(java AND javascript)&facet=on
&fq=PostTypeId:2&facet.field=CreationDayOfYear&f.CreationDayOfYear.facet.limit=365&facet.field=CreationDayOfWeek&facet.field=CreationHour&facet.sort=index

As you can see, the only difference here is the strange notation at the front of the q parameter.

{!join from=Id to=ParentId}

That is Solr’s local parameter notation, and here’s what it’s telling Solr to do: First you have join; this is actually syntactic sugar that applies to the first parameter only. It’s the same thing as saying type=join. This means that instead of using the lucene or dismax query mode, we will be using the join query mode. Next we have from=Id. To put this in SQL terms, this means that we will be using Id as the primary key. Finally we have to=ParentId which, as you might have guessed, implies that ParentId will be used as the foreign key.
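
Typed by hand, the braces and spaces in that q value are easy to get wrong, so here is a minimal sketch of building the request URL in Python. It assumes the same local Solr instance and collection1 core as the URLs in this post; the function name build_join_facet_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named collection1.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_join_facet_url(tag_query: str) -> str:
    """Return a select URL that joins tagged questions to their answers."""
    params = [
        # The local params pick the join parser; the rest of q is the inner query.
        ("q", "{!join from=Id to=ParentId}" + tag_query),
        ("fq", "PostTypeId:2"),  # keep answers only
        ("facet", "on"),
        ("facet.field", "CreationDayOfYear"),
        ("f.CreationDayOfYear.facet.limit", "365"),
        ("facet.field", "CreationDayOfWeek"),
        ("facet.field", "CreationHour"),
        ("facet.sort", "index"),
    ]
    # urlencode percent-encodes the braces, spaces, and parentheses in q.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_join_facet_url("Tags:(java AND javascript)")
print(url)
```

A list of pairs is used instead of a dict because facet.field has to appear once per faceted field.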

When we issue the query, Solr first retrieves a list of documents matching the query Tags:(java AND javascript). Then, for every document in that result set, Solr retrieves the set of documents that have a ParentId corresponding to the Ids in the original set.

In the SQL world, this query would look roughly like this:

SELECT *
FROM collection1
WHERE ParentId IN (
    SELECT Id FROM collection1 WHERE Tags = "(java AND javascript)"
)
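
The same two-step lookup can be sketched in plain Python over an in-memory list of posts. This only illustrates the join semantics and says nothing about how Solr implements them; the function name and the sample posts are made up.

```python
def join_answers_to_tagged_questions(posts, required_tags):
    """Mimic {!join from=Id to=ParentId} over a list of post dicts."""
    required = set(required_tags)
    # Step 1: the inner query collects the Id of every post carrying all required tags.
    matching_ids = {
        post["Id"] for post in posts
        if required <= set(post.get("Tags", []))
    }
    # Step 2: the join keeps posts whose ParentId points at one of those Ids.
    return [post for post in posts if post.get("ParentId") in matching_ids]


# Made-up sample posts, for illustration only.
posts = [
    {"Id": 1, "PostTypeId": 1, "Tags": ["java", "javascript"]},
    {"Id": 2, "PostTypeId": 1, "Tags": ["java"]},
    {"Id": 3, "PostTypeId": 2, "ParentId": 1},
    {"Id": 4, "PostTypeId": 2, "ParentId": 2},
    {"Id": 5, "PostTypeId": 2, "ParentId": 1},
]

answers = join_answers_to_tagged_questions(posts, ["java", "javascript"])
print([post["Id"] for post in answers])  # prints [3, 5]
```

Posts 3 and 5 come back because their parent, post 1, carries both tags; post 4 is dropped because its parent is tagged java only.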

And now as soon as you issue the query, you get the following Solr response:

<lst name="CreationDayOfYear">
  <int name="1">2</int>
  <int name="2">4</int>
  <int name="3">12</int>
  <int name="4">5</int>
  <int name="5">10</int>
  <int name="6">8</int>
  <int name="7">2</int>
  <!--snip-->
  <int name="362">9</int>
  <int name="363">3</int>
  <int name="364">4</int>
  <int name="365">5</int>
</lst>
<lst name="CreationHour">
  <int name="1">63</int>
  <int name="2">44</int>
  <int name="3">122</int>
  <int name="4">65</int>
  <int name="5">120</int>
  <int name="6">48</int>
  <int name="7">62</int>
  <!--snip-->
  <int name="21">29</int>
  <int name="22">63</int>
  <int name="23">434</int>
</lst>
<lst name="CreationDayOfWeek">
  <int name="0">371</int>
  <int name="1">390</int>
  <int name="2">383</int>
  <int name="3">422</int>
  <int name="4">369</int>
  <int name="5">266</int>
  <int name="6">272</int>
</lst>
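
To turn a response like this into an answer, a short Python sketch can read the facet counts and pick the busiest value. It assumes the facet lists sit inside Solr's usual facet_counts/facet_fields wrapper and that every facet value is an integer; the function names are hypothetical, and the sample reuses only the CreationDayOfWeek counts quoted above.

```python
import xml.etree.ElementTree as ET


def parse_facet_counts(response_xml: str) -> dict:
    """Map each facet field name to a {value: count} dict."""
    root = ET.fromstring(response_xml.strip())
    counts = {}
    for field in root.findall(".//lst[@name='facet_fields']/lst"):
        counts[field.get("name")] = {
            int(entry.get("name")): int(entry.text)
            for entry in field.findall("int")
        }
    return counts


def busiest(counts_for_field: dict) -> int:
    """Return the facet value with the highest count."""
    return max(counts_for_field, key=counts_for_field.get)


# The CreationDayOfWeek counts from the response, wrapped in the assumed nesting.
sample = """
<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="CreationDayOfWeek">
        <int name="0">371</int><int name="1">390</int><int name="2">383</int>
        <int name="3">422</int><int name="4">369</int><int name="5">266</int>
        <int name="6">272</int>
      </lst>
    </lst>
  </lst>
</response>
"""

day_counts = parse_facet_counts(sample)["CreationDayOfWeek"]
print(busiest(day_counts))  # prints 3
```

With these counts the busiest day is 3, which is Thursday given that 0 is Monday.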

You can imagine how this data could easily be used to build visualizations of the best times to query Stack Overflow for your particular topic. And actually, we are in the process of building such a visualization capability right now. See Patricia’s new post for an example.

Also, if you’re interested in playing with this yourself, check the repo on GitHub.

Published at DZone with permission of John Berryman, author and DZone MVB. (source)
