
Rafal Kuc is a team leader and software developer. Right now he is a software architect and a Solr and Lucene specialist, mainly focused on Java, but open to any tool and programming language that will make achieving his goal easier and faster. Rafal is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with their problems.

SolrCloud HOWTO

03.14.2013

What is the most important change in the 4.x version of Apache Solr? I think there are many of them, but SolrCloud is definitely something that changed a lot in the Solr architecture. Until now, bigger installations suffered from a single point of failure (SPOF) – there was only one master server, and when this server went down, the whole cluster lost the ability to receive new data. Of course you could go for multiple masters, where each master was responsible for indexing some part of the data, but still, there was a SPOF present in your deployment. Even when everything worked, due to the commit interval and the fact that slave instances checked for new data only periodically, the solution was far from ideal – new data appeared in the cluster minutes after the commit.

SolrCloud changed this behavior. In this article we will set up a new SolrCloud cluster from scratch and see how it works.

Our example cluster

In our example we will use three Solr servers. Every server in the cluster is capable of handling both index and query requests. This is the main difference from the old-fashioned Solr architecture with a single master and multiple slave servers. In the new architecture there is one additional element present: Zookeeper, which is responsible for holding the configuration of the cluster and for synchronizing its work. It is crucial to understand that Solr relies on the information stored in Zookeeper – if Zookeeper fails, the whole cluster is useless. That is why it is very important to have a fault-tolerant Zookeeper ensemble, and why we use three independent Zookeeper instances that will form the ensemble.

Zookeeper installation

As we said previously, Zookeeper is a vital part of a SolrCloud cluster. Although we can use the embedded Zookeeper, this is only handy for testing. For production you definitely want your Zookeeper to be installed independently from Solr and run in a different Java virtual machine process, so that the two do not interfere with each other's work.

The installation of Apache Zookeeper is straightforward and can be described by the following steps:

  1. Download Zookeeper archive from: http://www.apache.org/dyn/closer.cgi/zookeeper/
  2. Unpack downloaded archive and copy conf/zoo_sample.cfg to conf/zoo.cfg
  3. Modify zoo.cfg:
    1. Change dataDir to the directory where you want to hold all cluster configuration data
    2. Add information about all Zookeeper servers (see below)

After these changes my zoo.cfg looks like the following:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/data
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
  4. Copy this archive to all servers where the Zookeeper service should run
  5. Create the file /var/zookeeper/data/myid with the server identifier. This identifier is different for each instance (for example, on zk2 this file should contain the number 2)
  6. Start all instances using “bin/zkServer.sh start-foreground” and verify that the installation works
  7. Add “bin/zkServer.sh start” to the startup scripts and make sure that the operating system monitors the availability of the Zookeeper service.
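Step 5 above can be sketched as a short shell snippet. The paths are assumptions: the data directory defaults to a local path here so the snippet is safe to try, while on a real server it would be /var/zookeeper/data, matching the dataDir from zoo.cfg:

```shell
# Sketch of creating the myid file for one ensemble member.
# ZK_DATA and MYID are assumptions - set MYID to 1 on zk1, 2 on zk2, 3 on zk3.
ZK_DATA=${ZK_DATA:-./zookeeper-data}   # on a real server: /var/zookeeper/data
MYID=${MYID:-1}

mkdir -p "$ZK_DATA"
echo "$MYID" > "$ZK_DATA/myid"

# Zookeeper reads this file on startup to know which server.N entry it is.
cat "$ZK_DATA/myid"
```

The identifier in myid must match the N in the corresponding server.N line of zoo.cfg, otherwise the ensemble members cannot identify themselves.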

Solr installation

The installation of Solr consists of the following steps:

  1. Download Solr archive from: http://www.apache.org/dyn/closer.cgi/lucene/solr/4.1.0
  2. Unpack downloaded archive
  3. In this tutorial we will use the ready-made Solr installation from the example directory, and all changes are made to this example installation
  4. Copy the archive to all servers which are part of the cluster
  5. Upload the configuration data that will be used by the Solr cluster to Zookeeper. To do this, run the first instance with:

    java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=solr1 -DzkHost=zk1:2181 -DnumShards=2 -jar start.jar

This should be run only once. Subsequent runs will use the configuration from the Zookeeper cluster, and the local configuration files are not needed.

  6. Run all instances using:

java -DzkHost=zk1:2181 -jar start.jar
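Note that pointing Solr at a single Zookeeper instance (zk1:2181) reintroduces a single point of failure on the client side, even though the ensemble itself has three members. A common practice is to pass the whole ensemble in the zkHost parameter as a comma-separated list. The sketch below only assembles and prints the command so the host list is easy to check before running it on each Solr server (the host names are the ones assumed in zoo.cfg above):

```shell
# Sketch: build the start command with the full Zookeeper ensemble,
# so Solr keeps working as long as a quorum of Zookeeper nodes is up.
ZK_HOSTS="zk1:2181,zk2:2181,zk3:2181"
CMD="java -DzkHost=$ZK_HOSTS -jar start.jar"

# Printed rather than executed here - run it on each Solr server.
echo "$CMD"
```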

Verify the installation

Go to the administration panel on any Solr instance. For our deployment the URL should be like http://solr1:8983/solr. When you click on the Cloud tab, and then Graph, you should see something similar to the following screenshot:

[Screenshot: the Cloud graph view of the example cluster]

Collection

Our first collection, collection1, is divided into two shards (shard1 and shard2). Each of those shards is placed on two Solr instances (OK, in the picture you can see that every Solr instance is placed on the same host – I currently have only one physical server available for tests – any volunteers for a donation? ;) ). The type of the dot tells us whether a given shard is the primary shard or a replica.
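Once the graph looks right, it is worth indexing a test document and querying it back to confirm that the cluster accepts updates. The sketch below only prints the curl commands so they can be reviewed first; the SOLR_URL value is an assumption and can point at any node of the cluster, since every node handles both index and query requests:

```shell
# Sketch: index a single test document and query it back with curl.
# SOLR_URL is an assumption - any node of the cluster will do.
SOLR_URL=${SOLR_URL:-http://solr1:8983/solr/collection1}
UPDATE_CMD="curl '$SOLR_URL/update?commit=true' -H 'Content-Type: application/xml' --data-binary '<add><doc><field name=\"id\">test-1</field></doc></add>'"
QUERY_CMD="curl '$SOLR_URL/select?q=id:test-1'"

# Printed rather than executed here - run them against your live cluster.
echo "$UPDATE_CMD"
echo "$QUERY_CMD"
```

If the query returns the document regardless of which node you send it to, distributed indexing and searching are working.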

Summary

I hope this is only the first note about SolrCloud. I know it is very short and skips details about shards, replicas, and the architecture of this solution. Treat it as a simple checklist for a basic (but real) configuration of your cloud.

Published at DZone with permission of Rafał Kuć, author and DZone MVB. (source)
