NoSQL Zone is brought to you in partnership with:

Ayende Rahien is working for Hibernating Rhinos LTD, a Israeli based company producing developer productivity tools for OLTP applications such as NHibernate Profiler (nhprof.com), Linq to SQL Profiler(l2sprof.com), Entity Framework Profiler (efprof.com) and more. Ayende is a DZone MVB and is not an employee of DZone and has posted 485 posts at DZone. You can read more from them at their website. View Full User Profile

Sharding vs. Having Multiple Databases

11.14.2011
| 8675 views |
  • submit to reddit

I was recently at a customer site, and he made a mention of sharding their database in a specific way, putting all of the user generated content on a separate server. I jumped in and told him that this isn’t sharding. And then I had to explain, both to the customer and to myself, why splitting things up in that way bothered me.

The basic idea is the same, we want to split the information over several databases to reduce the load on each individual server, and get higher capacity. We’ll use the Flickr model for this post. What the client suggested was:

image


And what I suggested was:

image


On the face of it, there isn’t much difference, right? But I really don’t like calling the first example sharding. From my perspective, this is merely splitting the data into multiple databases, it isn’t sharding. I’ll accept arguments about this still being sharding, but it doesn’t feel right from my perspective.

The main problem with the first option is that in order to actually do something interesting, you have to go to three different servers. On the other hand, by selecting a good sharding function, you can usually serve a single request completely from a single database.

That is important, because databases, whatever they are relational or RavenDB all include multiple ways for you to gather several pieces of information in an efficient manner. The common part about all of those efficient ways to gather multiple pieces of information? They all break down when you are trying to gather different pieces of information from different servers. That is leaving aside things like connection pooling and persistent connections, which are also quite important.

For example, loading an image with its user’s information and its comments would be the following three queries in the first example:

var image = imagesSession.Load<Image>("images/1");
var user = usersSession.Load<User>(image.UserId)
var comments = commentsSession.Load<ImageComments>(images.CommentsId);


Whereas with the second example, we will have:

session
     .Include<Image>(x=>x.UserId)
     .Include(x=>x.CommentsId)
     .Load("images/1");
      

Other things, like transactions across the data set, are also drastically simpler when you are storing all of the related data on the same server.

In short, sharding means splitting the dataset, but a good sharding function would make sure that related information is located on the same server (strong locality), since that drastically simply things. A bad sharding function (splitting by type) would create very weak locality, which will drastically impact the system performance down the road.

Published at DZone with permission of Ayende Rahien, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Liran Zelkha replied on Mon, 2011/11/14 - 8:00am

If you're interested in sharding - try giving ScaleBase a chance - they build a transparent sharding solution for relational databases.

Philopator Ptolemy replied on Mon, 2011/11/14 - 8:41am

For us the most important reason to go with real sharding is that it is infinitely more scalable. As your user base grows, one server will not be enought to handle a signle function, be it users, or images or anything else. What do you do then? Whereas with true sharding you just keep adding and rebalancing shards.

Liran Zelkha replied on Mon, 2011/11/14 - 3:42pm in response to: Philopator Ptolemy

I agree. But even if you distribute customers based on first letter you might run into issues. Some letters can obviously reside on one server, to save resources. Some letters will span across multiple databases. Sharding based on first word or based on functionality is usually not a good sharding decision.

Lindsay Gordon replied on Wed, 2011/11/16 - 1:29pm

Hi all, glad to see that you are enjoying the NoSQL Zone!  Thanks for commenting on Ayende's article, we hope that you are finding other posts that are also worthy of discussion! 

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.