I am still a big nerd in a (not so) small body. A technology freak with years of experience on anything from PL1/mainframe to LAMP. I love to code and talk about coding, especially on state of the art technologies - but only those that make some sense. My specialty is taking good technology and turning it to a cool product. I've done this in my previous roles, and I'm doing it right now with my new company, ScaleBase. Liran has posted 21 posts at DZone. View Full User Profile

Tip: How do you know when to shard your database

  • submit to reddit
We at ScaleBase talk about sharding so much, it’s difficult for us to see why someone wouldn’t want to shard. But just because we’re so enthusiastic about our transparent sharding mechanism, it doesn’t mean we can’t understand the very basic question, “When do I shard?” Well, it’s not the most difficult question to answer. I’ll keep it short: if your database exceeds the memory you have on a single machine, you should shard. If you hit I/O, your performance suffers, and sharding will assist. Why? That’s easy to explain. Databases in general (and MySQL is no exception) try to cache data. Because accessing memory is so much faster than accessing disk (even with SSDs), database providers have developed rather sophisticated caching algorithms. For instance, running a query caches the query and its results. Indexes are stored in memory so that, when running a query, the database doesn’t have to hit the disk twice. But if the database is big, it won’t fit into memory. Sometime even the index won’t fit into memory. This is when you start seeing database performance degradation. So the best date to start sharding is when you can’t add more memory to your database server. This can come sooner rather than later. As we all know, data is booming, and if you’re running in the cloud there is only so much memory your cloud provider will give you. With sharding, every machine has its own data, which fits in RAM. And if you need more – just add an additional shard. The other parameter is the number of concurrent connections. If you reach the limit of connections your machine can handle, it’s time to shard your database. Every sharded database gets less hits/second, requires fewer connections – and can work faster. So, if your database does not fit in memory, or if you have too many concurrent users hitting your database – try out ScaleBase, for our transparent sharding solution. To learn more, try out ScaleBase Transparent Sharding solution at www.scalebase.com
Average: 3 (2 votes)
Published at DZone with permission of its author, Liran Zelkha.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Nigel Eiland replied on Sun, 2011/11/20 - 10:50am

Not a big fan of paragraphs then?

Roger Lindsjö replied on Mon, 2011/11/21 - 7:32am

Basically what you are saying is that if it is too big to fit in ram (and you can not get more ram), get more machines (with more ram).

Are you seriously saying that all data should be no further away than ram? That would increase the cost of datawahouses quite a lot I guess.

And if so, why should it only apply to databases? Why not the file system itself? I need to access logfiles from my runtimes, say a few hundred GBs of data. For maximum performance this should all of course be in ram, but that would not be cost effective. For every 1TB disk I buy I also buy 1TB of ram?

I think sharding is a great way to go (and your technology seems interresting), but a broad statement such as the one you did makes you look not very serious in my eyes. 

Prem Kurian Philip replied on Wed, 2011/11/23 - 2:09pm in response to: Nigel Eiland

Ah! apparently not!



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.