

Martin Fowler on Polyglot Persistence


In 2006, my colleague Neal Ford coined the term Polyglot Programming, to express the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems. Complex applications combine different types of problems, so picking the right language for the job may be more productive than trying to fit all aspects into a single language.

Over the last few years there's been an explosion of interest in new languages, particularly functional languages, and I'm often tempted to spend some time delving into Clojure, Scala, Erlang, or the like. But my time is limited and I'm giving a higher priority to another, more significant shift: the DatabaseThaw. The first drips have been coming through from clients and other contacts and the prospects are enticing. I'm confident in saying that if you're starting a new strategic enterprise application you should no longer assume that your persistence will be relational. The relational option might be the right one - but you should seriously look at the alternatives.

One of the interesting consequences of this is that we are gearing up for a shift to polyglot persistence [1] - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.

This polyglot effect will be apparent even within a single application[2]. A complex enterprise application uses different kinds of data, and already usually integrates information from different sources. Increasingly we'll see such applications manage their own data using different technologies depending on how the data is used. This trend will be complementary to the trend of breaking up application code into separate components that integrate through web services. A component boundary is a good way to wrap a particular storage technology chosen for the way its data is manipulated.
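The component-boundary idea can be sketched roughly like this - a minimal sketch with hypothetical names, where an in-memory dict stands in for a real key-value engine; the point is that application code sees only the interface, never the storage technology behind it:

```python
from abc import ABC, abstractmethod

class SessionStore(ABC):
    """The component boundary: callers see this interface, not the storage tech."""
    @abstractmethod
    def save(self, session_id, data): ...
    @abstractmethod
    def load(self, session_id): ...

class KeyValueSessionStore(SessionStore):
    """One possible backend - a dict stands in for a real key-value store."""
    def __init__(self):
        self._kv = {}
    def save(self, session_id, data):
        self._kv[session_id] = data
    def load(self, session_id):
        return self._kv.get(session_id)

def checkout(store: SessionStore, session_id):
    # Application code depends only on the boundary, so the storage
    # technology behind it can be swapped without touching this logic.
    return store.load(session_id)

store = KeyValueSessionStore()
store.save("s42", {"cart": ["book"]})
print(checkout(store, "s42"))
```

Swapping in a relational-backed implementation of `SessionStore` would leave `checkout` untouched, which is exactly what makes the component boundary a safe place to vary the persistence choice.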

This will come at a cost in complexity. Each data storage mechanism introduces a new interface to be learned. Furthermore, data storage is usually a performance bottleneck, so you have to understand a lot about how the technology works to get decent speed. Using the right persistence technology will make this easier, but the challenge won't go away.

Many of these NoSQL options involve running on large clusters. This introduces not just a different data model, but a whole range of new questions about consistency and availability. The transactional single point of truth will no longer hold sway (although its role as such has often been illusory).

So polyglot persistence will come at a cost - but it will come because the benefits are worth it. When relational databases are used inappropriately, they exert a significant drag on application development. I was recently talking to a team whose application was essentially composing and serving web pages. They only looked up page elements by ID, they had no need for transactions, and no need to share their database. A problem like this is much better suited to a key-value store than the corporate relational hammer they had to use. A good public example of using the right NoSQL choice for the job is The Guardian - who have felt a definite productivity gain from using MongoDB over their previous relational option.
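For a lookup-by-ID workload like that page-serving team's, the entire persistence interface collapses to put and get on a key. A minimal sketch (a dict stands in for a real key-value store; the names are hypothetical):

```python
# A page-composition workload: elements are only ever fetched by ID.
# With a key-value store there is no schema, no joins, and no transactions -
# just a put and a get. (The dict stands in for the real store.)
page_elements = {}

def put_element(element_id, html_fragment):
    page_elements[element_id] = html_fragment

def get_element(element_id):
    return page_elements.get(element_id)

def render_page(element_ids):
    """Compose a page from fragments looked up purely by ID."""
    return "".join(get_element(eid) or "" for eid in element_ids)

put_element("header", "<h1>News</h1>")
put_element("footer", "<p>(c) 2011</p>")
print(render_page(["header", "footer"]))
```

Nothing here needs a query language or a transaction manager, which is why forcing such a workload through a corporate relational database adds drag without adding value.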

Another benefit comes in running over a cluster. Scaling to lots of traffic gets harder and harder to do with vertical scaling - a fact we've known for a long time. Many NoSQL databases are designed to operate over clusters and can tackle larger volumes of traffic and data than is realistic with a single server. As enterprises look to use data more, this kind of scaling will become increasingly important. The Danish medication system described at gotoAarhus2011 was a good example of this.

All of this leads to a big change, but it won't be a rapid one - companies are naturally conservative when it comes to their data storage.

The more immediate question is which types of projects should consider an alternative persistence model? My thinking is that firstly you should only consider projects that are at the strategic end of the UtilityVsStrategicDichotomy. That's because utility projects don't have enough benefit to be worth a new technology.

Given a strategic project, you then have two drivers that raise alternatives: either reducing development drag or dealing with intensive data needs. Even here I suspect many projects, probably a majority, are better off sticking with the relational orthodoxy. But the minority that shouldn't is a significant one.

One factor that is perhaps less important is whether the project is new, or already established. The Guardian's shift to MongoDB has been happening over the last year or so on a code base developed several years ago. Polyglot persistence is something you can introduce on an existing code base.

What all of this means is that if you're working in the enterprise application world, now is the time to start familiarizing yourself with alternative data storage options. This won't be a fast revolution, but I do believe the next decade will see the database thaw progress rapidly.

1: As far as I can tell, Scott Leberknight was the first person to start using the term "polyglot persistence".

2: Don't take the example in the diagram too seriously. I'm not making any recommendations about which database technology to use for what kind of service. But I do think that people should consider these kinds of technologies as part of application architecture.


Published at DZone with permission of Martin Fowler, author and DZone MVB.



Adam Gent replied on Thu, 2011/11/17 - 8:11am

I was kind of hoping you would mention how some of the databases are offering asynchronous drivers/connectivity. This is becoming rather important with Node.js-like async platforms and should be a determinant in your persistence choice.

Many of the traditional database drivers in traditional languages are not asynchronous network-wise. That is, they use a thread from a pool, open a socket to the database, and wait for the database to finish its operation. There are some projects for Java, like adbcj, but they are rather immature. (In general, most languages are lacking in event IO, including Java.)

MongoDB, Redis, PostgreSQL and, I believe, MySQL now have asynchronous (event IO) driven connectivity. It's important to consider these offerings if you have an event IO loop platform.
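The blocking-vs-event-loop distinction the comment draws can be sketched with Python's asyncio; `asyncio.sleep` stands in for waiting on the database socket, and the `query` function is hypothetical, not a real driver API:

```python
import asyncio

async def query(sql, delay=0.05):
    # Stand-in for a non-blocking driver call: while this "query" waits
    # on the socket, the event loop is free to run other work.
    await asyncio.sleep(delay)
    return f"rows for: {sql}"

async def main():
    # Two queries issued concurrently on a single thread - total wall
    # time is roughly one delay, not two. A blocking driver would instead
    # pin one pool thread per in-flight query.
    return await asyncio.gather(
        query("SELECT * FROM pages"),
        query("SELECT * FROM users"),
    )

print(asyncio.run(main()))
```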

Rick Jensen replied on Thu, 2011/11/17 - 10:20am

While the proliferation of database options is great, there does not seem to be any good information (at least not any that is easy to find) regarding what technologies are more suitable for particular services or for storing particular kinds of data. Many organizations are stuck in the RDBMS rut and have no hope of getting out of it without some serious education about the options available and when to use them.

If anyone knows of good resources that compare / contrast different database technologies and the advantages / disadvantages of using them for different services, please share them. An article or resource that goes through the categories above by describing the characteristics of each service in terms of data structure and performance / useages, and then goes through some of the major database technologies that are suitable or not suitable for meeting the data needs of that service (and why), would be ideal.

If someone knows of such an article (or perhaps a series of them), please share!

David Parks replied on Thu, 2011/11/17 - 5:50pm in response to: Rick Jensen

Seconded. It feels like it'll take a year to do the research to figure out the best datastore layout for my CMS-like problem. The pages for the various databases are no help, as they all (reasonably) try to explain why they are versatile enough for every problem. Reasonable, but not helpful.


Aaron Digulla replied on Tue, 2011/11/29 - 11:52am

I'm waiting for a meta-database that solves the join problem (joining sets over several data storage options). Without that, business will hate this approach because all non-SQL sources will be unreachable for their beloved reports.

Rick Jensen replied on Wed, 2011/11/30 - 5:13pm in response to: Aaron Digulla

I believe the currently recommended way to report on data that, in production, lives in multiple homes is to break the reporting system out from the OLTP system. The use case of reporting is, as you suggest, often best handled by a relational DB, since it is both highly familiar and also good at connecting data. (This is, of course, a different matter if the volume is massive, in which case there is likely a map-reduce-type system used to aggregate data before it goes into the relational reporting database.)

The benefit of breaking the reporting out from the production system is also seen in application and report performance, application complexity, and scalability. By pulling the data out for reporting, you can scale the reporting system and the production system separately as needed, and traffic on each doesn't impact the other.

The complexity is reduced because the mechanism to pull data from the OLTP system into the reporting system isn't part of the OLTP production system. The reporting side _pulls_ the data it wants, rather than the production app _pushing_ the data. It shifts the work to a separate system.
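The pull-and-denormalize step described above can be sketched like this - dicts stand in for the two production stores and the rows that would land in a relational reporting database, and all names are hypothetical:

```python
# Two production stores using different technologies (dicts stand in):
# orders live in a key-value store, users in a document store.
orders_kv = {"o1": {"user": "u1", "total": 30}, "o2": {"user": "u2", "total": 70}}
users_doc = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}

def pull_into_reporting():
    """The reporting side pulls and joins; the OLTP systems push nothing."""
    rows = []
    for order_id, order in sorted(orders_kv.items()):
        user = users_doc[order["user"]]
        # A denormalized row, ready for the relational reporting DB,
        # where familiar SQL reporting tools can take over.
        rows.append({"order": order_id, "name": user["name"], "total": order["total"]})
    return rows

report = pull_into_reporting()
print(report)
```

The join across storage technologies happens here, in the extraction step, rather than at query time - which is what keeps the non-SQL sources reachable for the relational reports.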
