Index Prioritization in RavenDB: Problems
Every now and then we get a request for index prioritization in RavenDB. The requests are usually in the form of:
I have an index (or a few indexes) that are very important, and I would like them to be update before any other indexes.
That is really nice request, but it ignores a lot of actual really hard implementation details.
In any prioritization scheme, there is the risk of starvation. If index A has to complete before index B, what happens if we have enough writes that index A is always busy? That means that index B will never get to index, and it will fall further & further behind. There are well known algorithms to handle this scenario, mostly from the OS thread scheduling point of view.
You could incrementally increase the priority of an index every time that you skipped updating it, until at some point it has higher priority than all the other indexes and gets its moment in the sun. That is workable if all you are working with are threads, and there isn’t a significantly different execution environment for a thread to run.
For RavenDB indexes, there is actually a major difference in the execution environment depending on when you are running. We have a lot of optimizations inside RavenDB to avoid IO, in particular, we do a lot of work so indexes do not have to wait for their input, we do parallel IO, optimized insert hooks, and a whole bunch of stuff like that. All of those assume that you all of the indexes are going to run together, however.
We already have the feature that a very slow index will be allowed to run while the rest of the indexes are keeping up, but that is something that we really try to avoid (we give it a grace period of 3/4 as much time as all of the other indexes combined). That is because the moment you have out of sync indexes, all that hard work is basically going to be wasted. You are going to be needing to load the documents to be indexed multiple times, creating more load on the server. Keeping the documents that were already indexed waiting in memory for the low priority index to work on is also not a good idea, since that is going to cause RavenDB to consume potentially a LOT more memory.
I have been thinking about this for a while, but it isn’t an easy decision. What do you think?
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)