The Optimization That (Often) Isn’t: Index Merge Intersection
Prior to version 5.0, MySQL could only use one index per table in a given query without any exceptions; folks that didn’t understand this limitation would often have tables with lots of single-column indexes on columns which commonly appeared in their WHERE clauses, and they’d wonder why the EXPLAIN plan for a given SELECT would show N possible index choices but only one index actually used.
To some extent, MySQL 5.0 and later changed this situation with the introduction of the “index merge” optimizer plan. The basic idea behind index merge is that for certain types of queries which contain WHERE clauses with columns that had single-column indexes on them, MySQL could sometimes make use of the multiple indexes. For instance, “SELECT foo FROM bar WHERE indexed_colA = X OR indexed_colB = Y” might use the index merge union algorithm, which would *simultaneously* (this is important, as we will see below) scan the indexes on indexed_colA and indexed_colB and then do a set-theoretic union of the two result sets. Replace that “OR” with an “AND” and you might see the set-theoretic intersection of the result sets. Sounds like this could be a significant performance win, perhaps. Sometimes it is. Other times, it is a major performance killer.
It’s fairly straightforward to tell when MySQL is doing this; run an EXPLAIN on your SELECT and you’ll see “index_merge” as the type of query and the type of index merge algorithm in the “Extra” field. Here’s an example of a table and a query which I ran into recently that did indeed attempt to use index_merge, but with disappointing results.
CREATE TABLE users ( user_id INT UNSIGNED NOT NULL AUTO_INCREMENT, parent_id INT NOT NULL DEFAULT 0, status TINYINT UNSIGNED NOT NULL DEFAULT 0, user_type TINYINT UNSIGNED NOT NULL DEFAULT 0, username VARCHAR(20), ... other columns here ... PRIMARY KEY(user_id), INDEX `parent_id` (parent_id), INDEX `status` (status), INDEX `user_type` (user_type) ) ENGINE=InnoDB; SELECT user_id FROM user WHERE user_type=2 AND status=1 AND parent_id=0 AND user_id > 2938575 ORDER BY user_id LIMIT 1;
The users table in question here had approximately 4.5M rows, and an EXPLAIN produced the following execution plan:
*************************** 1. row *************************** id: 1 select_type: SIMPLE table: users type: index_merge possible_keys: PRIMARY,parent_id,status,user_type key: user_type,status,parent_id key_len: 2,2,4 ref: NULL rows: 8100 Extra: Using intersect(user_type,status,parent_id); Using where; Using index; Using filesort
At first glance, this might not look too bad. MySQL is using three different indexes to search approximately 8100 out of 4.5M rows to return the data we want. Since we’re only requesting the user_id column, which is the PK for this table and implicitly appended to each secondary index, our merged indexes are covering for the query, so that’s also good. When we ran it, however, it took 3 seconds. Not only that, but it came up at the top of a pt-query-digest report which showed that it was the most commonly executed query on the server by two orders of magnitude. Other queries were running maybe 1000 or 2000 times, but this one?
# Rank Query ID Response time Calls R/Call Apdx V/M It # ==== ================== ================== ====== ======== ==== ===== == # 1 0xCF90A421D9FB8717 1022239.0748 66.9% 311599 3.2806 0.50 0.02 SELECT user
Note the low V/M here. This query *always* takes about 3.3 seconds to execute. Yet it appears so benign. What gives?
One word: selectivity. Out of 4.5M rows in the table, 4M of them had status = 1, and 4.4M of them had parent_id = 0. Only 17K, however, matched user_type = 2. It’s worth noting here that the (parent_id, status, user_type) = (0, 1, 2) tuple chosen by pt-query-digest as the sample query wasn’t a fluke or unfortunately-chosen representative; in talking to the customer, I discovered that it’s this particular set of conditions which was *always* used by this query. We wouldn’t see, for instance, this query looking for a different status or a different user type. When we think about what this means in the context of index merge, it means that MySQL is really looking at a search space closer to 8.5M rows, not 8100, because it has to read the index entries for each of the columns in the WHERE clause that are included in this merge operation and then perform a set intersection on the results. Here’s one situation where trying to do multiple things at the same time ends up taking longer – if we stopped after retrieving the 17k rows matching user_type=2, we’d save ourselves a lot of work. Ooops.
The good news is that once identified, there are multiple ways to fix this issue, and they are all straightforward to implement. The bad (sort of) news is that each situation is going to be different, and there isn’t likely to be a one-size-fits-all silver bullet solution.
For example, one option would be to simply adjust the optimizer_switch configuration setting and disable index_merge_intersection. It’s a dynamically-adjustable setting and can be modified globally or per-session:
SET [GLOBAL|SESSION] optimizer_switch="index_merge_intersection=off";
This is the database equivalent of kicking the door in when you’ve forgotten your keys. Yes, it’s quick, and it will get you back in the house, but now you have no door, and your cats might get out. Or your neighbors will see you walking around in your underwear. Or any of a panoply of other unintended and potentially undesirable consequences, many of which you might not see right away. Disabling this setting might speed up this query (it did), but it might also negatively impact other queries which were benefitting from this optimization (it did that, too). Effective for quick testing (e.g., to find out if index merge is actually a problem) but probably not the best way to go for a permanent solution.
Another option is, of course, to change the indexing on the table. For instance, if 4.4M out of 4.5M rows have parent_id = 0, we might be tempted to remove the index on parent_id if queries for parent_id != 0 could be satisfied in some other way. After all, if an index isn’t there, MySQL can’t attempt to use it, but this has the same challenges that are present with the previous solution. What we do in order to help one query execute more quickly may end up creating unintended negative consequences elsewhere.
We could rewrite the query as a sub-select. This will work properly, and it won’t interfere with other queries, but IMHO it’s not very elegant and it’s a bit messy:
SELECT user_id FROM (SELECT user_id,parent_id,status FROM users WHERE user_type=2 AND user_id > 2938575 ORDER BY user_id) t WHERE t.parent_id=0 AND t.status = 1 LIMIT 1
Finally, we can use index hints. Index hints are exactly what they sound like – “help” for the optimizer to favor or disfavor a particular index or set of indexes so that the query executes in a specific fashion. Index hints are fairly unintrusive and can be applied on a query-by-query basis, so there’s no need to go mucking about with server configuration, altering the table structure, or creating messy sub-selects. In cases where the optimizer just can’t get things right, and you, as the developer or DBA, know exactly how the server should be finding and manipulating data, they can be quite effective.
In the case of this query, since we know that it’s the index on (user_type) that is the one that MySQL should be looking at, we can rewrite the query like so, with the following result:
SELECT user_id FROM users USE INDEX(user_type) WHERE user_type=2 AND user_id > 2938575 AND parent_id=0 AND status=1 ORDER BY user_id LIMIT 1; mysql> EXPLAIN SELECT user_id FROM users USE INDEX(user_type) WHERE user_type=2 AND user_id > 2938575 AND parent_id=0 AND status=1 ORDER BY user_id LIMIT 1\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: users type: ref possible_keys: user_type key: user_type key_len: 2 ref: const rows: 32404 Extra: Using where 1 row in set (0.00 sec)
Making this change turns what was a 3 second query into a millisecond query. It’s hard to argue with success, although in many cases I recommend against using index hints for one very simple reason: stuff happens. Just a few possibilities:
- A new version of MySQL is released that has some new optimizer functionality that you end up missing out on.
- Junior developers might see USE INDEX() and suddenly start believing that they can always outsmart the optimizer by manually manipulating indexes in every query they write (not that I have ever seen this happen or anything).
- The person that originally added the USE INDEX() leaves the company and take the knowledge and rationale behind it out the door.
- The data change so much such that the index hint is now doing more harm than good.
Caveats aside, there are some very good reasons to use index hints for this particular query, and it was ultimately the approach I suggested as a short/medium-term solution while they did some re-work on the process as a whole. When I came back the next day and we looked at the Cacti reports, the “InnoDB rows examined” graph had dropped several orders of magnitude and CPU utilization on the machine had dropped by about 2/3. Not a bad outcome at all.
And so, what do we learn here?
- Just because the output from EXPLAIN “appears to look good” that it actually is good.
- Whenever you see index_merge_intersection, it’s a good time to make sure that the indexes involved have good selectivity (a good practice in general, to be sure).
- As they say in the Perl community, there’s more than one way to do it. Otherwise known as beware the law of unintended consequences.
- Sometimes we have to sacrifice “elegance” (IMHO, index hints are inelegant and hackish) to produce the best possible outcome. C’est la vie.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)