
Adventures in Archiving

02.11.2013

This post comes from the MySQL Performance Blog.

One of our Remote DBA service clients recently had an issue with the on-disk size of a particular table; in short, the table held some 25 million rows of application audit data, 345GB on disk, recorded solely for debugging that may or may not ever be needed.

Faced with the task of both reducing the size on disk and ensuring that the data remains available at all times (firmly ruling out off-device archiving), we began to explore the available options for the client.

We settled on the following options for further testing.

  1. InnoDB’s inbuilt row compression
  2. MySQL’s inbuilt compress() function leveraging zlib to provide compression of data.
  3. MySQL’s Archive storage engine.

Tooling

The tooling phase consisted of producing scripts to generate representative payloads for comparison; in this case, some IMDB XML data sliced to fit within the VARCHAR(1000) fields per the client's requirements.

First off, the schema tables:

CREATE TABLE `archive_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dat` varchar(1000) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=ARCHIVE DEFAULT CHARSET=latin1;
CREATE TABLE `compressed_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dat` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `compressed_row` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dat` varchar(1000) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;
CREATE TABLE `non_compressed_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dat` varchar(1000) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

As can be seen, we have the following:

  • archive_table (the only change here is the engine)
  • compressed_data (the change here being VARCHAR(1000) -> BLOB, with the intention of using compress(); see the sketch after this list)
  • compressed_row (the row format set to COMPRESSED; ensure innodb_file_format is set to Barracuda, as Antelope does not support compression)
  • non_compressed_data (our test control)
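
A minimal sketch of what the compress() route implies for the application, and of enabling Barracuda for the compressed row format (the payload literal below is a placeholder, not the client's data, and innodb_file_per_table is assumed to be enabled):

-- Enable the Barracuda file format so ROW_FORMAT=COMPRESSED is usable.
SET GLOBAL innodb_file_format = 'Barracuda';

-- The compress() route: the application compresses on write...
INSERT INTO compressed_data (dat)
VALUES (COMPRESS('<xml>representative audit payload</xml>'));

-- ...and decompresses on read.
SELECT id, UNCOMPRESS(dat) AS dat FROM compressed_data ORDER BY id DESC LIMIT 1;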

Now we needed to generate the ‘payload’ data; for this I quickly authored a bash script to provide approximately 1M rows per table.

Scripts are available at GitHub

The script adventures_in_archiving_load_schema.sh will first drop and create the schema adventures_in_archiving.

It will then attempt to regenerate the payload file adventures_in_archiving_data.sql (4GB) if it does not exist in the current folder.

It then loads the payload into the adventures_in_archiving schema. This is by no means a fast process, however, taking some 4 minutes for data generation and a subsequent 63 minutes for import.
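
The figures that follow were taken as size on disk per table. As a rough cross-check (an approximation from the data dictionary, not necessarily how the numbers in the results table were produced), a query along these lines gives a quick per-engine view once the load has finished:

SELECT table_name,
       engine,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS approx_size_mb
FROM information_schema.TABLES
WHERE table_schema = 'adventures_in_archiving'
ORDER BY approx_size_mb;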

Results

Size on disk | Engine                                        | Compression Ratio
7.1M         | Archive                                       | 173.07:1
304M         | InnoDB row_format=compressed key_block_size=4 | 4.04:1
648M         | compress()                                    | 1.897:1
1.2G         | InnoDB                                        | 1:1

Immediately, Archive seemed like the obvious choice; there are, however, bugs to be taken into consideration.

Due diligence

What follows is a list reduced to the most likely to affect production deployments.

  • Use of Auto Increment is not recommended. #37182  #37871 #40216
  • If disk space is already critically low and the originating table is MyISAM, there may be data loss if there is not enough space to complete the operation. #37648
  • Use of INSERT DELAYED is not recommended. #48339
  • Use of statement based replication is not recommended. #48343
  • Flush buffers before attempting a table checksum. #57188
  • Repair tools will not function on Archive tables. #62054

Summary

Archive, in this instance, is the solution to the issue of disk storage; however, you must always take into account the caveats and bugs related to a storage engine before going ahead and using it. You also need to be aware that Archive does not have good performance.

e.g.

mysql> select id from archive_table order by id desc limit 1;
+---------+
| id      |
+---------+
| 1143883 |
+---------+
1 row in set (2.04 sec)
mysql> select id from compressed_row order by id desc limit 1;
+---------+
| id      |
+---------+
| 1143883 |
+---------+
1 row in set (0.00 sec)

Also, you must flush the buffer, for example by running a SELECT on the table, before inserts are written to disk.
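
A minimal sketch of that flushing step, plus the usual post-load repack (FLUSH TABLE and OPTIMIZE TABLE here are based on general Archive behaviour rather than anything tested in this exercise):

-- Any SELECT against the table flushes buffered inserts to disk.
SELECT COUNT(*) FROM archive_table;

-- FLUSH TABLE should have the same effect (assumption, as noted above).
FLUSH TABLE archive_table;

-- After bulk loading, OPTIMIZE TABLE repacks the table for better compression.
OPTIMIZE TABLE archive_table;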

Followup

A few alternative suggestions have been made; as such, there will soon be a follow-up post taking a quick look at the size on disk for TokuDB and Infobright.


