
Load 8 Million Rows in 88 Seconds – NewSQL Speed

09.05.2012

The content of this article was originally written by Pas Apicella.

At some point, any data modernization project is going to require a load of legacy data. With an in-memory, distributed data store like SQLFire, customers often ask (as in this case) about load times, because they can be sitting on 50-100 GB of data and don't want to wait days. For those unfamiliar with NewSQL databases, this post should give you a good sense of how we loaded 8 million rows in 88 seconds. Extrapolating from that test, we should be able to load roughly 40 GB of data in about an hour.

Java developers looking for ideas on speeding up large volumes of calculations, transforms, or validations may also want to look at a previous post, where SQLFire is used with Spring Batch.

With SQLFire, we take a multi-threaded load approach from a CSV file. Below, I've outlined the load strategy in 8 steps, with an explanation of why each step was done.

Please note:

  • This is based on SQLFire 1.0.3.
  • In this setup we have two VMs, each with 18 GB of memory, sharing the same home directory but with separate disks for their own disk stores and log files.

1. Set up SQLFire in your PATH as shown below.

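A minimal sketch of that setup, assuming a typical install location (the SQLFIRE_HOME path below is hypothetical; adjust it to your environment):

# add the SQLFire bin directory to the PATH (install path is an assumption)
export SQLFIRE_HOME=/opt/vmware/sqlfire
export PATH=$SQLFIRE_HOME/bin:$PATH

# confirm the sqlf utility now resolves
which sqlf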

2. On VM2, start a locator as shown below.

sqlf locator start \
  -J-javaagent:/export/w2-gst-cert-20a/users/swale/sqlfire/build-artifacts/linux/product-sqlf/lib/sqlfire.jar \
  -peer-discovery-address=w2-gst-cert-21 \
  -peer-discovery-port=41111 \
  -client-bind-address=w2-gst-cert-21 \
  -client-port=1527 \
  -dir=/w2-gst-cert-21a/users/papicella/locator &


3. On VM1 and VM2, start one SQLFire server node each (2 nodes total) as shown below.
Each host VM has 18 GB of memory; VM2 is also where the locator process was started in step 2. The output below shows the contents of the scripts you need to run to actually start the servers.

[Thu Jul 12 21:24:47 papicella@:~/pas/demo/load-demo ] $ cat start-server-w2-gst-cert-20.sh
sqlf server start \
  -J-javaagent:/export/w2-gst-cert-20a/users/swale/sqlfire/build-artifacts/linux/product-sqlf/lib/sqlfire.jar \
  -server-groups=MYGROUP \
  -initial-heap=8024m -max-heap=8024m \
  -locators=w2-gst-cert-21[41111] \
  -client-bind-address=w2-gst-cert-20 \
  -client-port=1527 \
  -dir=/w2-gst-cert-20a/users/papicella/server1 &

[Thu Jul 12 21:27:19 papicella@:~/pas/demo/load-demo ] $ cat start-server-w2-gst-cert-21.sh
sqlf server start \
  -J-javaagent:/export/w2-gst-cert-20a/users/swale/sqlfire/build-artifacts/linux/product-sqlf/lib/sqlfire.jar \
  -server-groups=MYGROUP \
  -initial-heap=8024m -max-heap=8024m \
  -locators=w2-gst-cert-21[41111] \
  -client-bind-address=w2-gst-cert-21 \
  -client-port=1528 \
  -dir=/w2-gst-cert-21a/users/papicella/server2 &

4. Create a diskstore using SQL as shown below.

CREATE DISKSTORE STORE1 WRITEBUFFERSIZE 19292393;

5. Set a global eviction policy for tables in the group “MYGROUP”.

call sys.set_eviction_heap_percentage_sg (85, 'MYGROUP');

6. Create a table as shown below.
The table uses the separate disk store created above for persistence. Because persistence is enabled, any overflow (evicted) rows can simply be dropped from memory rather than written out again, since they are already on disk. The table writes its changes to the disk store asynchronously, so each insert only has to complete in memory before moving on to the next change.

drop table emp;

CREATE TABLE emp
(empno integer NOT NULL primary key,
ename varchar(20),
hiredate date,
deptno int
)
SERVER GROUPS (MYGROUP)
EVICTION BY LRUHEAPPERCENT EVICTACTION OVERFLOW
PARTITION BY COLUMN (empno)
REDUNDANCY 1
PERSISTENT 'STORE1' ASYNCHRONOUS;

7. Pre-create buckets ready for loading data by running the SQL as shown below.
If this isn't done, SQLFire will create each bucket automatically the first time an insert targets a bucket that doesn't yet exist.

call sys.create_all_buckets('APP.EMP');

Note: Alternatively, you can rebalance the buckets after the load, as shown below.

call sys.rebalance_all_buckets();

8. Run a script to load the data; the output below shows it took 88 seconds to load 8 million rows into the 2-node cluster.
In this example, we load a CSV file named "EMP.dat", which holds 8 million rows like the following and is 371 MB on disk.

Example:

1,LkrpCkpGQpsUvZrFWAic,2012-02-29 05:35:10,2288856
2,gfbuiiWcUxoDOgeHuth,2012-01-10 12:57:18,1243600
3,GnR,2012-03-04 20:47:06,1533872
4,uxaGakuW,2012-03-17 20:57:34,2451659
5,coIrAvnPtvOZJirnea,2012-01-29 15:36:20,3731769
6,FJvnIjy,2012-03-24 07:37:59,4137455
7,rjnWsfbLldRtUSBDdZew,2012-01-24 13:09:22,3416189
8,DpbvPBkalbuZbBItgoJp,2012-03-28 13:41:56,6497992
9,jaba,2012-01-30 00:56:13,151525
10,EkDdiPeHBn,2012-04-02 12:05:25,630473
.....
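If you want to produce a similar test file yourself, a short script can generate one. The sketch below is not from the original post, and the name, date, and deptno distributions are arbitrary assumptions:

# hypothetical generator for an EMP.dat-style file (8 million rows)
awk 'BEGIN {
  srand();
  for (i = 1; i <= 8000000; i++) {
    # empno, synthetic ename, hiredate in early 2012, random deptno
    printf "%d,emp%d,2012-%02d-%02d %02d:%02d:%02d,%d\n",
           i, i, int(rand()*4)+1, int(rand()*28)+1,
           int(rand()*24), int(rand()*60), int(rand()*60),
           int(rand()*7000000);
  }
}' > EMP.dat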


"EMP.dat" exists only on VM2, so we need to connect to the SQLFire server node on that host and make sure it alone performs the load. To do that we set load-balance=false in the connect string, and we set read-timeout=0 so the client doesn't time out if the load takes longer than the default of 5 minutes.

loadEMP.sh Script

sqlf <<!
connect client 'w2-gst-cert-21:1528;load-balance=false;read-timeout=0';
ELAPSEDTIME on;
call syscs_util.import_table_ex('APP' /* schema */,
'EMP' /* table */,
'/w2-gst-cert-21a/users/papicella/data/EMP.dat' /* file path as seen by server */,
',' /* field separator */,
NULL /* character delimiter */,
NULL /* codeset */,
0 /* replace: 0 = append */,
0 /* don't lock the table */,
6 /* number of threads */,
0 /* case-sensitive names */,
NULL /* custom class for data transformation or NULL to use the default inbuilt Import class */,
NULL /* error file */);
!


Note: Six threads is more than enough; setting this too high can have an adverse effect. Six proved to be a good number for this multi-threaded load test.

[Thu Jul 12 19:03:23 papicella@:/w2-gst-cert-21a/users/papicella/data ] $ ./loadEMP.sh
sqlf version 10.4
sqlf> sqlf> sqlf> > > > > > > > > > > Statement executed.
ELAPSED TIME = 88324 milliseconds
sqlf>
[Thu Jul 12 19:04:58 papicella@:/w2-gst-cert-21a/users/papicella/data ] $
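As a quick sanity check (not part of the original run), you can also confirm the row count before looking at memory usage; it should report 8,000,000, matching the Row Count figures in the memory query below:

sqlf> select count(*) from emp;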


Finally, run the following query to confirm the data has indeed been loaded and to see, from a memory perspective, the overhead of storing 8,000,000 rows on each of the two nodes.


sqlf> select sqlentity, memory, substr(id, 1, 50) "Id" FROM sys.memoryAnalytics -- SQLFIRE-PROPERTIES sizerHints=withMemoryFootPrint;
SQLENTITY                                   |MEMORY                                                |Id
------------------------------------------------------------------------------------------------------------------------------
APP.EMP (Entry Size, Value Size, Row Count) |832000000,204019274,8000000                           |w2-gst-cert-21(10674)<v2>:2810/48871
APP.EMP (sqlfire,gemfire,others)            |241842808,229176368,196449984 (667469160 = 636.55 mb) |w2-gst-cert-21(10674)<v2>:2810/48871
APP.EMP (Entry Size, Value Size, Row Count) |832000000,204019274,8000000                           |w2-gst-cert-20(8550)<v1>:35007/36633
APP.EMP (sqlfire,gemfire,others)            |277524448,262689608,225528376 (765742432 = 730.27 mb) |w2-gst-cert-20(8550)<v1>:35007/36633

4 rows selected
ELAPSED TIME = 149673 milliseconds
sqlf>

More Information

A. Row Overhead

Currently, SQLFire's overhead for a persistent entry is around 100 bytes, which adds up to roughly 800 MB for 8,000,000 rows. This needs to be factored in when sizing memory for tables that use persistence.
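As a rough cross-check against the memoryAnalytics output shown above (a back-of-envelope estimate, not from the original post):

832,000,000 bytes Entry Size / 8,000,000 rows = ~104 bytes per entry
~100 bytes/row x 8,000,000 rows = ~800 MB of overhead per copy of the table

With REDUNDANCY 1, each row is held on both nodes, so the cluster carries two copies of this overhead.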

B. Number of threads to load with

When using syscs_util.import_table_ex, it is best not to use more than 6 threads for the load; adding more than 6 yields no further benefit. The example below uses 6 threads to load the data into SQLFire, which in our testing was the most efficient setting.

sqlf <<!
connect client 'w2-gst-cert-21:1528;load-balance=false;read-timeout=0';
ELAPSEDTIME on;
call syscs_util.import_table_ex('APP' /* schema */,
'EMP' /* table */,
'/w2-gst-cert-21a/users/papicella/data/EMP.dat' /* file path as seen by server */,
',' /* field separator */,
NULL /* character delimiter */,
NULL /* codeset */,
0 /* replace: 0 = append */,
0 /* don't lock the table */,
6 /* number of threads */,
0 /* case-sensitive names */,
NULL /* custom class for data transformation or NULL to use the default inbuilt Import class */,
NULL /* error file */);
!


C. Connecting as a client where the data file exists

In the example below we ensure that long-running load clients don't time out. By default, clients time out after 5 minutes, so we set read-timeout=0 in the connect string. At the same time, we want to run the load only from the node where the data file exists, so we connect directly to that server and turn load balancing off (load-balance=false) to ensure the other node doesn't attempt to read the file as well.

sqlf <<!
connect client 'w2-gst-cert-21:1528;load-balance=false;read-timeout=0';
ELAPSEDTIME on;
call syscs_util.import_table_ex('APP' /* schema */,
'EMP' /* table */,
'/w2-gst-cert-21a/users/papicella/data/EMP.dat' /* file path as seen by server */,
',' /* field separator */,
NULL /* character delimiter */,
NULL /* codeset */,
0 /* replace: 0 = append */,
0 /* don't lock the table */,
6 /* number of threads */,
0 /* case-sensitive names */,
NULL /* custom class for data transformation or NULL to use the default inbuilt Import class */,
NULL /* error file */);
!
Published at DZone with permission of its author, Stacey Schneider. (source)
