
Stephen Forte is the Chief Strategy Officer at Telerik and sits on the board of several start-ups including Triton Works. Stephen is also the Microsoft Regional Director for the NY Metro region and speaks regularly at industry conferences around the world. He has written several books on application and database development, including Programming SQL Server 2008 (MS Press). Stephen holds an MBA from the City University of New York.

Building a Data Warehouse, Part 4: Extraction, Transformation, and Load

11.30.2012


In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data warehouse tables. Today we are going to look at how to populate those tables and keep them in sync with your OLTP system.

No matter where your data warehouse is located, the biggest challenge, especially if you plan to do real-time reporting off of it, is that the data is published from the transactional (OLTP) system. By definition, the data warehouse is out of date compared to the OLTP system.

Usually when you ask the boss, "How much latency can you accept between the OLTP and data warehouse systems?" the boss will reply: none. While that is impossible, the more time and money you have to develop the system, the closer to real time you can get. Always bring up the American Express example from Part I as proof that your system can accept at least some latency.

If you remember the code examples from Part II, most of the time you have to query the OLTP data, aggregate it, and then load it into your new schema. This process of extracting data from the OLTP system, transforming it, and loading it into the data warehouse is known as ETL. The workflow usually goes like this: at a pre-set time (on every change, hourly, nightly, or weekly), query the OLTP data (extraction), aggregate and flatten it (transformation), and then copy the transformed data to the data warehouse (star or snowflake) tables (load).

You have several options, ranging from near real-time ETL (very complex and expensive) to nightly data dumps (least complex and least expensive). You can also perform full ETL (wipe out the data warehouse tables completely and refill them on each run) or partial ETL (add only the incremental data for a given time window). The actual load can be done via database triggers on insert/update/delete for near real time, or via simple scheduled SQL batches that wipe the data and run your load scripts. Most likely you will take a hybrid approach, depending on the maturity of your system. You have three basic options, ranging from least complex to most complex:

  • Direct database dump
  • ETL tool
  • Database triggers
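To make the full-versus-partial distinction concrete, here is a minimal T-SQL sketch of a partial (incremental) extraction driven by a watermark. Everything here is illustrative, not from the earlier parts of this series: the etl.LoadLog table is hypothetical, and the query assumes an audit column such as ModifiedDate on Orders, which the stock Northwind schema does not have.

```sql
-- Hypothetical watermark table recording the last successful load
-- CREATE TABLE etl.LoadLog (LastLoadTime DATETIME NOT NULL);
DECLARE @LastLoad DATETIME, @ThisLoad DATETIME = GETDATE();
SELECT @LastLoad = MAX(LastLoadTime) FROM etl.LoadLog;

-- Partial ETL: extract and aggregate only the OLTP orders that changed
-- since the last load (assumes a ModifiedDate audit column)
INSERT INTO dbo.FactOrder (OrderDate, PostalCode, ProductID, EmployeeID,
    ShipperID, [Total Sales], Discount, [Unit Sales], TimeKey)
SELECT o.OrderDate, c.PostalCode, od.ProductID, o.EmployeeID, o.ShipVia,
       SUM(od.UnitPrice * od.Quantity * (1 - od.Discount)),
       AVG(od.Discount),
       SUM(od.Quantity),
       CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112)) --surrogate date key
FROM Orders o
JOIN [Order Details] od ON od.OrderID = o.OrderID
JOIN Customers c ON c.CustomerID = o.CustomerID
WHERE o.ModifiedDate > @LastLoad AND o.ModifiedDate <= @ThisLoad
GROUP BY o.OrderDate, c.PostalCode, od.ProductID, o.EmployeeID, o.ShipVia;

-- Advance the watermark only after the load succeeds
INSERT INTO etl.LoadLog (LastLoadTime) VALUES (@ThisLoad);
```

A full ETL is the same aggregation query without the WHERE clause, preceded by a wipe of the fact table.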

When you have a long time series to publish your data, say nightly or weekly, you can do a direct database dump. The process is pretty straightforward: at a regular interval (or manually), a process queries the OLTP database, performs all of the aggregations, loads the results into a staging data warehouse database, and then wipes the production data warehouse and loads the data in.
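A minimal sketch of that dump in T-SQL, assuming a staging schema (the staging.FactOrder name is hypothetical; the column names match the fact table used later in this article):

```sql
-- Full ETL: rebuild staging from the source, then swap into production
BEGIN TRANSACTION;

-- 1. Wipe and refill the staging fact table (extraction + transformation)
TRUNCATE TABLE staging.FactOrder;
INSERT INTO staging.FactOrder (OrderDate, PostalCode, ProductID, EmployeeID,
    ShipperID, [Total Sales], Discount, [Unit Sales], TimeKey)
SELECT OrderDate, PostalCode, ProductID, EmployeeID,
       ShipperID, [Total Sales], Discount, [Unit Sales], TimeKey
FROM Northwind.dwh.FactOrder; -- or the aggregation query against the OLTP tables

-- 2. Wipe the production table and load the staged data
TRUNCATE TABLE dbo.FactOrder;
INSERT INTO dbo.FactOrder SELECT * FROM staging.FactOrder;

COMMIT TRANSACTION;
```

Running both steps in one transaction keeps readers from ever seeing an empty production table.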

Another option is to use an ETL tool. A good example is SQL Server Integration Services (SSIS) if you are using Microsoft SQL Server. (Actually, SSIS will work with multiple databases; you just need a SQL Server host.) A modern ETL tool gives you the ability to segment the work into logical groups, drive the control flow based on the success or failure of each step, and roll back on failure.


A typical workflow with an ETL tool is that the ETL runs on a schedule (or based on a condition, such as a message arriving on a queue or a record written to an admin table) and has one or more parameters passed to it. The parameter is usually a time series, and the ETL performs all of the extraction on data from the OLTP database filtered by that parameter. An ETL tool is the most likely solution you will employ, especially if you have to make frequent updates from your OLTP system to your data warehouse.
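The admin-table variant of that workflow can be sketched in T-SQL as follows. The etl.EtlRequest table and its columns are an assumed design, not something from SSIS itself; the package (or a SQL Agent job) polls the table and uses the time window as its parameter.

```sql
-- Hypothetical control table that the ETL process polls
CREATE TABLE etl.EtlRequest (
    RequestID   INT IDENTITY PRIMARY KEY,
    WindowStart DATETIME NOT NULL,
    WindowEnd   DATETIME NOT NULL,
    Status      VARCHAR(10) NOT NULL DEFAULT 'PENDING' -- PENDING/RUNNING/DONE/FAILED
);

-- The OLTP side requests a load of the last hour
INSERT INTO etl.EtlRequest (WindowStart, WindowEnd)
VALUES (DATEADD(HOUR, -1, GETDATE()), GETDATE());

-- The ETL process claims a pending request and reads its time window,
-- which becomes the extraction filter
UPDATE TOP (1) etl.EtlRequest
SET Status = 'RUNNING'
OUTPUT inserted.RequestID, inserted.WindowStart, inserted.WindowEnd
WHERE Status = 'PENDING';
```

The OUTPUT clause hands the claimed window back to the caller in a single atomic statement, so two concurrent ETL runs cannot grab the same request.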

Another option is to use database triggers. For those of you who don't know a lot about triggers: they can lead to evil. ;) That said, triggers are events that fire when data changes, and you can write SQL code, even ETL code, that runs in response. Triggers are hard to debug and difficult to maintain, so I would only use one when you need "real time" updates to your data warehouse, and even then the trigger should only write a record into an admin table that your ETL process polls to get started.
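A sketch of that restraint in T-SQL: the trigger only signals a hypothetical etl.EtlRequest admin table (an assumed schema with WindowStart/WindowEnd columns) and leaves the actual ETL to the polling process.

```sql
-- Hypothetical trigger on the OLTP Orders table: it does NOT run the
-- ETL itself, it only records that something changed
CREATE TRIGGER trg_Orders_QueueEtl
ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Signal the polling ETL process via the admin table
    INSERT INTO etl.EtlRequest (WindowStart, WindowEnd)
    VALUES (DATEADD(MINUTE, -5, GETDATE()), GETDATE());
END;
```

Keeping the trigger body to a single INSERT means the OLTP transaction pays almost no penalty, which is the main objection to triggers in the first place.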

A common design pattern with a sophisticated ETL is to do all of the ETL into a staging data warehouse and use a check mechanism to verify that the ETL was successful. Once this check passes (performed either by a computer or by a human, depending on the system), you push the data from the staging tables to the production tables. There is a SQL statement specifically for this process: MERGE. MERGE joins two tables on their keys, compares the data, and performs the inserts, updates, and deletes needed to keep them in sync. Here is how we would do that with a MERGE statement and our fact table from Part II.

--MERGE the source fact table into the target fact table
MERGE dbo.FactOrder AS t --target
USING Northwind.dwh.FactOrder AS s --source
   ON t.ProductID  = s.ProductID
  AND t.EmployeeID = s.EmployeeID
  AND t.ShipperID  = s.ShipperID
  AND t.TimeKey    = s.TimeKey --joining keys

--record exists in target, but the data differs
WHEN MATCHED AND (s.OrderDate     != t.OrderDate OR
                  s.PostalCode    != t.PostalCode OR
                  s.[Total Sales] != t.[Total Sales] OR
                  s.Discount      != t.Discount OR
                  s.[Unit Sales]  != t.[Unit Sales]) THEN
    UPDATE SET t.OrderDate     = s.OrderDate,
               t.PostalCode    = s.PostalCode,
               t.[Total Sales] = s.[Total Sales],
               t.Discount      = s.Discount,
               t.[Unit Sales]  = s.[Unit Sales]

--record exists only in the target, so get rid of it
WHEN NOT MATCHED BY SOURCE THEN
    DELETE

--record exists only in the source, so add it
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderDate, PostalCode, ProductID, EmployeeID,
            ShipperID, [Total Sales], Discount, [Unit Sales], TimeKey)
    VALUES (s.OrderDate, s.PostalCode, s.ProductID, s.EmployeeID,
            s.ShipperID, s.[Total Sales], s.Discount, s.[Unit Sales], s.TimeKey); --the semicolon is required!

Next we will wrap up with the application development options.

Published at DZone with permission of Stephen Forte, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Mark Unknown replied on Tue, 2012/12/04 - 2:30pm

As I said before, you can use SOA too, especially if you are writing a brand new system. Chances are other systems will want to integrate with your new one. Why do something special for a DW?

I am not sure what "realtime ETL" is, but it is only more costly than "nightly" in that it happens more often.

Read the chapter about ETL in Dave Chappell's ESB book.
