October 3, 2010 Leave a comment
When working with exceptionally large data warehouses, you invariably find yourself looking to improve the performance of your ETL everywhere you can. One of the easiest ways to save time is to bypass your staging area altogether. If you can accomplish all of the transformations in a single step, then why not deliver the results directly to the presentation area? In this scenario, surely moving your millions of source rows into a staging area just to turn around and move them again into the presentation area is a waste of time right?
There can be no doubt that time and system resources are consumed by the staging process, so why waste that time mucking about in the staging area? Quite simply: validation and recovery.
Occasionally it is possible to deliver the data, validate, and then retract the data if errors are found, but introduce a single attribute that is updatable and recovery becomes impossible (or extremely messy at best).
Fact tables are the best candidates for this thinking because most well-designed facts would be inserts only, so as long as you can identify the ETL run in which the fact record was inserted (typically with an inserted-date or a batch id) recovery is not too difficult.
Dimensions on the other hand are nearly impossible unless every single attribute in the dimension falls into one of the following two categories:
• Type-1 attributes that never change (arguably this is a Type-0 dimension).
• Type-2 SCD’s without any expiration dates.
Moreover, if you deliver a dimension, find an error, and recover (delete the newly inserted dimension records), you run the risk of invalidating the relationships between the dimension table and relevant facts that were also delivered.
The best practice, and one that I follow strictly, is to have empty copies of all the tables in the data warehouse mirrored in the staging area. Unlike the presentation area, these tables include relational constraints between the facts and dimensions, as well as between and dimensions and related bridge tables. If the database supports enabling and disabling relational constraints, the constraints start off disabled.
At the end of the transformation, records are delivered to the staging copies of the production tables. It may be necessary to pull relevant dimensional records from the presentation tables regardless of whether or not there are any changes to the dimensions. Once the staging copies are completely loaded, re-enable the constraints (if supported) and run any other validation tests you can develop to ensure the data is ready for presentation to the user. If, and only if, the staging area relational constraints are successfully re-enabled and the staged data has passed all of your validation tests should you deliver the records to the presentation area.
Kimball and Ross have often used a restaurant kitchen as the metaphor for the staging area and I’ve always liked the analogy. Delivering directly to the presentation area is much like backing the produce truck up to the front door and tossing the food right on the table. For food or data alike, the product you’re delivering is unwashed and may contain bugs.
An old joke asks the question: What’s worse than biting into an apple and finding a worm? The answer is; Biting into an apple and finding half a worm. So what’s worse than opening up a report and finding bad data? Opening up the report and finding half the bad data, or of course swallowing it whole.