ETL from Source to Presentation in One Easy Step (Bad Idea)

When working with exceptionally large data warehouses, you invariably find yourself looking to improve the performance of your ETL everywhere you can.  One of the easiest ways to save time is to bypass your staging area altogether.   If you can accomplish all of the transformations in a single step, then why not deliver the results directly to the presentation area?  In this scenario, surely moving your millions of source rows into a staging area just to turn around and move them again into the presentation area is a waste of time right?

There can be no doubt that time and system resources are consumed by the staging process, so why waste that  time mucking about in the staging area? Quite simply: validation and recovery.

Occasionally it is possible to deliver the data, validate, and then retract the data if errors are found, but introduce a single attribute that is updatable and recovery becomes impossible (or extremely messy at best).

Fact tables are the best candidates for this thinking because most well-designed facts would be inserts only, so as long as you can identify the ETL run in which the fact record was inserted (typically with an inserted-date or a batch id) recovery is not too difficult.

Dimensions on the other hand are nearly impossible unless every single attribute in the dimension falls into one of the following two categories:
•    Type-1 attributes that never change (arguably this is a Type-0 dimension).
•    Type-2 SCD’s without any expiration dates.
Moreover, if you deliver a dimension, find an error, and recover (delete the newly inserted dimension records), you run the risk of invalidating the relationships between the dimension table and relevant facts that were also delivered.

The best practice, and one that I follow strictly, is to have empty copies of all the tables in the data warehouse mirrored in the staging area.  Unlike the presentation area, these tables include relational constraints between the facts and dimensions, as well as between and dimensions and related bridge tables.  If the database supports enabling and disabling relational constraints, the constraints start off disabled.

At the end of the transformation, records are delivered to the staging copies of the production tables.  It may be necessary to pull relevant dimensional records from the presentation tables regardless of whether or not there are any changes to the dimensions.  Once the staging copies are completely loaded, re-enable the constraints (if supported) and run any other validation tests you can develop to ensure the data is ready for presentation to the user.  If, and only if, the staging area relational constraints are successfully re-enabled and the staged data has passed all of your validation tests should you deliver the records to the presentation area.

Kimball and Ross have often used a restaurant kitchen as the metaphor for the staging area and I’ve always liked the analogy.  Delivering directly to the presentation area is much like backing the produce truck up to the front door and tossing the food right on the table.  For food or data alike, the product you’re delivering is unwashed and may contain bugs.

An old joke asks the question:  What’s worse than biting into an apple and finding a worm?  The answer is; Biting into an apple and finding half a worm.  So what’s worse than opening up a report and finding bad data? Opening up the report and finding half the bad data, or of course swallowing it whole.

A Stitch in “Time”

Central to nearly every dimensional data model is one or more date
dimensions (aka time dimensions).  The benefits of the date (or time) dimension have been discussed at length since the genesis of dimensional modeling  in the 1990’s, so I won’t rehash them here. But for the uninitiated:

A date dimension is a table that contains a single row for each date represented in your fact table.  In addition to the surrogate key and an actual date, they also contain various attributes of dates (ex: day of week, day of month, quarter name, month name, etc) as well as attributes for representing multiple calendars (ex: fiscal calendar, academic calendar, etc.).  While it is true that many of these values can be calculated using built-in dbms date functions, these functions perform slowly and are different for each database vendor.

Relative Time Attributes

One set of attributes that I think are often overlooked are relative time attributes.  Relative to what? Usually to the current date (either today or the last date that the fact tables were updated).  Let’s start with the typical date dimension

dimDates

DateSKey: INTEGER [PK]
CalendarDate: DATE
DateDescription: VARCHAR(40)
Year: VARCHAR(4)
Month: VARCHAR(10)
(various other attributes)

Now if, from this table, I needed to filter on yesterday, it wouldn’t be too difficult to do a little math to find that date for yesterday.

SELECT
  *
FROM
  dimDates
WHERE
  CalendarDate = ( GetDate()-1 )

But what if I needed all the records from last week? How about last quarter?  What if I wanted to compare last quarter to the same quarter from last year?  Some of these calculations can be infuriatingly difficult, and don’t forget that the SQL necessary for SQL Server won’t be the same as the SQL for Oracle (or DB2 or mySQL…).

Enter; Relative time attributes. Let’s start with the most basic, RelativeCalendarDate:

dimDates

DateSKey: INTEGER [PK]
CalendarDate: DATE
DateDescription: VARCHAR(40)
Year: VARCHAR(4)
Month: VARCHAR(10)
(various other attributes)
RelativeCalendarDate: INTEGER

This integer column will be populated with a 0 for the current date, -1 for yesterday, -2 for the date before, and so on.  For handling future dates, it would also have a 1 for tomorrow, 2 for the day after, and so on. So every row in the table will have a number representing the number of days from to day, negative going back in time, positive going forward.

You would then add similar relative attributes for every level of granularity you represent in the date dimension:

dimDates

DateSKey: INTEGER [PK]
CalendarDate: DATE
DateDescription: VARCHAR(40)
Year: VARCHAR(4)
Month: VARCHAR(10)
(various other attributes)
RelativeCalendarDate: INTEGER
RelativeCalendarWeek: INTEGER
RelativeCalendarMonth: INTEGER
RelativeCalendarQuarter: INTEGER
RelativeCalendarYear: INTEGER
RelativeFiscalYear: INTEGER

So, similar to RelativeCalendarDate, RelativeCalendarWeek would have 0 for all 7 rows in the current week, -1 for all 7 rows in last week, -2 for all 7 rows in the prior week, +1 for all 7 rows in next week, etc..

An Example

Assuming that today was July 2, 2010, your week starts on Mondays, and your fiscal calendar begins in July, your table would look something like this:

Calendar

Date

Date
Description
(var.) Relative

Calendar

Date

Relative

Calendar

Week

Relative

Calendar

Month

Relative

Calendar

Quarter

Relative

Calendar

Year

Relative

Fiscal

Year

6/25/2010 Friday,
June 25, 2010
-7 -1 -1 -1 0 -1
6/26/2010 Saturday,
June 26, 2010
-6 -1 -1 -1 0 -1
6/27/2010 Sunday, June 27, 2010 -5 -1 -1 -1 0 -1
6/28/2010 Monday, June 28, 2010 -4 0 -1 -1 0 -1
6/29/2010 Tuesday, June 29, 2010 -3 0 -1 -1 0 -1
6/30/2010 Wednesday,
June 30, 2010
-2 0 -1 -1 0 -1
7/1/2010 Thursday, July 01, 2010 -1 0 0 0 0 0
7/2/2010 Friday, July 02, 2010 0 0 0 0 0 0
7/3/2010 Saturday, July 03, 2010 1 0 0 0 0 0
7/4/2010 Sunday, July 04, 2010 2 0 0 0 0 0
7/5/2010 Monday, July 05, 2010 3 1 0 0 0 0
7/6/2010 Tuesday, July 06, 2010 4 1 0 0 0 0
7/7/2010 Wednesday, July 07, 2010 5 1 0 0 0 0
7/8/2010 Thursday, July 08, 2010 6 1 0 0 0 0
7/9/2010 Friday, July 09, 2010 7 1 0 0 0 0

With the addition of relative time attributes to your date dimension, comparing ranges of dates at any level of granularity becomes simple. For example, let’s compare last month to the same month from last year.

SELECT
  D.Year,
  D.Month,
  SUM(F.Measure1)
FROM
  FactTable F
  INNER JOIN dimDates D ON F.DateSKey = D.DateSKey
WHERE
  D.RelativeCalendarMonth IN (-1, -13)
GROUP BY
  D.Year,
  D.Month

Keeping Them Up-To-Date

Of course, to make the relative dates useful, they’ll need to be updated every day.  Below we see a SQL statement that will take care of this for SQL Server. You can run this on a nightly schedule, preferably right at midnight if your warehouse is available 24 hours a day.  This SQL can be adapted to work with any DBMS.

UPDATE
  dimDates
SET
  RelativeCalendarYear = DATEDIFF(yyyy, GETDATE(), CalendarDate), -- RelativeYears
  RelativeCalendarQuarter = DATEDIFF(q, GETDATE(), CalendarDate), -- RelativeQuarters
  RelativeCalendarMonth = DATEDIFF(m, GETDATE(), CalendarDate), -- RelativeMonths
  RelativeCalendarWeek = DATEDIFF(ww, GETDATE(), CalendarDate), -- RelativeWeeks
  RelativeCalendarDay = DATEDIFF(d, GETDATE(), CalendarDate) -- RelativeDays
WHERE
  DateSKey > 0

You’ll note the Where clause at the end.  In my date dimensions, I have rows with negative surrogate key values which represent such things as “No Date” (eliminating the need for nulls), “Unknown Date”  (another replacement for null values) and “Invalid Date” (because some transactional systems I’ve pulled from allow 31
days in February).  If you don’t have rows with negative surrogate keys, then the Where clause wouldn’t be necessary for you.

Things To Try

  • The concept should work well for even more granular time dimensions.  If shifts are part of your time dimension, or even hours, then you should be able to create relative time attributes for these periods of time as well.
  • I had considered adding named attributes as well (ex. RelativeCalendarDate = -1, RelativeCalendarDateName = “Yesterday”), but I was unable to come up with a comprehensive naming scheme. If you can think of one, please let me know.

Conclusion

As you can see, relative time attributes are easy to add to any data warehouse and are simple to maintain. They’ll reduce your SQL development time when working with dates, and in many cases they’ll improve the performance of your queries. They are not entirely intuitive, however with a brief explanation, your query developers will pick up the concept easily.