Database | Bob's Business Intelligence Blog

Don’t Forget the Metadata

March 11, 2011 Leave a comment

There is an old adage that suggests that a carpenter’s house is always the last to get fixed. Too busy working on other people’s houses, he never seems to get around to working on his own. Similarly, it seems that a data warehouse developer’s data is often neglected as well. I have seen this in several data warehouses, where there was little or no metadata being populated. This has always surprised me because metadata is not difficult to develop.

The bare minimum for metadata should be a list of the tables in your model, and the last time that each table has been successfully updated. With this, you can let the users know how up to date the data warehouse is. This information can be added to the user’s reports, a portal page, or on a dashboard.

This takes care of the users, but what about your needs as a developer? For that you’ll also want to keep a history of the ETL process in your metadata. Now, before you tell me that your ETL tool takes care of this, take a look at exactly what it provides. I like to keep a record of each process in my ETL run, recording the situation at the beginning of the run (start time and number of rows), the number of inserts, updates, and deletes, and the situation at the end of the run (end time and number of rows).

In the case of the number of rows (before and after), I had one instance where this was invaluable in identifying an error in the DBA’s keyboard/chair interface. There was a reporting error caused by a single missing row in a dimension table. When I compared the starting number of rows to the previous days ending number of rows, they should have matched, but they didn’t. Since the table lost a row between ETL runs, that left human error as the cause.

So what do you store in your metadata? I’d love to hear your ideas.

Filed under Data Warehousing, Database Tagged with metadata

ETL from Source to Presentation in One Easy Step (Bad Idea)

October 3, 2010 Leave a comment

When working with exceptionally large data warehouses, you invariably find yourself looking to improve the performance of your ETL everywhere you can. One of the easiest ways to save time is to bypass your staging area altogether. If you can accomplish all of the transformations in a single step, then why not deliver the results directly to the presentation area? In this scenario, surely moving your millions of source rows into a staging area just to turn around and move them again into the presentation area is a waste of time right?

There can be no doubt that time and system resources are consumed by the staging process, so why waste that time mucking about in the staging area? Quite simply: validation and recovery.

Occasionally it is possible to deliver the data, validate, and then retract the data if errors are found, but introduce a single attribute that is updatable and recovery becomes impossible (or extremely messy at best).

Fact tables are the best candidates for this thinking because most well-designed facts would be inserts only, so as long as you can identify the ETL run in which the fact record was inserted (typically with an inserted-date or a batch id) recovery is not too difficult.

Dimensions on the other hand are nearly impossible unless every single attribute in the dimension falls into one of the following two categories:
• Type-1 attributes that never change (arguably this is a Type-0 dimension).
• Type-2 SCD’s without any expiration dates.
Moreover, if you deliver a dimension, find an error, and recover (delete the newly inserted dimension records), you run the risk of invalidating the relationships between the dimension table and relevant facts that were also delivered.

The best practice, and one that I follow strictly, is to have empty copies of all the tables in the data warehouse mirrored in the staging area. Unlike the presentation area, these tables include relational constraints between the facts and dimensions, as well as between and dimensions and related bridge tables. If the database supports enabling and disabling relational constraints, the constraints start off disabled.

At the end of the transformation, records are delivered to the staging copies of the production tables. It may be necessary to pull relevant dimensional records from the presentation tables regardless of whether or not there are any changes to the dimensions. Once the staging copies are completely loaded, re-enable the constraints (if supported) and run any other validation tests you can develop to ensure the data is ready for presentation to the user. If, and only if, the staging area relational constraints are successfully re-enabled and the staged data has passed all of your validation tests should you deliver the records to the presentation area.

Kimball and Ross have often used a restaurant kitchen as the metaphor for the staging area and I’ve always liked the analogy. Delivering directly to the presentation area is much like backing the produce truck up to the front door and tossing the food right on the table. For food or data alike, the product you’re delivering is unwashed and may contain bugs.

An old joke asks the question: What’s worse than biting into an apple and finding a worm? The answer is; Biting into an apple and finding half a worm. So what’s worse than opening up a report and finding bad data? Opening up the report and finding half the bad data, or of course swallowing it whole.

Filed under Data Warehousing, Database

A Stitch in “Time”

September 27, 2010 Leave a comment

Central to nearly every dimensional data model is one or more date
dimensions (aka time dimensions). The benefits of the date (or time) dimension have been discussed at length since the genesis of dimensional modeling in the 1990’s, so I won’t rehash them here. But for the uninitiated:

A date dimension is a table that contains a single row for each date represented in your fact table. In addition to the surrogate key and an actual date, they also contain various attributes of dates (ex: day of week, day of month, quarter name, month name, etc) as well as attributes for representing multiple calendars (ex: fiscal calendar, academic calendar, etc.). While it is true that many of these values can be calculated using built-in dbms date functions, these functions perform slowly and are different for each database vendor.

Relative Time Attributes

One set of attributes that I think are often overlooked are relative time attributes. Relative to what? Usually to the current date (either today or the last date that the fact tables were updated). Let’s start with the typical date dimension

dimDates

DateSKey: INTEGER [PK]

CalendarDate: DATE

DateDescription: VARCHAR(40)

Year: VARCHAR(4)

Month: VARCHAR(10)

(various other attributes)

Now if, from this table, I needed to filter on yesterday, it wouldn’t be too difficult to do a little math to find that date for yesterday.

SELECT
  *
FROM
  dimDates
WHERE
  CalendarDate = ( GetDate()-1 )

But what if I needed all the records from last week? How about last quarter? What if I wanted to compare last quarter to the same quarter from last year? Some of these calculations can be infuriatingly difficult, and don’t forget that the SQL necessary for SQL Server won’t be the same as the SQL for Oracle (or DB2 or mySQL…).

Enter; Relative time attributes. Let’s start with the most basic, RelativeCalendarDate:

dimDates

DateSKey: INTEGER [PK]

CalendarDate: DATE

DateDescription: VARCHAR(40)

Year: VARCHAR(4)

Month: VARCHAR(10)

(various other attributes)

RelativeCalendarDate: INTEGER

This integer column will be populated with a 0 for the current date, -1 for yesterday, -2 for the date before, and so on. For handling future dates, it would also have a 1 for tomorrow, 2 for the day after, and so on. So every row in the table will have a number representing the number of days from to day, negative going back in time, positive going forward.

You would then add similar relative attributes for every level of granularity you represent in the date dimension:

dimDates

DateSKey: INTEGER [PK]

CalendarDate: DATE

DateDescription: VARCHAR(40)

Year: VARCHAR(4)

Month: VARCHAR(10)

(various other attributes)

RelativeCalendarDate: INTEGER

RelativeCalendarWeek: INTEGER

RelativeCalendarMonth: INTEGER

RelativeCalendarQuarter: INTEGER

RelativeCalendarYear: INTEGER

RelativeFiscalYear: INTEGER

So, similar to RelativeCalendarDate, RelativeCalendarWeek would have 0 for all 7 rows in the current week, -1 for all 7 rows in last week, -2 for all 7 rows in the prior week, +1 for all 7 rows in next week, etc..

An Example

Assuming that today was July 2, 2010, your week starts on Mondays, and your fiscal calendar begins in July, your table would look something like this:

Calendar Date	Date Description	(var.)	Relative Calendar Date	Relative Calendar Week	Relative Calendar Month	Relative Calendar Quarter	Relative Calendar Year	Relative Fiscal Year
6/25/2010	Friday, June 25, 2010	…	-7	-1	-1	-1	0	-1
6/26/2010	Saturday, June 26, 2010	…	-6	-1	-1	-1	0	-1
6/27/2010	Sunday, June 27, 2010	…	-5	-1	-1	-1	0	-1
6/28/2010	Monday, June 28, 2010	…	-4	0	-1	-1	0	-1
6/29/2010	Tuesday, June 29, 2010	…	-3	0	-1	-1	0	-1
6/30/2010	Wednesday, June 30, 2010	…	-2	0	-1	-1	0	-1
7/1/2010	Thursday, July 01, 2010	…	-1	0	0	0	0	0
7/2/2010	Friday, July 02, 2010	…	0	0	0	0	0	0
7/3/2010	Saturday, July 03, 2010	…	1	0	0	0	0	0
7/4/2010	Sunday, July 04, 2010	…	2	0	0	0	0	0
7/5/2010	Monday, July 05, 2010	…	3	1	0	0	0	0
7/6/2010	Tuesday, July 06, 2010	…	4	1	0	0	0	0
7/7/2010	Wednesday, July 07, 2010	…	5	1	0	0	0	0
7/8/2010	Thursday, July 08, 2010	…	6	1	0	0	0	0
7/9/2010	Friday, July 09, 2010	…	7	1	0	0	0	0

With the addition of relative time attributes to your date dimension, comparing ranges of dates at any level of granularity becomes simple. For example, let’s compare last month to the same month from last year.

SELECT
  D.Year,
  D.Month,
  SUM(F.Measure1)
FROM
  FactTable F
  INNER JOIN dimDates D ON F.DateSKey = D.DateSKey
WHERE
  D.RelativeCalendarMonth IN (-1, -13)
GROUP BY
  D.Year,
  D.Month

Keeping Them Up-To-Date

Of course, to make the relative dates useful, they’ll need to be updated every day. Below we see a SQL statement that will take care of this for SQL Server. You can run this on a nightly schedule, preferably right at midnight if your warehouse is available 24 hours a day. This SQL can be adapted to work with any DBMS.

UPDATE
  dimDates
SET
  RelativeCalendarYear = DATEDIFF(yyyy, GETDATE(), CalendarDate), -- RelativeYears
  RelativeCalendarQuarter = DATEDIFF(q, GETDATE(), CalendarDate), -- RelativeQuarters
  RelativeCalendarMonth = DATEDIFF(m, GETDATE(), CalendarDate), -- RelativeMonths
  RelativeCalendarWeek = DATEDIFF(ww, GETDATE(), CalendarDate), -- RelativeWeeks
  RelativeCalendarDay = DATEDIFF(d, GETDATE(), CalendarDate) -- RelativeDays
WHERE
  DateSKey > 0

You’ll note the Where clause at the end. In my date dimensions, I have rows with negative surrogate key values which represent such things as “No Date” (eliminating the need for nulls), “Unknown Date” (another replacement for null values) and “Invalid Date” (because some transactional systems I’ve pulled from allow 31
days in February). If you don’t have rows with negative surrogate keys, then the Where clause wouldn’t be necessary for you.

Things To Try

The concept should work well for even more granular time dimensions. If shifts are part of your time dimension, or even hours, then you should be able to create relative time attributes for these periods of time as well.
I had considered adding named attributes as well (ex. RelativeCalendarDate = -1, RelativeCalendarDateName = “Yesterday”), but I was unable to come up with a comprehensive naming scheme. If you can think of one, please let me know.

Conclusion

As you can see, relative time attributes are easy to add to any data warehouse and are simple to maintain. They’ll reduce your SQL development time when working with dates, and in many cases they’ll improve the performance of your queries. They are not entirely intuitive, however with a brief explanation, your query developers will pick up the concept easily.

Filed under Data Warehousing, Database

Bob's Business Intelligence Blog

Don’t Forget the Metadata

ETL from Source to Presentation in One Easy Step (Bad Idea)

A Stitch in “Time”

Relative Time Attributes

An Example

Keeping Them Up-To-Date

Things To Try

Conclusion

Recent Posts

Archive

Tags