Don’t Forget the Metadata
March 11, 2011 Leave a comment
There is an old adage that suggests that a carpenter’s house is always the last to get fixed. Too busy working on other people’s houses, he never seems to get around to working on his own. Similarly, it seems that a data warehouse developer’s data is often neglected as well. I have seen this in several data warehouses, where there was little or no metadata being populated. This has always surprised me because metadata is not difficult to develop.
The bare minimum for metadata should be a list of the tables in your model, and the last time that each table has been successfully updated. With this, you can let the users know how up to date the data warehouse is. This information can be added to the user’s reports, a portal page, or on a dashboard.
This takes care of the users, but what about your needs as a developer? For that you’ll also want to keep a history of the ETL process in your metadata. Now, before you tell me that your ETL tool takes care of this, take a look at exactly what it provides. I like to keep a record of each process in my ETL run, recording the situation at the beginning of the run (start time and number of rows), the number of inserts, updates, and deletes, and the situation at the end of the run (end time and number of rows).
In the case of the number of rows (before and after), I had one instance where this was invaluable in identifying an error in the DBA’s keyboard/chair interface. There was a reporting error caused by a single missing row in a dimension table. When I compared the starting number of rows to the previous days ending number of rows, they should have matched, but they didn’t. Since the table lost a row between ETL runs, that left human error as the cause.
So what do you store in your metadata? I’d love to hear your ideas.