July 22, 2010

Modeling the Data Warehouse

If Ralph Kimball were to publish a new revision of his book The Data Warehouse Toolkit, he should globally replace the words “data warehouse” with “data mart.”  When he wrote the book in the early 1990s, the nomenclature we use today hadn’t yet settled, and this has caused a lot of confusion in the modeling of data warehouses.

Kimball advocated the use of dimensional modeling, which mostly utilizes star schemas for analytical applications.  To put it simply, a star schema supports grouping on various dimensions for summarization and reporting.  For example, a fact table could contain a transaction date while a date dimension table would contain various attributes for that date, including calendar year, fiscal year, quarter, and day-of-week.  Queries can join on this dimension to summarize on any of these date attributes.
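To make the idea concrete, here is a minimal sketch of a star-schema join and summarization using pandas.  The table and column names (fact_sales, dim_date, date_key, and so on) are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical fact table: one row per sale transaction.
fact_sales = pd.DataFrame({
    "date_key": [20100701, 20100702, 20100702, 20100703],
    "amount":   [100.0, 250.0, 75.0, 300.0],
})

# Hypothetical date dimension: attributes for each calendar date.
dim_date = pd.DataFrame({
    "date_key":    [20100701, 20100702, 20100703],
    "quarter":     ["Q3", "Q3", "Q3"],
    "day_of_week": ["Thu", "Fri", "Sat"],
})

# Join the fact table to the dimension, then summarize by a
# dimension attribute rather than by the raw date key.
summary = (fact_sales.merge(dim_date, on="date_key")
           .groupby("day_of_week")["amount"].sum())
print(summary)
```

The same join could instead group by quarter or fiscal year without touching the fact table's layout, which is what makes the dimension table useful.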

The problem is that in modern data warehouses, we typically don’t query the main data warehouse directly.  Rather, we query various data marts that are derived from it.  In most cases, it makes no sense to use a star schema when modeling the data warehouse itself.  When tables can run upwards of a billion records, joining on multiple dimension tables is usually not feasible.

There is no one-size-fits-all model for a data warehouse.  However, partitioning and subpartitioning tables on the columns most frequently used when loading the data warehouse usually makes sense.  Data warehouse tables can also be denormalized to avoid the performance cost of joins on very large tables when loading data marts.
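The payoff of partitioning is that a load or extract touches only the partitions its predicate selects.  The following is a toy sketch of that pruning idea in plain Python, with a hypothetical table partitioned by year and subpartitioned by month; real databases implement this in the storage engine, not in application code:

```python
from collections import defaultdict

# Hypothetical warehouse "table" partitioned by year, subpartitioned
# by month: each (year, month) key holds one partition's rows.
partitions = defaultdict(list)

def load_row(row):
    # Route each record into its (year, month) partition on insert.
    partitions[(row["year"], row["month"])].append(row)

for r in [{"year": 2009, "month": 12, "amount": 10.0},
          {"year": 2010, "month": 1,  "amount": 20.0},
          {"year": 2010, "month": 2,  "amount": 30.0}]:
    load_row(r)

def scan(year, month=None):
    # Partition pruning: read only the partitions the predicate
    # selects, instead of scanning the whole table.
    keys = [k for k in partitions
            if k[0] == year and (month is None or k[1] == month)]
    return [row for k in keys for row in partitions[k]]

print(len(scan(2010)))     # only the two 2010 partitions are read
```

A query for 2010 never touches the 2009 partition at all, which is why the partitioning key should match how records are actually selected.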

Of course, modeling a data warehouse involves a great deal, including various best practices and business-specific requirements.   Following are some of the main considerations when modeling a data warehouse:
  • Data should contain detail.  Summarization can be done at the data mart level where appropriate.  You can never “un-summarize” data, so the main data repository should contain as much detail as possible.
  • The data warehouse should be modeled on a time basis in most cases.  Data is non-volatile and reflects transactions that happened at a point in time.  There may be adjustment transaction records included, but there are no adjustments to existing data per se.
  • The data warehouse should contain as much history as possible.  A large time window allows for more analytical possibilities such as trend analyses and historical reporting.
  • The data warehouse should be inclusive, not exclusive.  There should be no integrity constraints, and records should not be excluded because of bad or missing values.   Record exclusion can be done at the data mart level.  Records with missing or bad values in some columns can still be useful in some unforeseen analysis or application.  There are even cases where analyses can be skewed because of up-front data exclusion.
  • The data warehouse must be designed for performance.  Populating data marts or creating other output from the data warehouse could involve millions or even billions of records.  B-tree indexes are typically not used since they are only useful for selecting small subsets of records.  Partitioning is crucial and should be based on record selection in queries—not necessarily on how the business logically organizes data.  Limited use of pre-joins (i.e., denormalization) is acceptable in a data warehouse.  Data is non-volatile, so data anomalies caused by updates are not a problem.  The use of pre-joins helps to avoid joins on extremely large tables when populating data marts.
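The pre-join point in the last bullet can be sketched briefly.  In this hypothetical example, customer attributes are stored redundantly on each transaction record, so a mart load needs no join against a large customer table:

```python
# Pre-joined (denormalized) warehouse records: customer_region is
# copied onto every transaction row.  This redundancy is safe here
# because warehouse data is non-volatile -- rows are never updated,
# so update anomalies cannot occur.
warehouse_rows = [
    {"txn_id": 1, "amount": 100.0, "customer_id": 7, "customer_region": "West"},
    {"txn_id": 2, "amount": 250.0, "customer_id": 8, "customer_region": "East"},
    {"txn_id": 3, "amount": 75.0,  "customer_id": 7, "customer_region": "West"},
]

# Mart load: summarize by region in a single pass, no join required.
by_region = {}
for row in warehouse_rows:
    region = row["customer_region"]
    by_region[region] = by_region.get(region, 0.0) + row["amount"]

print(by_region)
```

The trade-off is storage for speed: each region value is stored many times, but a billion-row fact table never has to be joined to a customer table at load time.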

Data marts, which are created by extracting data from the main data warehouse, comprise the data that is utilized by users and applications.  These can be designed to meet business requirements.  They can use exclusion criteria and summarize data.   It is entirely appropriate to build data marts using dimensional modeling and star schemas.  However, keep in mind that although the star schema is extremely powerful and useful, it is not appropriate for all applications and analyses.  Always work closely with users and stakeholders when designing data marts.
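Putting the pieces together, here is a minimal sketch of a mart load that applies exclusion and summarization at the mart level, leaving the inclusive detail rows untouched in the warehouse.  The rows and column names are hypothetical:

```python
# Hypothetical warehouse detail rows.  The row with a missing region
# stays in the warehouse -- the warehouse is inclusive.
warehouse = [
    {"date": "2010-07-01", "region": "West", "amount": 100.0},
    {"date": "2010-07-01", "region": None,   "amount": 50.0},   # missing value kept
    {"date": "2010-07-02", "region": "West", "amount": 200.0},
    {"date": "2010-07-02", "region": "East", "amount": 300.0},
]

# Mart load: exclusion and summarization happen here, not upstream.
mart = {}
for row in warehouse:
    if row["region"] is None:   # exclusion criterion applied at the mart level
        continue
    mart[row["region"]] = mart.get(row["region"], 0.0) + row["amount"]

print(mart)
```

A different mart built from the same warehouse could keep the rows with missing regions, or summarize by date instead, precisely because no detail was discarded up front.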
