Clarity in Data Warehousing: The Dimensional Model

The dimensional model is a way of organizing data to facilitate queries and analyses. When implemented correctly, queries are simplified and errors are less likely. Dimensional modeling has very specific requirements without a lot of room for ambiguity. The Data Warehouse Toolkit by Ralph Kimball is a definitive primer on dimensional modeling. This posting attempts to give a brief overview.

A dimensional model consists of fact tables and dimension tables in what is called a star schema. The star metaphor is derived from the typical graphic of a fact table in the center surrounded by multiple dimensions like points on a star.

The fact table typically represents a transaction, measure or event while dimension tables contain descriptive attributes for the fact. For example, a prescription dispensing could exist in a fact table with the cost of the prescription being the actual “fact,” while dimensions could contain descriptive information about the patient, the pharmacy, the health plan, and the prescriber. Selection and groupings can be done on any attribute of the dimension. The patient dimension could contain DOB for selecting or grouping on age, gender for selecting or grouping by sex and zip code for selecting or grouping by location.

This is not unlike common joins of transaction data with reference data to obtain descriptive attributes for the transactions. However, dimensions can contain descriptions at multiple levels or hierarchies that would have been normalized in transaction processing. For example, a drug dimension table can contain several levels of drug classifications, each of which can be used for selection or grouping in a query.

Fact Tables

A fact is an entity, event, transaction, or measure that can be represented as a single row in a relational database table. It has a primary key consisting of a single column which uniquely identifies the fact. A fact table is always Third Normal Form (3NF).

The other columns of a fact table can contain one of three types of values:

A single descriptive value which is intransitively functionally dependent on the primary key.
A single-column foreign key which is intransitively functionally dependent on the primary key.
A numeric value containing the actual measure or fact. For example, the money amount of a transaction or the cost of product.

The phrase intransitively functionally dependent is from relational database theory. It means that the column can only contain one possible value which is determined by the primary key and is not dependent on the values of any other columns.

An example of a fact is a record of a patient visit to a doctor’s office. (This is a simplified example used to explain dimensional modeling and is not likely to be complete enough for a real world application.) The visit fact table is named FactVisit.

FactVisit (FactVisitID, VisitDateID, VisitMinutes, ClinicianID, PatientID, ProblemID, CopayAmount, BillAmount)

FactVisitID is the primary key and uniquely identifies the patient visit.

VisitDateID is a foreign key referencing the primary key in a date dimension. The date dimension contains various date attributes such as quarter, fiscal year, and so on. (Date dimensions are explained later.)

VisitMinutes is a value containing the number of minutes of the patient visit. It is a descriptive attribute of the fact. It is not a foreign key to a dimension table because it is simply a value with no additional attributes of its own. A descriptive column with no additional attributes that doesn’t reference a dimension table is called a degenerative dimension.

ClinicianID is a foreign key referencing the primary key in a clinician dimension containing attributes of doctors and nurse practitioners at the practice.

PatientID is a foreign key referencing the primary key in a patient dimension containing attributes of patients.

ProblemID is a foreign key that references the primary key in a problem dimension which contains various problem/diagnosis attributes.

CopayAmount and BillAmount are the facts. These are the patient’s copay amount and the amount billed to the patient’s insurance respectively.

Note that each of the attribute columns is one of the three types of attributes previously described. Again, it is important that there are no transitive dependencies. For example, if there was a ProblemID and a DiagnosisID, then there may be a dependency between these two columns because a specific problem as defined by the practice may be assigned one or more diagnosis codes.

Dimensions

Dimensions are values that have multiple descriptive attributes. An easy to understand example is a date dimension. For any given date, there are multiple attributes such as Year, Month, Day of Week, Fiscal Year, Quarter, Weekend vs. Weekday, whether the date is a holiday, and so on. For example:

DimDate (DateID, Date, Year, Month, Fiscal Year, Quarter, BusinessDay)

FactVisit could be joined to DimDate to obtain attributes of the date.

SELECT d.Year, d.Month, f.PatientID, COUNT(*) AS Visits

FROM factVisits f JOIN dimDate d ON d.DateID = f.VisitDateID

GROUP BY d.Year, d.Month, f.PatientID

This query returns the number of patient visits by year, month, and patient. Notice that we are not grouping on the actual date, but on the Year and Month attributes of the date in the dimension table.

A dimension is Second Normal Form (2NF) because transitive dependencies are allowed. For example, the Quarter is transitively dependent on Month (i.e. the Month can be used to determine the Quarter). If the data modeler chooses to further normalize a dimension table (i.e. to 3NF), then the star schema becomes a snowflake schema. Kimball recommends “resisting the urge to snowflake” because of the unnecessary complexity and additional performance issues. The fact that dimensions are not further normalized makes it simple to group and aggregate which is what dimensional modeling is all about.

Note that the DimDate dimension table can be used by any other fact table that needs to reference date attributes. A dimension that can be used by multiple facts in multiple star schemas is called a conformed dimension.

Primary and Foreign Key Relationships

It is very important that any foreign key in a fact joins on the primary key in a dimension to return exactly one row from the dimension. Otherwise there would be duplication and/or erroneous sums and counts caused by the join.

When assigning the dimension key as a foreign key in a fact, all details of the dimension must be correct for that individual fact record. To demonstrate this requirement we will use a dimension named DimProblem.

DimProblem (ProblemID, ICD9, Snomed, ShortDescription, LongDescription)

DimProblem

ProblemID	ICD9	Snomed	ShortDescription	LongDescription
100223	401.0	10020781	Hypertension	Hypertension, malignant
100224	401.1	10015489	Hypertension	Hypertension, benign
100225	401.9	10015489	Hypertension	Hypertension, Unspecified

ICD9 and Snomed are standardized coding systems for medical diagnoses. If a patient visit is for hypertension, then there could be multiple ICD9 codes and Snomed codes for the problem associated with the visit. To simply know that the problem is hypertension for a given record in a fact table is not enough. The other details such as specific ICD9 and Snomed codes must also be known before inserting the fact record so that the correct ProblemID will be used.

Many-to-One Relationships

In the above example there are multiple ICD9 and Snomed values for one ShortDescription. A data modeler with a relational database modeling background may be tempted to normalize or snowflake the dimension to handle the many-to-one relationships. However, a correctly designed star schema requires that a record in the DimProblem table contains not one record for a given problem description, but rather one record for the unique combination of all columns.

In this case, facts that are associated with all of the relevant ICD9 or Snomed values can be selected or grouped using the ShortDescription.

SELECT f.ClinicianID, f.PatientID, COUNT(*) AS BPVisits

FROM FactVisit f JOIN DimProblem d ON f.ProblemID = d.ProblemID

WHERE d.ShortDescription = ‘Hypertension’

GROUP BY f.ClinicianID, f.PatientID

In the above query, all ICD9 and Snomed codes with ShortDescription = ‘Hypertension’ will be selected.

Another common design error is to create the dimension so that there would have to be further join conditions on the dimension record. For example, if the dimension key in a fact table references more than one record in the dimension table, then an additional qualifier in the WHERE clause would have to be added to a query to make sure that the right dimension row gets used. As a rule, there shouldn’t be any additional qualifiers for a dimension other than the primary-foreign key relationship with the fact. A given fact record should never be able to join on multiple records in a given dimension.

Star Joins

A star join is simply a set of joins of the foreign keys in the fact table with primary keys in the dimensions in order to select or group on the various attributes in the dimensions.

A full star join which utilizes the full star schema could look like this.

SELECT dc.ClinicianName,

dpt.PatientName,

dpr.ShortDescription AS Problem,

SUM(fv.Copay) AS TotalCopays,

SUM(fv.BillAmount) As TotalBilled

FROM FactVisit fv

JOIN DimClinicians dc ON dc.ClinicianID = fv.ClinicianID

JOIN DimPatients dpt ON dpt.PatientID = fv.PatientID

JOIN DimProblems dpr ON dpr.ProblemID = fv.ProblemID

WHERE dpr.ShortDescription = ‘Hypertension’

GROUP BY dc.ClinicName, dpt.PatientName, dpr.ShortDescription

Developers and architects must be cautious of performance issues that could result from star joins on multiple dimensions. This is especially true if a dimension is very large. For example, in the above query there could be several million rows in DimPatient joining tens of millions of rows in FactVisits. Since star joins are typically on upwards of 4 or 5 tables, the performance issues increase significantly.

Non-Dimensional Tables

When Kimball described the dimensional model, it was before the way many modern enterprise data warehouses are generally architected today. The star schemas in his examples that he called “data warehouses” would be called “data marts by many data warehouse architects” A data warehouse typically contains large denormalized history table and reference data that are the data sources for data marts, and these tables are rarely dimensional model.

Data in Operational Data Stores (ODS) also may not be in the dimensional model because the time it takes to build or update a star schema may not allow near-real-time data requirements of an ODS. Other non-dimensional tables can include data used in statistical analyses (e.g. SAS data sets) that requires different structures.

Typically, dimensional modeling and star schema are only used in data marts that support specific applications. The dimensional tables and non-dimensional table types can be organized and easily identified using separate schemas and consistent naming conventions.

Clarity in Data Warehousing

September 19, 2011

The Dimensional Model

No comments:

Post a Comment