April 18, 2011

Slowly Changing Dimensions

Slowly changing dimensions in dimensional modeling are commonly misunderstood.  Chapter 6 of The Data Warehouse Toolkit by Ralph Kimball describes slowly changing dimensions in detail.  This posting attempts to provide a summary explanation.

A slowly changing dimension is a dimension for which any of its attributes (column values) can change over time.  The attributes change infrequently.  Since the whole purpose of a dimension table is to aggregate on its attributes, there must be a systematic way of handling the changes.

For example, suppose there is a patient dimension with health plan information columns for each patient.  If a patient changes health plans, then the values for the health plan related columns will change.  For example:

 Medical Record #        Plan Name                  Plan ID             Effective Date
Before   5552223333                Tufts Health Plan          2233221122     1/1/2005
After      5552223333                BCBSMA                   4442227777     1/1/2009

To demonstrate the different approaches to making changes to slowly changing dimensions, we will create an example table DimPatient.  The primary key is PatientID, and the plan name, plan ID, and effective date are the slowly changing attributes. (Of course, in a real-world patient dimension there would probably be many more attributes.)

DimPatient (PatientID, MRN, PlanName, PlanID, EffDate)

Ralph Kimball proposes three types of slowly changing dimensions where the definition of each type is based on how the data is updated:

Type I:  Simply update the changed attributes.

Type II: Add a new dimension record with a new primary key and the new attribute values.

Type III: Create separate columns for the original and current values (e.g. OriginalPlanID, CurrentPlanID)

There are obvious advantages and disadvantages for each of Kimball’s types of slowly changing dimensions.

Type I Slowly Changing Dimension

The advantage of this type of slowly changing dimension is that it is the easiest to update and maintain.  For any given change, there is only one record that needs to be updated and that is identified by the primary key.

UPDATE DimPatient
SET PlanName = ‘BCBSMA’, PlanID = ‘4442227777’, EffDate = ‘1/1/2009’
WHERE PatientID = 111222333

The main disadvantage of a Type I slowly changing dimension is that history is lost.  Suppose there is a query to get amounts charged to insurance companies by year.  To demonstrate this we will include example tables FactVisits and DimDate.

FactVisit (FactVisitID, VisitDateID, VisitMinutes, ClinicianID, PatientID, ProblemID, CopayAmount, BillAmount)

DimDate (DateID, Year, Month, Fiscal Year, Quarter, BusinessDay)

A star join uses DimPatient to see which health plan was billed by year.

SELECT dd.Year, dp.HealthPlan, SUM(fv.BillAmount)
FROM FactVisit fv
JOIN DimDate dd ON dd.DateID = fv.VisitDateID
JOIN DimPatient dp ON dp.PatientID = fv.PatientID
GROUP BY dd.Year, dp.HealthPlan

Because a Type I slowly changing dimension looses history, all amounts billed for our example patient will show up billed to ‘BCBSMA’ even though amounts billed prior to 2009 were billed to another plan.  For this reason, Type I slowly changing dimensions are rarely useful in a dimensional model.

Type II Slowly Changing Dimension

This type of slowly changing dimension simply adds a new record with a new primary key.  All things considered, this is the most useful type of slowly changing dimension because the history is automatically partitioned, and nothing else needs to be done when creating star joins.

If the DimPatient dimension is implemented as a Type II slowly changing dimension, another record would simply be added when a patient’s health plan changes.

PatientID
MRN
PlanName
PlanID
EffDate
1000222
5552223333
Tufts Health Plan
2233221122
1/1/2005
2000444
5552223333
BCBSMA
4442227777
1/1/2009

Note that the new record has a different value in the primary key column PatientID.  The MRN is used to identify the patient and is unchanged.  The “slowly changing” attributes PlanName and PlanID contain the new values.  The effective date is for reference purposes only and does not need to be included in the join conditions.

In the previous query, the billing amounts for the example patient will show up billed to ‘Tufts Health Plan’ for all charges prior to 2009, and to ‘BCBSMA’ for all charges since the beginning of 2009.  In Kimball’s words:

The Type Two slowly changing dimension automatically partitions history and an application must not be required to place any time constraints on effective dates in the dimension.[1]

One with entity-relationship modeling background may protest that we are creating two records for the same entity with different primary keys (i.e. two PatientID’s).  However, for dimension tables the primary key uniquely identifies the combination of all attributes in the row.  The patient in this case is actually identified by the Medical Record Number (MRN) which will contain the same value in both records.

Type III Slowly Changing Dimension

This type would contain “original” and “current” columns for the changed attribute.  In our example, we would add the following columns to our dimension:

OriginalPlanName
CurrentPlanName
OriginalPlanID
CurrentPlanID

Although this may be satisfactory for some simple data mart applications, it is a rarely useful type of slowly changing dimension with several disadvantages. 

A Type III slowly changing dimension does not partition by time.  We couldn’t use effective date columns in queries because we would be adding additional qualifiers on the dimension to be used in a star join which is not correct design.  Even if we did use effective dates,  that would only give us the starting point for the current value. 

Also, adding more and more columns to the dimension could potentially get out of control.  Keep in mind that the whole purpose of dimensional modeling is to simplify queries.  Analysts and ETL developers would have to consider each changing attribute of each slowly changing dimension for every query.

Another disadvantage of a Type III slowly changing dimension is that it only allows for a single change to an attribute.  Suppose in our example that a patient changed health plans three or four times over the time window of the data.  A Type III slowly changing dimension would not work.

Using Changing Dimension Attributes in WHERE Clauses

Using changing attributes of slowly changing dimensions in a WHERE clause of a star join has the risk of either loosing data we want to include or including data that we don’t want to include.  When we query on a previous value of a dimension attribute, we loose new data, and when we query on a newer value, we loose previous data. 

For example, a dimension named DimDiagnosis contains diagnosis codes and descriptions.  The descriptions are slowly changing attributes.

DimDiagnosis (DiagnosisID, ICD9Code, Description)

Previous Values
DiagnosisID                 ICD9Code         Description
10000222                    401.1                Hypertension Benign

 New Values
DiagnosisID                 ICD9Code         Description
10000244                    401.1                Hypertension, benign

Note that the Description value has changed slightly.  If a WHERE clause condition on this dimension contains previous description, it will lose data for the new description.

WHERE dimDiagnosis.Description = ‘Hypertension Benign’

One way to address this problem is to include a column with a stable value.  For example, a column could be added to the DimDiagnosis dimension named DimDescription which would contain a value that doesn’t change when the Description column changes.

Previous Values
DiagnosisID                 ICD9Code         Description                  DimDescription
10000222                    401.1                Hypertension Benign  HYPERTENSION BENIGN

 New Values
DiagnosisID                 ICD9Code         Description                  DimDescription
10000244                    401.1                Hypertension, benign  HYPERTENSION BENIGN

Then the where clause of a query could use the unchanging attribute DimDescription

WHERE dimDiagnosis.DimDescription = ‘HYPERTENSION BENIGN’

The changed attributes can still be used in groupings and aggregations.

SELECT d.Description, f.VisitDate, COUNT(*)
FROM FactVisits v
JOIN DimDiagnosis d ON d.DiagnosisID = f.DiagnosisID
WHERE dimDiagnosis.DimDescription = ‘HYPERTENSION BENIGN’
GROUP BY d.Description, f.VisitDate

Note that the changed Description value is used in the grouping while the unchanged DimDescription column is used in the WHERE clause. 

Restraint should be used when adding additional columns to any dimension so as to avoid adding unnecessary complexity complicating queries rather than simplifying them.


[1][1] Kimball, Ralph, The Data Warehouse Toolkit. New York: John Wiley & Sons, Inc., 1996, p. 103

12 comments:

  1. It’s great to come across a blog every once in a while that isn’t the same out of date rehashed material. Fantastic read.
    Data science Course Training in Chennai |Best Data Science Training Institute in Chennai
    matlab training chennai | Matlab course in chennai

    ReplyDelete
  2. Your very own commitment to getting the message throughout came to be rather powerful and have consistently enabled employees just like me to arrive at their desired goals.
    Aws training chennai | AWS course in chennai
    Rpa training in chennai | RPA training course chennai

    ReplyDelete
  3. This is a great inspiring article.I am pretty much pleased with your good work. You put really very helpful information. Keep it up. Keep blogging. Looking to reading your next post..
    DevOps Training in Chennai

    DevOps Course in Chennai

    ReplyDelete
  4. Incredible article. I'm managing a portion of these issues too..
    best interiors

    ReplyDelete

  5. A persuading conversation is worth remark. I do accept that you ought to distribute more on this topic, it probably won't be an untouchable issue however typically individuals don't discuss such issues. To the following! Much obliged!!

    best interiors

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
  7. yes i am fully decided on amid this text and that i simply indulgent pronounce that this article is deeply best and pretty informative article.i will make hermetically sealed to be studying your blog extra. You made a fine lessening but I can't seasoned occurring but surprise, what kind of the including together facet? !!!!!!thank you!!!!!!.. Fifa 19 Crack

    ReplyDelete
  8. thank you later than more for all of the knowledge you distribute,suitable pronounce. i was fantastically vivacious inside the article, it is pretty inspiring I ought to agree with on. Crack Download I as soon as touring you web site by means of now I always stumble upon appealing articles later this one.awesome task, I greatly admire that.Do preserve sharing! Regards, Norton Internet Security Product Key Generator

    ReplyDelete