February 7, 2014

Introduction to Data Warehousing is a PowerPoint presentation.  If you have PowerPoint, download this file to view it.

May 1, 2012

Back to Basics

Click the link below for a review of the basics of relational database theory.  This PDF document is unfinished and will be expanded in the near future.


March 21, 2012

What's in the Data?

Data warehouses are often built with some analytical applications in mind beforehand.  The content and quality of the data sources for the data warehouse are often secondary considerations.  Consequently, data warehouse users may be disappointed at the limitations of the data and frustrated at its inability to fulfill analytical and report requests.  A very important difference between data warehouse data and operational data is that operational data is designed for its intended purpose--a data warehouse is built from data that was designed for other purposes. 

For example, data for processing medical claims is specifically designed to contain the data elements required by a health plan to validate the claim and pay the healthcare provider. Downstream analyses are not necessarily considered in the design.  Much later the claims data is imported into a data warehouse, and someone attempts to use it for an epidemiological study.  Perhaps certain data elements that would be useful for the study are just not there.  Missing values that didn't matter when processing claims could skew the study results.  Maybe data from multiple sources couldn't be consistently integrated.  The data needed to answer some important questions simply doesn't exist.

It may be a bit of a gamble building a data warehouse without knowing exactly what you can and cannot do with it in advance.  However, that gamble can be mitigated by some careful research and analysis of data sources before investing.  Data dictionaries and/or metadata should be requested from the data providers.  Large samples of the data should be acquired so that the quality and population of relevant variables can be analyzed.  Investigate how consistent the data from multiple sources is and to what extent those sources can be integrated.

Data architects and developers should work closely with the analysts to provide technical support for this advance work. They can assist with importing sample data, creating schemas, writing various queries, and performing QC.  The logical architecture for the data warehouse should not be attempted until there is considerable understanding of the source data.

However, don't limit the possibilities inherent in the data to preconceived applications alone. W. H. Inmon has called data warehousing a heuristic process--a process of discovery.  For perceptive and creative users, there is usually a lot to be discovered.  Once data is understood, many unforeseen applications may emerge.  There may be many answers just waiting for the right questions.  Don't limit yourself or your organization.  Be open to the endless possibilities a well-designed, data-rich data warehouse can provide over time.

November 20, 2011

Rapidly Emerging Technology Series: Cloud Databases

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses cloud databases.

A growing trend in database systems is storing data in the cloud. The cloud metaphor comes from the flowchart symbol of a cloud used to represent data movement over the Internet--the details are not visible to the flowchart, they are simply "in the cloud."  Companies including Microsoft, Oracle, IBM and Amazon offer cloud storage services on vast server and storage complexes.

Among the advantages of storing data with a cloud service: the provider performs all administrative services, replicates data on multiple servers for instantly available backups, and automatically scales growing databases and applications.  These services could result in considerable cost savings for some organizations.  Database development can be done without a large commitment to hardware purchases and administrative staff, and purchase and hiring decisions can be postponed for later assessment.

For example, Microsoft offers a product called SQL Azure which is a cloud-based SQL Server service.  SQL Azure makes it possible to store and access SQL Server data on cloud-based servers as well as integrate cloud and local data in a hybrid database.  SQL Azure databases can contain SQL Server tables, indexes, stored procedures, and other common database objects.
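As a minimal, hedged illustration, ordinary T-SQL runs against a SQL Azure database much as it does against a local SQL Server; the table and index names below are hypothetical:

CREATE TABLE dbo.Claims
(ClaimID int PRIMARY KEY,   -- clustered by default; SQL Azure requires a clustered index on every table
 MemberID int,
 ClaimAmount money);

CREATE INDEX IX_Claims_MemberID ON dbo.Claims (MemberID);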

Cloud servers can be public or private.  A public cloud is a service purchased from a cloud services vendor.  A private cloud is set up by an organization with its own servers and storage.  A public cloud has the advantage of very sophisticated administrative and scaling capabilities that would be very complicated and expensive for smaller organizations to set up themselves.  A private cloud may be used when regulatory requirements demand controls above and beyond the security provided by public clouds.

A cloud hybrid database is one where data is stored on both cloud and local servers. This may be done for a number of reasons including replication of critical data on separate servers, separating processing of various data sources, providing separate access to various data marts, and other data distribution and access requirements.

Rapidly Emerging Technology Series: Self-Service BI

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses self-service BI.

A very rapidly evolving area of information access is called Self-service Business Intelligence or Self-service BI.  Business users want more flexible access to information than standard reports can provide, without having to rely on analysts who can write code.

For many years, companies like Cognos and MicroStrategy have provided BI tools that work on top of relational databases.  Typically, the database team would set up a star schema data mart, and BI developers would create a browser-based front end that would allow easy grouping and aggregation (slice-and-dice) access to the data with sums and counts.  BI products also provide tools to build dashboards that give very current views of data--typically in an operational data store (ODS).  Dashboards can also include data visualization in the form of graphs and gauges.

However, these tools are evolving to include much more sophisticated access for users.  Unlike older BI tools that sat on top of a pre-defined data mart, newer self-service BI products provide access to multiple data sources including enterprise databases as well as publicly accessible databases and subscription databases on the web.  The BI tool allows for integration of these diverse data sources at the user level for complex reports and graphs.

Most analyses involve more than single queries.  There is usually a discovery process which leads to more questions and subsequently more analyses.  Newer BI products support this discovery process.  Self-service BI allows an analysis to be built upon by easily changing variables, adding data sources, re-use of results, and so on.

Newer BI tools are being developed that also allow mobile access from a variety of devices. Users will increasingly need to access and visualize data from their smart phones and tablets.

Microsoft has invested heavily in Self-service BI.  Their SQL Server product contains tools that allow tight integration with Excel and all of its analytical and graphing capabilities, as well as with SharePoint, to promote various types of live data access throughout an organization.  SQL Server 2012, which will be released in the coming months, has a tool called Power View, a very flexible and powerful BI front end.  It supports integration of diverse data sources at the user level, self-service alerts, sophisticated graphing, document authoring with live data updates, and many other analytical tools.

(A pre-release version of SQL Server 2012 is available for free download at http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx.)

Rapidly Emerging Technology Series: Unstructured Databases

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses unstructured databases.

When data grows to many terabytes, exabytes, and zettabytes, it can no longer be managed or queried using traditional relational databases.  Examples are the massive amounts of data managed by companies like Google or Facebook.  Examples of data that can occupy massive amounts of storage in health care are genomics data and data from electronic patient monitoring systems.  Because of the massive amounts of data in these databases, they cannot be structured according to relational database models and are hence called unstructured databases.  The term big data is frequently used to describe unstructured databases.

A pioneer in developing new ways of managing big data was Google.  They created a framework for distributing and processing huge amounts of data over many nodes, including a storage system called the Google File System and a programming model called MapReduce for processing queries against the distributed data.

A popular big data solution based on Google’s framework is called Hadoop.  Versions of Hadoop have been adopted by Yahoo, Facebook, and Amazon.  Hadoop is being incorporated into commercial relational database products, and it is freely available as open source software.  Both Microsoft and Oracle have announced Hadoop integration with their newly released relational database products.

When large data providers Yahoo and Facebook implemented Hadoop, they added higher-level programming tools than MapReduce: Hive (Facebook) and Pig (Yahoo).  Hive's query language closely mimics SQL, while Pig uses a more procedural data-flow syntax; both translate queries into jobs that run across the distributed nodes.  Because big data databases use non-relational queries, they are frequently referred to as NoSQL databases.
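For illustration, a short HiveQL sketch (the table and columns are hypothetical) shows how closely Hive's syntax follows SQL while the work is executed as distributed jobs:

CREATE TABLE visits (patient_id INT, visit_date STRING, bill_amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- The aggregation is translated into MapReduce jobs that run across the cluster's nodes
SELECT patient_id, SUM(bill_amount) AS total_billed
FROM visits
GROUP BY patient_id;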

Unstructured databases are very good at quickly storing and accessing extremely large amounts of data.  However, they do not have the more complete functionality of relational database management systems.  For example, Hadoop does not support complex data models, complex analytical queries, referential integrity, and other RDBMS capabilities.  NoSQL databases only provide very simple data organization and simple text-based querying.

Rapidly Emerging Technology Series: Massively Parallel Processing

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses massively parallel processing.

Parallel processing splits a process into multiple threads so that they can run simultaneously.  In database systems, parallel processing allows a query to be executed against data distributed in different locations at the same time, with the results then combined.

Database systems with massively parallel processing (MPP) can distribute processes across hundreds of nodes.  A node is a separate server with its own software and storage.  A very large table could be distributed across the nodes so that a query can process against all of the nodes at the same time.  The results are then combined and returned to the requestor.

MPP systems have a controller node which does the work of distributing data and processes.  A developer could create DDL for a table without considering how the data would be distributed across the system, and the controller would automatically allocate storage on each of the nodes for this table.  When data is inserted, the system hashes it to create a balanced distribution across the nodes.  When the data is queried, the controller automatically converts the query into operations that run against all of the nodes simultaneously.
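Some MPP products also let the designer name the distribution column explicitly rather than relying on the controller's defaults.  A hedged sketch, resembling the dialect used by SQL Server's MPP edition (Parallel Data Warehouse); the table and column names are hypothetical:

CREATE TABLE FactSales
(SaleID bigint, CustomerID int, SaleDate date, SaleAmount money)
WITH (DISTRIBUTION = HASH(CustomerID));   -- rows are hashed on CustomerID and spread across the nodes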

MPP databases only make sense for data warehousing and OLAP.  There would be no performance gains from the MPP when performing OLTP operations such as inserting or updating individual records or querying for small sets of records.

Rapidly Emerging Technology Series: Column-Oriented Databases

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses column-oriented databases.

Data in a typical relational database is organized in tables by row.  The row paradigm is used for physical storage as well as the logical organization of data. 

Column-Oriented or column-based databases physically organize data by column while still presenting data as rows.  Organizing data by column makes it self-indexing since each column becomes the equivalent of an index key.  Each column has table and row identifiers so that columns can be combined to produce rows of data in a table.  Any column can be used in the WHERE clause of a query, the data elements can quickly be found in that column, and columns with the same row identifiers are assembled for the query results.

Microsoft SQL Server 2012 (to be released in 2012) has a new feature called Columnstore Indexes which provide column-oriented storage for tables.  Microsoft claims that these can speed up queries up to 10 times.  A major advantage of the new SQL Server Columnstore Indexes is that you can index all columns in a large fact table.  This enables very fast performance when joining foreign keys in very large fact tables to primary keys in dimensions for star joins.  With B-tree indexes, there are penalties for indexing too many columns.  In a typical fact table, most columns other than the primary key are foreign keys to dimension tables, and Columnstore indexes will allow all of those columns to be indexed using column-oriented storage.
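A hedged sketch of the SQL Server 2012 syntax (based on the pre-release documentation; the fact table and columns are hypothetical):

CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactSales_ColumnStore
ON dbo.FactSales (DateID, ProductID, StoreID, RegionID, SalesAmount);
-- All of the fact table's foreign keys and measures can be included in a single columnstore index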

Oracle 11g supports column-oriented storage of data but only on its proprietary Exadata storage servers or Exadata appliances.



Rapidly Emerging Technology Series: Database Appliances

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses database appliances.

A database appliance is an integrated, preconfigured package of RDBMS software and hardware.  Most major database vendors including Microsoft, Oracle, IBM, and Teradata package and sell database appliances.  Data warehouse appliances are the biggest selling database appliances.

Database systems utilize memory, I/O, processing cores, and storage for database processes, and they need to formulate execution plans that will utilize these resources efficiently.  Hardware configurations for database performance—particularly data warehousing—are not necessarily the same as configurations for other purposes.  In fact, sometimes database performance isn’t even considered when purchasing and configuring hardware.  In those situations, even the most experienced DBAs and systems administrators aren't always able to optimize systems to get satisfactory performance.

A database appliance is a pre-configured hardware and software solution.  Most database appliances are designed for specialized applications such as OLTP or data warehousing. The servers, storage, OS, and RDBMS software are integrated and optimized for performance.

Some database appliances utilize parallel processing to distribute workloads across server nodes. Multi-node systems can be share-everything, where multiple servers share the same storage, or share-nothing, where each server has its own storage.  Share-everything systems tend to be more expensive but allow the same data to be accessed by several servers. Share-nothing systems can distribute data from the same tables across multiple nodes so that queries can be processed in parallel.  A share-nothing system is useful for querying very large fact tables in a data warehouse.

Database appliances generally do not scale well outside of the initial configuration.  For example, you generally don’t add storage to a database appliance.  Data warehouse appliances are available to support from about 5 terabytes to hundreds of terabytes of data.

Database appliances can also be very costly.  In many situations, it may be possible to get satisfactory database performance with much less expensive hardware purchases.


Wikipedia article on Data Warehouse Appliances: http://en.wikipedia.org/wiki/Data_warehouse_appliance

September 19, 2011

The Dimensional Model

The dimensional model is a way of organizing data to facilitate queries and analyses.  When implemented correctly, queries are simplified and errors are less likely.  Dimensional modeling has very specific requirements without a lot of room for ambiguity.  The Data Warehouse Toolkit by Ralph Kimball is a definitive primer on dimensional modeling. This posting attempts to give a brief overview.

 A dimensional model consists of fact tables and dimension tables in what is called a star schema.  The star metaphor is derived from the typical graphic of a fact table in the center surrounded by multiple dimensions like points on a star.



The fact table typically represents a transaction, measure or event while dimension tables contain descriptive attributes for the fact.  For example, a prescription dispensing could exist in a fact table with the cost of the prescription being the actual “fact,” while dimensions could contain descriptive information about the patient, the pharmacy, the health plan, and the prescriber.  Selection and groupings can be done on any attribute of the dimension.  The patient dimension could contain DOB for selecting or grouping on age, gender for selecting or grouping by sex, and zip code for selecting or grouping by location.

This is not unlike common joins of transaction data with reference data to obtain descriptive attributes for the transactions.  However, dimensions can contain descriptions at multiple levels or hierarchies that would have been normalized in transaction processing.  For example, a drug dimension table can contain several levels of drug classifications, each of which can be used for selection or grouping in a query.
Fact Tables
A fact is an entity, event, transaction, or measure that can be represented as a single row in a relational database table.  It has a primary key consisting of a single column which uniquely identifies the fact.  A fact table is always in Third Normal Form (3NF).

The other columns of a fact table can contain one of three types of values:
  • A single descriptive value which is intransitively functionally dependent on the primary key.
  • A single-column foreign key which is intransitively functionally dependent on the primary key.
  • A numeric value containing the actual measure or fact.   For example, the money amount of a transaction or the cost of a product.
 The phrase intransitively functionally dependent is from relational database theory.  It means that the column can only contain one possible value which is determined by the primary key and is not dependent on the values of any other columns.

An example of a fact is a record of a patient visit to a doctor’s office.  (This is a simplified example used to explain dimensional modeling and is not likely to be complete enough for a real world application.)  The visit fact table is named FactVisit.  

FactVisit (FactVisitID, VisitDateID, VisitMinutes, ClinicianID, PatientID, ProblemID, CopayAmount, BillAmount)

FactVisitID is the primary key and uniquely identifies the patient visit.

VisitDateID is a foreign key referencing the primary key in a date dimension.  The date dimension contains various date attributes such as quarter, fiscal year, and so on.  (Date dimensions are explained later.)

VisitMinutes is a value containing the number of minutes of the patient visit. It is a descriptive attribute of the fact.  It is not a foreign key to a dimension table because it is simply a value with no additional attributes of its own.  A descriptive column with no additional attributes that doesn't reference a dimension table is called a degenerate dimension.

ClinicianID is a foreign key referencing the primary key in a clinician dimension containing attributes of doctors and nurse practitioners at the practice.

PatientID is a foreign key referencing the primary key in a patient dimension containing attributes of patients.

ProblemID is a foreign key that references the primary key in a problem dimension which contains various problem/diagnosis attributes.

CopayAmount and BillAmount are the facts.  These are the patient’s copay amount and the amount billed to the patient’s insurance respectively.

Note that each of the attribute columns is one of the three types of attributes previously described.  Again, it is important that there are no transitive dependencies.  For example, if there was a ProblemID and a DiagnosisID, then there may be a dependency between these two columns because a specific problem as defined by the practice may be assigned one or more diagnosis codes.
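Putting the pieces together, a minimal DDL sketch of this fact table might look like the following (the data types and the dimension table names are assumptions; the dimensions themselves are described in the next section):

CREATE TABLE FactVisit
(FactVisitID  int PRIMARY KEY,                                    -- uniquely identifies the visit
 VisitDateID  int NOT NULL REFERENCES DimDate (DateID),           -- foreign key to the date dimension
 VisitMinutes int,                                                 -- degenerate dimension
 ClinicianID  int NOT NULL REFERENCES DimClinician (ClinicianID),
 PatientID    int NOT NULL REFERENCES DimPatient (PatientID),
 ProblemID    int NOT NULL REFERENCES DimProblem (ProblemID),
 CopayAmount  money,                                               -- the facts (measures)
 BillAmount   money)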
Dimensions
Dimensions are values that have multiple descriptive attributes.  An easy to understand example is a date dimension.  For any given date, there are multiple attributes such as Year, Month, Day of Week, Fiscal Year, Quarter, Weekend vs. Weekday, whether the date is a holiday, and so on.  For example:

DimDate (DateID, Date, Year, Month, Fiscal Year, Quarter, BusinessDay)

FactVisit could be joined to DimDate to obtain attributes of the date.

SELECT d.Year, d.Month, f.PatientID, COUNT(*) AS Visits
FROM FactVisit f JOIN DimDate d ON d.DateID = f.VisitDateID
GROUP BY d.Year, d.Month, f.PatientID

This query returns the number of patient visits by year, month, and patient. Notice that we are not grouping on the actual date, but on the Year and Month attributes of the date in the dimension table.

A dimension is Second Normal Form (2NF) because transitive dependencies are allowed.  For example, the Quarter is transitively dependent on Month (i.e. the Month can be used to determine the Quarter).  If the data modeler chooses to further normalize a dimension table (i.e. to 3NF), then the star schema becomes a snowflake schema.  Kimball recommends “resisting the urge to snowflake” because of the unnecessary complexity and additional performance issues.  The fact that dimensions are not further normalized makes it simple to group and aggregate which is what dimensional modeling is all about.

Note that the DimDate dimension table can be used by any other fact table that needs to reference date attributes.  A dimension that can be used by multiple facts in multiple star schemas is called a conformed dimension.
Primary and Foreign Key Relationships
It is very important that any foreign key in a fact joins on the primary key in a dimension to return exactly one row from the dimension. Otherwise there would be duplication and/or erroneous sums and counts caused by the join. 

When assigning the dimension key as a foreign key in a fact, all details of the dimension must be correct for that individual fact record.  To demonstrate this requirement we will use a dimension named DimProblem.

DimProblem (ProblemID, ICD9, Snomed, ShortDescription, LongDescription)

DimProblem
ProblemID   ICD9    Snomed      ShortDescription   LongDescription
100223      401.0   10020781    Hypertension       Hypertension, malignant
100224      401.1   10015489    Hypertension       Hypertension, benign
100225      401.9   10015489    Hypertension       Hypertension, Unspecified

ICD9 and Snomed are standardized coding systems for medical diagnoses. If a patient visit is for hypertension, then there could be multiple ICD9 codes and Snomed codes for the problem associated with the visit.  To simply know that the problem is hypertension for a given record in a fact table is not enough.  The other details such as specific ICD9 and Snomed codes must also be known before inserting the fact record so that the correct ProblemID will be used.

Many-to-One Relationships
In the above example there are multiple ICD9 and Snomed values for one ShortDescription. A data modeler with a relational database modeling background may be tempted to normalize or snowflake the dimension to handle the many-to-one relationships.  However, a correctly designed star schema requires that the DimProblem table contain not one record for a given problem description, but rather one record for each unique combination of all columns.

In this case, facts that are associated with all of the relevant ICD9 or Snomed values can be selected or grouped using the ShortDescription.

SELECT f.ClinicianID, f.PatientID, COUNT(*) AS BPVisits
FROM FactVisit f JOIN DimProblem d ON f.ProblemID = d.ProblemID
WHERE d.ShortDescription = 'Hypertension'
GROUP BY f.ClinicianID, f.PatientID

In the above query, all ICD9 and Snomed codes with ShortDescription = ‘Hypertension’ will be selected.

Another common design error is to create the dimension so that there would have to be further join conditions on the dimension record.  For example, if the dimension key in a fact table references more than one record in the dimension table, then an additional qualifier in the WHERE clause would have to be added to a query to make sure that the right dimension row gets used.  As a rule, there shouldn’t be any additional qualifiers for a dimension other than the primary-foreign key relationship with the fact.  A given fact record should never be able to join on multiple records in a given dimension.
Star Joins
A star join is simply a set of joins of the foreign keys in the fact table with primary keys in the dimensions in order to select or group on the various attributes in the dimensions.

A full star join which utilizes the full star schema could look like this.

SELECT dc.ClinicianName,
       dpt.PatientName,
       dpr.ShortDescription AS Problem,
       SUM(fv.CopayAmount) AS TotalCopays,
       SUM(fv.BillAmount) AS TotalBilled
FROM FactVisit fv
JOIN DimClinician dc ON dc.ClinicianID = fv.ClinicianID
JOIN DimPatient dpt ON dpt.PatientID = fv.PatientID
JOIN DimProblem dpr ON dpr.ProblemID = fv.ProblemID
WHERE dpr.ShortDescription = 'Hypertension'
GROUP BY dc.ClinicianName, dpt.PatientName, dpr.ShortDescription

Developers and architects must be cautious of performance issues that could result from star joins on multiple dimensions.  This is especially true if a dimension is very large.  For example, in the above query there could be several million rows in DimPatient joining tens of millions of rows in FactVisit.  Since star joins typically involve upwards of 4 or 5 tables, the performance issues increase significantly.
Non-Dimensional Tables
When Kimball described the dimensional model, it predated the way many modern enterprise data warehouses are architected today.  The star schemas in his examples, which he called "data warehouses," would be called "data marts" by many data warehouse architects.  A data warehouse typically contains large denormalized history tables and reference data that are the data sources for data marts, and these tables rarely follow the dimensional model.

Data in Operational Data Stores (ODS) also may not be in the dimensional model because the time it takes to build or update a star schema may not meet the near-real-time data requirements of an ODS.  Other non-dimensional tables can include data used in statistical analyses (e.g. SAS data sets) that require different structures.

Typically, dimensional modeling and star schemas are only used in data marts that support specific applications.  Dimensional and non-dimensional tables can be organized and easily identified using separate database schemas and consistent naming conventions.

April 18, 2011

Slowly Changing Dimensions

Slowly changing dimensions in dimensional modeling are commonly misunderstood.  Chapter 6 of The Data Warehouse Toolkit by Ralph Kimball describes slowly changing dimensions in detail.  This posting attempts to provide a summary explanation.

A slowly changing dimension is a dimension for which any of its attributes (column values) can change over time.  The attributes change infrequently.  Since the whole purpose of a dimension table is to aggregate on its attributes, there must be a systematic way of handling the changes.

For example, suppose there is a patient dimension with health plan information columns for each patient.  If a patient changes health plans, then the values for the health plan related columns will change.  For example:

            Medical Record #   Plan Name           Plan ID       Effective Date
Before      5552223333         Tufts Health Plan   2233221122    1/1/2005
After       5552223333         BCBSMA              4442227777    1/1/2009

To demonstrate the different approaches to making changes to slowly changing dimensions, we will create an example table DimPatient.  The primary key is PatientID, and the plan name, plan ID, and effective date are the slowly changing attributes. (Of course, in a real-world patient dimension there would probably be many more attributes.)

DimPatient (PatientID, MRN, PlanName, PlanID, EffDate)

Ralph Kimball proposes three types of slowly changing dimensions where the definition of each type is based on how the data is updated:

Type I:  Simply update the changed attributes.

Type II: Add a new dimension record with a new primary key and the new attribute values.

Type III: Create separate columns for the original and current values (e.g. OriginalPlanID, CurrentPlanID)

There are obvious advantages and disadvantages for each of Kimball’s types of slowly changing dimensions.

Type I Slowly Changing Dimension

The advantage of this type of slowly changing dimension is that it is the easiest to update and maintain.  For any given change, there is only one record that needs to be updated and that is identified by the primary key.

UPDATE DimPatient
SET PlanName = 'BCBSMA', PlanID = '4442227777', EffDate = '1/1/2009'
WHERE PatientID = 111222333

The main disadvantage of a Type I slowly changing dimension is that history is lost.  Suppose there is a query to get amounts charged to insurance companies by year.  To demonstrate this we will include the example tables FactVisit and DimDate.

FactVisit (FactVisitID, VisitDateID, VisitMinutes, ClinicianID, PatientID, ProblemID, CopayAmount, BillAmount)

DimDate (DateID, Year, Month, Fiscal Year, Quarter, BusinessDay)

A star join uses DimPatient to see which health plan was billed by year.

SELECT dd.Year, dp.PlanName, SUM(fv.BillAmount)
FROM FactVisit fv
JOIN DimDate dd ON dd.DateID = fv.VisitDateID
JOIN DimPatient dp ON dp.PatientID = fv.PatientID
GROUP BY dd.Year, dp.PlanName

Because a Type I slowly changing dimension loses history, all amounts billed for our example patient will show up billed to 'BCBSMA' even though amounts billed prior to 2009 were billed to another plan.  For this reason, Type I slowly changing dimensions are rarely useful in a dimensional model.

Type II Slowly Changing Dimension

This type of slowly changing dimension simply adds a new record with a new primary key.  All things considered, this is the most useful type of slowly changing dimension because the history is automatically partitioned, and nothing else needs to be done when creating star joins.

If the DimPatient dimension is implemented as a Type II slowly changing dimension, another record would simply be added when a patient’s health plan changes.

PatientID   MRN          PlanName            PlanID        EffDate
1000222     5552223333   Tufts Health Plan   2233221122    1/1/2005
2000444     5552223333   BCBSMA              4442227777    1/1/2009

Note that the new record has a different value in the primary key column PatientID.  The MRN is used to identify the patient and is unchanged.  The “slowly changing” attributes PlanName and PlanID contain the new values.  The effective date is for reference purposes only and does not need to be included in the join conditions.
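A minimal sketch of the load step that implements this change (the new PatientID is assumed to be generated by the ETL process):

INSERT INTO DimPatient (PatientID, MRN, PlanName, PlanID, EffDate)
VALUES (2000444, '5552223333', 'BCBSMA', '4442227777', '1/1/2009');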

In the previous query, the billing amounts for the example patient will show up billed to ‘Tufts Health Plan’ for all charges prior to 2009, and to ‘BCBSMA’ for all charges since the beginning of 2009.  In Kimball’s words:

The Type Two slowly changing dimension automatically partitions history and an application must not be required to place any time constraints on effective dates in the dimension.[1]

One with an entity-relationship modeling background may protest that we are creating two records for the same entity with different primary keys (i.e., two PatientIDs).  However, for dimension tables the primary key uniquely identifies the combination of all attributes in the row.  The patient in this case is actually identified by the Medical Record Number (MRN), which contains the same value in both records.

Type III Slowly Changing Dimension

This type would contain “original” and “current” columns for the changed attribute.  In our example, we would add the following columns to our dimension:

OriginalPlanName
CurrentPlanName
OriginalPlanID
CurrentPlanID
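For illustration only, the schema change might look like this (the data types are assumptions):

ALTER TABLE DimPatient
ADD OriginalPlanName varchar(100),
    CurrentPlanName  varchar(100),
    OriginalPlanID   varchar(20),
    CurrentPlanID    varchar(20);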

Although this may be satisfactory for some simple data mart applications, it is a rarely useful type of slowly changing dimension with several disadvantages. 

A Type III slowly changing dimension does not partition by time.  We couldn't use effective date columns in queries because that would add additional qualifiers on the dimension in a star join, which is not correct design.  Even if we did use effective dates, they would only give us the starting point for the current value.

Also, adding more and more columns to the dimension could potentially get out of control.  Keep in mind that the whole purpose of dimensional modeling is to simplify queries.  Analysts and ETL developers would have to consider each changing attribute of each slowly changing dimension for every query.

Another disadvantage of a Type III slowly changing dimension is that it only allows for a single change to an attribute.  Suppose in our example that a patient changed health plans three or four times over the time window of the data.  A Type III slowly changing dimension would not work.

Using Changing Dimension Attributes in WHERE Clauses

Using changing attributes of slowly changing dimensions in the WHERE clause of a star join carries the risk of either losing data we want to include or including data that we don't want.  When we query on a previous value of a dimension attribute, we lose new data, and when we query on a newer value, we lose previous data.

For example, a dimension named DimDiagnosis contains diagnosis codes and descriptions.  The descriptions are slowly changing attributes.

DimDiagnosis (DiagnosisID, ICD9Code, Description)

Previous Values
DiagnosisID                 ICD9Code         Description
10000222                    401.1                Hypertension Benign

 New Values
DiagnosisID                 ICD9Code         Description
10000244                    401.1                Hypertension, benign

Note that the Description value has changed slightly.  If a WHERE clause condition on this dimension contains the previous description, it will lose data for the new description.

WHERE DimDiagnosis.Description = 'Hypertension Benign'

One way to address this problem is to include a column with a stable value.  For example, a column could be added to the DimDiagnosis dimension named DimDescription which would contain a value that doesn’t change when the Description column changes.

Previous Values
DiagnosisID                 ICD9Code         Description                  DimDescription
10000222                    401.1                Hypertension Benign  HYPERTENSION BENIGN

 New Values
DiagnosisID                 ICD9Code         Description                  DimDescription
10000244                    401.1                Hypertension, benign  HYPERTENSION BENIGN

Then the WHERE clause of a query could use the unchanging attribute DimDescription:

WHERE DimDiagnosis.DimDescription = 'HYPERTENSION BENIGN'

The changed attributes can still be used in groupings and aggregations.

SELECT d.Description, f.VisitDate, COUNT(*)
FROM FactVisits f
JOIN DimDiagnosis d ON d.DiagnosisID = f.DiagnosisID
WHERE d.DimDescription = 'HYPERTENSION BENIGN'
GROUP BY d.Description, f.VisitDate

Note that the changed Description value is used in the grouping while the unchanged DimDescription column is used in the WHERE clause. 

Restraint should be used when adding columns to any dimension, to avoid unnecessary complexity that complicates queries rather than simplifying them.


[1] Kimball, Ralph, The Data Warehouse Toolkit. New York: John Wiley & Sons, Inc., 1996, p. 103.

March 8, 2011

Partitioning Data Warehouse Data

Introduction
W. H. Inmon describes partitioning in data warehousing as "the breakup of data into separate physical units that can be handled independently." [1]  For very large data sets, rows of data of the same type are separated into subsets called partitions.  Partitions provide advantages for both database administration and for query performance.

Partitioning and Database Administration
Database administration tasks such as moving, replicating, exporting, and archiving data can be greatly facilitated by partitioning.  An obvious fact is that it is easier to work with several small units than one massive unit. 

For example, if a data warehouse contains a rolling 5-year window of sales order data, each month the oldest partition (older than current month minus 5 years) can be archived and dropped without affecting the other partitions.  No massive deletes.  No index rebuilds.  No regeneration of system statistics. Similarly, new partitions can be created and new incoming data can be inserted into those partitions without affecting the indexing or the system statistics of the older partitions.

Other partitioning features that can facilitate database administration include the ability to easily relocate partitions to different file systems and disks.  For example, Oracle allows a DBA to easily move data between partitions and individual tables and between different tablespaces.
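In Oracle, for example, these maintenance tasks reduce to single DDL statements (the table, partition, and tablespace names here are hypothetical):

ALTER TABLE SalesOrders DROP PARTITION P200601;                        -- archive and drop the oldest month
ALTER TABLE SalesOrders MOVE PARTITION P200602 TABLESPACE ARCHIVE_TS;  -- relocate a partition to another tablespace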

Partitioning and Query Performance
For data warehouse developers and users, partitioning is very important for improving the performance of queries and ETL against very large data sets.  Queries can be run against separate partitions, and in modern RDBMS systems such as Oracle and SQL Server, intelligence is built-in so that queries against very large data sets can select only the relevant partitions. 

Before partitioning technology was introduced in Oracle 8, data warehouses frequently broke very large data sets into separate tables.  For example, instead of a very large SalesOrders table containing several years of data, smaller tables each containing one month of data could be created.  Tables could be named SalesOrders199901, SalesOrders199902, SalesOrders199903, and so on, with the YYYYMM portion of the name indicating which month of data the table contains.  Queries could be run against each monthly table (i.e. partition) for the relevant date ranges, and the separate results could be gathered with UNION operators.  For example:
  
SELECT * FROM SalesOrders199901 WHERE RegionCode = 5572
UNION
SELECT * FROM SalesOrders199902 WHERE  RegionCode = 5572
UNION
SELECT * FROM SalesOrders199903 WHERE  RegionCode = 5572 ;

Alternatively, each result could be inserted into a temporary table, and the temporary tables can then be gathered with  UNION operators.  That may reduce system memory management requirements and further improve performance.

Either way, results would be returned much more quickly by querying only the relevant months rather than querying a very large table containing multiple years of data.  This performance advantage would be realized in complex real world queries and over many partitions as well. 

Now that Oracle Enterprise Edition includes partitioning capabilities, single tables can be created containing multiple partitions.  For example, instead of using separate monthly sales order tables as in the example above, the data could be put into one large partitioned table named SalesOrders.  The SalesOrders table could contain multiple partitions such as P199901, P199902, P199903, and so on based on the OrderDate column. Partition P199901 would contain only data where OrderDate is in January 1999, P199902 would contain February 1999 data, and so on.  If a query against the SalesOrders table contains OrderDate in the WHERE clause, only the relevant partitions would be queried.  Oracle calls this partition pruning.
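A minimal Oracle DDL sketch of such a range-partitioned table (the column data types are assumptions borrowed from the SQL Server example later in this posting):

CREATE TABLE SalesOrders
(OrderID NUMBER, CustomerID NUMBER, PartID NUMBER, OrderDate DATE, OrderAmt NUMBER(12,2))
PARTITION BY RANGE (OrderDate)
(PARTITION P199901 VALUES LESS THAN (TO_DATE('1999-02-01','YYYY-MM-DD')),
 PARTITION P199902 VALUES LESS THAN (TO_DATE('1999-03-01','YYYY-MM-DD')),
 PARTITION P199903 VALUES LESS THAN (TO_DATE('1999-04-01','YYYY-MM-DD')));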

SELECT * FROM SalesOrders WHERE OrderDate BETWEEN '01/01/1999' AND '03/31/1999'

For the above query, only the P199901, P199902, and P199903 partitions would be queried.  Without partition pruning, many years of data may have to be scanned to find data with the appropriate date range.  For queries that return large numbers of rows in the results, partition pruning provides much better query performance than B-tree indexes.

Frequently, the column value on which a table is partitioned is not useful in the query.  Perhaps an entire 5 years of data needs to be queried, or perhaps there are other selection criteria in the WHERE clause that have nothing to do with the column used for partitioning.  In these cases, better performance can be gained by writing the query against each partition. 

SELECT * FROM SalesOrders PARTITION (P199901) WHERE RegionCode = 5572 AND OrderType = 'V'
UNION
SELECT * FROM SalesOrders PARTITION (P199902) WHERE RegionCode = 5572 AND OrderType = 'V'
UNION
SELECT * FROM SalesOrders PARTITION (P199903) WHERE RegionCode = 5572 AND OrderType = 'V'

Notice that this query is very much like writing against separate tables (i.e. SalesOrders199901, SalesOrders199902, and SalesOrders199903).  One of the main advantages of the partitioning feature in an RDBMS is to simplify queries, but this example demonstrates that that doesn't always turn out to be the case.

In addition to partition pruning, performance is sometimes improved by the use of parallel processing.  Queries written to use parallelism can query multiple partitions at the same time.  Even better performance can be gained by parallel processes against partitions stored on separate disks.
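In Oracle, for example, a degree of parallelism can be requested with an optimizer hint (the degree of 4 here is arbitrary):

SELECT /*+ PARALLEL(SalesOrders, 4) */ *
FROM SalesOrders
WHERE OrderDate BETWEEN '01/01/1999' AND '03/31/1999';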

Partitioning with Oracle
Oracle has rich partitioning features including multiple partition types (range, list, hash), automatic partition pruning, and subpartitioning.  

Subpartitioning is simply the partitioning of partitions.  A subpartition can be created using a different column than the main partition.  For example, the SalesOrders table which is partitioned on OrderDate can contain subpartitions on RegionCode, which would further improve performance on the example query above.
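A hedged sketch of composite (range-hash) partitioning, where each monthly partition is further divided into four subpartitions hashed on RegionCode:

CREATE TABLE SalesOrders
(OrderID NUMBER, CustomerID NUMBER, PartID NUMBER, OrderDate DATE, OrderAmt NUMBER(12,2), RegionCode NUMBER)
PARTITION BY RANGE (OrderDate)
SUBPARTITION BY HASH (RegionCode) SUBPARTITIONS 4
(PARTITION P199901 VALUES LESS THAN (TO_DATE('1999-02-01','YYYY-MM-DD')),
 PARTITION P199902 VALUES LESS THAN (TO_DATE('1999-03-01','YYYY-MM-DD')));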

Partitions are created as part of the CREATE TABLE syntax.  Partitions can be altered, dropped, renamed, truncated, or moved using the ALTER TABLE syntax. Data can also be exchanged between partitions and other tables using the ALTER TABLE syntax.

Partitioning with SQL Server
SQL Server implements partitioning differently.  Initially, SQL Server implemented partitioning as partitioned views.  This method utilizes separate tables for each partition, which are brought together in a view with UNION operations. First you create the tables (i.e. partitions) with the CREATE TABLE syntax.

CREATE TABLE SalesOrders199901
(OrderID int, CustomerID int, PartID int, OrderDate datetime, OrderAmt money,
    CONSTRAINT CK_OrderDate199901 CHECK (OrderDate BETWEEN '01/01/1999' AND '01/31/1999'),
    CONSTRAINT PK_SalesOrders199901 PRIMARY KEY (OrderID, OrderDate))

CREATE TABLE SalesOrders199902
(OrderID int, CustomerID int, PartID int, OrderDate datetime, OrderAmt money,
    CONSTRAINT CK_OrderDate199902 CHECK (OrderDate BETWEEN '02/01/1999' AND '02/28/1999'),
    CONSTRAINT PK_SalesOrders199902 PRIMARY KEY (OrderID, OrderDate))

CREATE TABLE SalesOrders199903
(OrderID int, CustomerID int, PartID int, OrderDate datetime, OrderAmt money,
    CONSTRAINT CK_OrderDate199903 CHECK (OrderDate BETWEEN '03/01/1999' AND '03/31/1999'),
    CONSTRAINT PK_SalesOrders199903 PRIMARY KEY (OrderID, OrderDate))

Then the partitioned view is created on the range tables using CREATE VIEW.

CREATE VIEW SalesOrders WITH SCHEMABINDING AS
SELECT OrderID, CustomerID, PartID, OrderDate, OrderAmt FROM SalesOrders199901
UNION ALL
SELECT OrderID, CustomerID, PartID, OrderDate, OrderAmt FROM SalesOrders199902
UNION ALL
SELECT OrderID, CustomerID, PartID, OrderDate, OrderAmt FROM SalesOrders199903

The view becomes an updateable view.  Inserted data will automatically go into the correct partitions, and queries using OrderDate in the WHERE clause will automatically select only the relevant partitions.
In SQL Server 2005, real table partitioning was introduced, but it is awkward and more difficult to implement than partitioning in Oracle.  Table partitioning in SQL Server is done in three steps:
  • Create a partition function which defines the range or list values for each partition (CREATE PARTITION FUNCTION).
  • Create a partition scheme which maps partitions to filegroups on disk (CREATE PARTITION SCHEME).
  • Create a table that specifies the partition scheme and the column used for partitioning (CREATE TABLE).
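A minimal sketch of the three steps (the boundary values are examples, and ALL TO ([PRIMARY]) simply places every partition on the primary filegroup):

CREATE PARTITION FUNCTION pfOrderDate (datetime)
AS RANGE RIGHT FOR VALUES ('1999-02-01', '1999-03-01');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE SalesOrders
(OrderID int, CustomerID int, PartID int, OrderDate datetime, OrderAmt money)
ON psOrderDate (OrderDate);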

The complexity of implementing table partitioning in SQL Server is unfortunate and probably deters widespread use.  Partitioned views are currently easier to implement and manage in SQL Server.

Summary
Partitioning data into smaller sets of rows is useful if not essential for large data warehouses and analytical databases.  Partitioning can simplify the location and movement of data for database administrators.  It provides significant performance gains for queries and ETL processes, especially when partitions are strategically placed across file systems and when used in parallel processing.  Oracle provides more extensive partitioning capabilities which are relatively easy to implement.  Partitioning in SQL Server has fewer features and is more difficult to implement.  However, partitioned views are easy to implement in SQL Server and can provide most of the advantages of partitioning.


[1] W. H. Inmon, Building the Data Warehouse, Second Edition (New York: John Wiley & Sons, Inc., 1996), 55.