March 21, 2012

What's in the Data?

Data warehouses are often built with some analytical applications in mind beforehand.  The content and quality of the data sources for the data warehouse are often secondary considerations.  Consequently, data warehouse users may be disappointed by the limitations of the data and frustrated when it cannot fulfill their analytical and reporting requests.  A very important difference between data warehouse data and operational data is that operational data is designed for its intended purpose, whereas a data warehouse is built from data that was designed for other purposes.

For example, data for processing medical claims is specifically designed to contain the data elements required by a health plan to validate the claim and pay the healthcare provider.  Downstream analyses are not necessarily considered in the design.  Much later the claims data is imported into a data warehouse, and someone attempts to use it for an epidemiological study.  Perhaps certain data elements that would be useful for the study are just not there.  Missing values that didn't matter when processing claims could skew the study results.  Maybe data from multiple sources cannot be consistently integrated.  In each case, the data needed to answer some important questions is simply unavailable.

It may be a bit of a gamble building a data warehouse without knowing exactly what you can and cannot do with it in advance.  However, that gamble can be mitigated by some careful research and analysis of data sources before investing.  Data dictionaries and/or metadata should be requested from the data providers.  Large samples of the data should be acquired so that the quality and population of relevant variables can be analyzed.  Investigate how consistent data is from multiple sources and to what extent they can be integrated.
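As a rough illustration of checking how well relevant variables are populated in a sample, here is a minimal Python sketch.  The column names, the sample records, and the set of values treated as missing are all hypothetical; a real source would need its own conventions, ideally taken from its data dictionary.

```python
from collections import Counter

# Values treated as "not populated". This set is an assumption and should
# be adjusted to the conventions of each data source.
MISSING = {"", "NA", "N/A", "NULL", "UNKNOWN"}


def profile_population(rows, columns):
    """Return the fraction of rows in which each column holds a usable value."""
    populated = Counter()
    total = 0
    for row in rows:
        total += 1
        for col in columns:
            value = row.get(col)
            if value is not None and str(value).strip().upper() not in MISSING:
                populated[col] += 1
    return {col: (populated[col] / total if total else 0.0) for col in columns}


# Hypothetical sample of claims records, for illustration only.
sample_claims = [
    {"claim_id": "1", "diagnosis_code": "E11.9", "patient_zip": "60601"},
    {"claim_id": "2", "diagnosis_code": "", "patient_zip": "60614"},
    {"claim_id": "3", "diagnosis_code": "I10", "patient_zip": None},
    {"claim_id": "4", "diagnosis_code": "J45.909", "patient_zip": "N/A"},
]

fill_rates = profile_population(
    sample_claims, ["claim_id", "diagnosis_code", "patient_zip"]
)
for column, rate in fill_rates.items():
    print(f"{column}: {rate:.0%} populated")
```

A variable that is only half populated in the sample may be perfectly adequate for paying claims and still be too sparse for the study someone has in mind, which is exactly the kind of thing worth knowing before investing.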

Data architects and developers should work closely with the analysts to provide technical support for this advance work. They can assist with importing sample data, creating schemas, writing various queries, and performing QC.  The logical architecture for the data warehouse should not be attempted until there is considerable understanding of the source data.
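The kind of advance work described here might look something like the following sketch, which loads two small samples into an in-memory SQLite database and runs two QC queries: one for records that cannot be matched across sources, and one for matched records whose coding disagrees.  The table names, columns, and values are invented for illustration and do not come from any real source.

```python
import sqlite3

# In-memory database used as a scratch area for sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_a (member_id TEXT, gender TEXT);
    CREATE TABLE source_b (member_id TEXT, gender TEXT);
    """
)

# Hypothetical samples from two data providers.
conn.executemany(
    "INSERT INTO source_a VALUES (?, ?)",
    [("M001", "F"), ("M002", "M"), ("M003", "F")],
)
conn.executemany(
    "INSERT INTO source_b VALUES (?, ?)",
    [("M001", "Female"), ("M002", "M"), ("M004", "M")],
)

# QC 1: members in source A with no matching record in source B.
unmatched = conn.execute(
    """
    SELECT a.member_id
    FROM source_a AS a
    LEFT JOIN source_b AS b ON a.member_id = b.member_id
    WHERE b.member_id IS NULL
    ORDER BY a.member_id
    """
).fetchall()

# QC 2: matched members whose gender is coded differently in each source.
conflicts = conn.execute(
    """
    SELECT a.member_id, a.gender, b.gender
    FROM source_a AS a
    JOIN source_b AS b ON a.member_id = b.member_id
    WHERE a.gender <> b.gender
    ORDER BY a.member_id
    """
).fetchall()

print("Unmatched in source B:", unmatched)
print("Coding conflicts:", conflicts)
conn.close()
```

Findings like these (keys that do not line up, or the same attribute coded two different ways) are worth settling with the analysts before any logical architecture is drawn up.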

However, don't limit the possibilities inherent in the data to preconceived applications alone.  W.H. Inmon has called data warehousing a heuristic process, a process of discovery.  For perceptive and creative users, there is usually a lot to be discovered.  Once data is understood, many unforeseen applications may emerge.  There may be many answers just waiting for the right questions.  Don't limit yourself or your organization.  Be open to the endless possibilities a well-designed, data-rich data warehouse can provide over time.
