February 5, 2011

Operational Data Stores (ODS)

Introduction
An operational data store (ODS) is an architectural component of a data warehouse that is used for immediate reporting on current operational data.  An ODS contains lightly transformed and lightly integrated operational data covering a short time window.  It is used for real-time and near-real-time reporting.

Unlike data marts, an ODS is not refreshed from the data warehouse history tables.  Rather, it is loaded directly from operational data, a staging area, or incoming files.  It can optionally serve as a data source for the data warehouse.

An ODS must be frequently refreshed so that it contains very current data.  An ODS can be updated daily, hourly, or even immediately after transactions on operational data.

Transformation and Integration
The update frequency and currency of the data in the ODS are directly related to the amount of transformation and integration that is performed on the data.  Choices are made by the development team based on how long various transformation processes will take to complete vs. how current the data must be for reporting.  For example, if real-time reports or dashboards require data within minutes or even seconds after data events, it may not be possible to do time-consuming transformation or integration processing.  Also, dimensions and other reference data in the data warehouse may not be as current as new operational data.

Some degree of transformation and/or integration is usually required for reports.  Bill Inmon defines five ODS classes[1].  The classes represent different levels of ease and speed of refresh vs. the degree of integration and transformation.  For example, a Class I ODS would simply consist of direct replication of operational data (no transformation), whereas a Class V ODS would consist of highly integrated and aggregated data (highly transformed).  A Class I ODS would be the quickest and simplest to refresh, while a Class V ODS would involve the most complex, time-consuming processing.

Refresh Options
One option for refreshing an ODS with very current data is to use the transaction logs of the operational database to update data replicated in the ODS.  Database systems use transaction logging to record all updates, inserts, and deletes to tables.  Transaction logs are normally used for rolling back invalid transactions or for applying changes that were not completed due to a system failure.  However, if the tables are replicated in the ODS, the transaction logs can also be used to refresh them.  The replicated tables can then be used as staging for reporting tables.  In SQL Server and Oracle, using transaction logs to update replicated data is called Change Data Capture (CDC).[2]  Those products provide system stored procedures and other tools to assist in applying transaction log changes to replicated data.
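
For example, in SQL Server the CDC feature is enabled with a couple of system stored procedures.  The following is a minimal sketch; the dbo.Orders source table and the ODS target are hypothetical:

-- Minimal sketch: enable CDC in SQL Server for a hypothetical dbo.Orders table.
-- The database itself must be enabled for CDC first.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;  -- NULL: no gating database role required

-- Changes are then read from the generated change table (e.g. cdc.dbo_Orders_CT)
-- and applied to the replicated table in the ODS.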

Another option for keeping current data in an ODS is the use of indexed views[3] (SQL Server) or materialized views[4] (Oracle).  Indexed views and materialized views are similar to regular views except that their results are physically stored, so they perform well when queried.  These high-performance views, created in the ODS schema, point at the operational data.  Views can provide a relatively simple way to keep an ODS very current.  They cannot be used, however, when it is necessary to access data on remote servers.
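
As a rough sketch, an Oracle materialized view in the ODS schema over an operational table might look like the following (all object names are hypothetical, and fast refresh requires a materialized view log on the source table).  A SQL Server indexed view is built similarly by creating a view WITH SCHEMABINDING and then adding a unique clustered index.

-- Minimal sketch (Oracle): keep an ODS copy of an operational table current
-- with a fast-refreshing materialized view.  All object names are hypothetical.
CREATE MATERIALIZED VIEW LOG ON oltp.orders WITH PRIMARY KEY;

CREATE MATERIALIZED VIEW ods.mv_orders
  REFRESH FAST ON COMMIT
AS
SELECT order_id, customer_id, order_date, order_total
FROM   oltp.orders;
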
File system mirroring works at a file system level and uses mirrored data to refresh the ODS.  For example, EMC’s proprietary Business Continuance Volume (BCV) disconnects the mirror so that it contains a snapshot of data at a given point in time.  After the snapshot has been used to update the ODS, the mirror is reconnected to the source data and brought back into sync by the system.

Update triggers can be created on operational tables to write to the ODS whenever the data is updated.  Triggers are infrequently used for ODS refresh because they require modifying the operational system to add the trigger code.  Also, because triggers can affect performance, they are not feasible when there is high-volume transaction processing.
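
For illustration, a SQL Server trigger of this kind might look like the following minimal sketch (all object names are hypothetical); the extra write on every transaction is exactly the overhead mentioned above:

-- Minimal sketch: copy inserted/updated operational rows to an ODS table.
-- All object names are hypothetical.
CREATE TRIGGER trg_orders_to_ods
ON dbo.Orders
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO ods.Orders_Current (order_id, customer_id, order_date, order_total)
    SELECT order_id, customer_id, order_date, order_total
    FROM   inserted;
END;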

When an ODS needs to be updated less frequently, conventional ETL processes can be used.  For example, if an ODS is only updated once daily, it may be feasible to export operational data to files which can then be quickly loaded into the ODS. 
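
For instance, a nightly extract file could be bulk loaded into an ODS staging table with something like the following sketch (SQL Server syntax; the file path, delimiters, and table name are hypothetical):

-- Minimal sketch: bulk load a nightly operational extract into ODS staging.
-- The file path, delimiters, and table name are hypothetical.
BULK INSERT ods.stg_orders
FROM 'D:\extracts\orders_daily.dat'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);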

ODS Tables
An ODS can contain its own staging tables as well as transformed tables for reporting.  It is common for an ODS to use staging tables separate from the rest of the data warehouse because the ODS is refreshed by different processes than the data warehouse.

Reporting tables may not be as fully integrated and transformed as data warehouse tables, because time-consuming ETL processing would prevent fast access to very current data.  As a result, dimensional modeling (star schema) is not always possible.

Reporting
The main purpose of an ODS is to provide reporting and querying on very current operational data.  Reports can only be created on very short time windows of data.  To query history, the data warehouse and application-specific data marts must be used instead of the ODS.

Reports may also be affected by the limited transformation and integration of data in an ODS.  For example, data may not contain the surrogate keys necessary for joins on dimensions in the data warehouse.  Data from multiple sources may not be completely integrated into consistent structures and attribute values.

Summary
An ODS can provide access to current operational data for reporting.  An ODS is loaded directly from operational data and not from the data warehouse history tables.  An ODS only contains a short time window of data.  If history is required, the ODS can be a data source for the data warehouse.

An ODS must balance the frequency of refresh, the degree of transformation and integration, and how current data in the ODS must be.  A number of refresh options should be considered based on application requirements.

An ODS can be a database or a schema within a database that contains both staging and reporting tables.  There may be limitations on the reporting tables compared to data mart tables because less transformation and integration is performed on the operational data.

July 22, 2010

Modeling the Data Warehouse

If Ralph Kimball were to publish a new revision of his book The Data Warehouse Toolkit, he should globally replace the words “data warehouse” with “data mart.”  When he wrote this book in the early 1990s, the nomenclature we use today hadn’t been fully established.  This has caused a lot of confusion in the modeling of data warehouses.

Kimball advocated the use of dimensional modeling, which utilizes mostly star schemas for analytical applications.  To put it simply, a star schema supports groupings on various dimensions for summarization and reporting.  For example, a fact table could contain a transaction date, while a date dimension table would contain various attributes for that date, including calendar year, fiscal year, quarter, and day of week.  Queries could join on this dimension to summarize on any of these date attributes.
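
A query against such a schema might look like the following sketch (the fact and dimension names are hypothetical):

-- Minimal sketch: summarize a fact table by fiscal year via the date dimension.
-- Table and column names are hypothetical.
SELECT d.fiscal_year,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.transaction_date_key
GROUP BY d.fiscal_year;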

The problem is that in modern data warehouses, we typically don’t directly query the main data warehouse.  Rather, we query various data marts that are derived from the data warehouse.  In most cases, it makes absolutely no sense to use a star schema when modeling the data warehouse itself.  When tables can contain upwards of a billion records, joining to multiple dimension tables is usually not feasible.

There is no "one size fits all" that is appropriate for any given data warehouse.  However, partitioned tables with partitioning and subpartitioning on the columns most frequently used for loading data warehouses usually makes sense.  Data warehouse tables can also be denormalized to avoid the performance issues of joins on very large tables when loading data marts.

Of course, modeling a data warehouse involves a lot, including various best practices and business-specific requirements.  Following are some of the main considerations when modeling a data warehouse:
  • Data should contain detail.  Summarization can be done at the data mart level where appropriate.  You can never “un-summarize” data, so the main data repository should contain as much detail as possible.
  • The data warehouse should be modeled on a time basis in most cases.  Data is non-volatile and reflects transactions that have happened at a point in time.  There may be adjustment transaction records included, but there are no adjustments to data per se.
  • The data warehouse should contain as much history as possible.  A large time window allows for more analytical possibilities such as trend analyses and historical reporting.
  • The data warehouse should be inclusive—not exclusive.  There should be no integrity constraints, and records should not be excluded because of bad or missing values.   Record exclusion can be done at the data mart level.  There may be records that have missing or bad values in some columns that can still be useful in some unforeseen analysis or application.  There are even cases where analyses can be skewed because of up-front data exclusion.
  • The data warehouse must be designed for performance.  Populating data marts or creating other output from the data warehouse could involve millions or even billions of records.  B-tree indexes are typically not used since they are only useful for selecting small subsets of records.  Partitioning is crucial and should be based on record selection in queries—not necessarily how the business logically organizes data.  Limited use of pre-joins (i.e. denormalization) is acceptable in a data warehouse.  Data is non-volatile, so data anomalies caused by updates are not a problem.  The use of pre-joins helps to avoid joins on extremely large tables when populating data marts.  

Data marts, which are created by extracting data from the main data warehouse, comprise the data actually used by users and applications.  These can be designed to meet business requirements.  They can use exclusion criteria and summarize data.  It is entirely appropriate to build data marts using dimensional modeling and star schemas.  However, keep in mind that although the star schema is extremely powerful and useful, it is not appropriate for all applications and analyses.  Always work closely with users and stakeholders when designing data marts.

July 15, 2010

SQL vs. Visual Query Tools

Since the early 1990s there have been many attempts to create graphical database query tools.  I myself have designed and built both desktop and web-based query tools.  I’ve also seen attempts by others, including large software companies, to develop query tools, data manipulation applications (e.g. ETL and BI products), forms, dashboards, and so on.  Interestingly, there still seems to be no substitute for an old command-line standby: Structured Query Language (SQL).

Of course I understand the value of visual interfaces, but they are not a panacea.  For example, take the following query:

SELECT column1, column2, column3, SUM(column4)
FROM table1
WHERE column1 BETWEEN '01-JAN-10' AND '30-JUN-10'
    AND column2 IS NOT NULL
    AND column4 > 10000
GROUP BY column1, column2, column3;

The above SQL command takes a few seconds to create.  It is understandable by someone without technical skills (perhaps with some minimal explanation).  It is straightforward.  It is simple.

Now compose the same query with a typical visual query tool:

  1. Find the correct table in a dropdown list of tables.
  2. Find column1, column2, column3, and column4 and drag them to some box on the screen.
  3. Double-click on column1 to open up a dialog.  Select something that says “range” or “selection criteria” or “column values” or whatever which opens up another dialog.
  4. In the new dialog select “range” and select start and end values ('01-JAN-10' and '30-JUN-10') from a list or manually type in the values.
  5. Save and close the open dialogs.
  6. Open another dialog on column2.
  7. Use the dialog to figure out how to exclude NULLS.  
  8. Save and close the dialog.
  9. Open another dialog for column4.
  10. Figure out how to only select values greater than 10,000. 
  11. Save and close the dialog.
  12. Find a button or link in the GUI that says something like “groupings” or “aggregations” and open the appropriate dialog.
  13. Create groupings on column1, column2, and column3.  
  14. Save and close the dialog.
  15. Find something in the GUI that says something like “functions” or “aggregate functions”.
  16. Add a SUM() function to column4.
  17. Save and close the dialog.
  18. Press Submit.

Was it really easier?  Keep in mind that each vendor’s product has its own GUI—there is no standard.  Also keep in mind that many vendors of these products also sell training, which suggests that perhaps their GUIs are not instantly intuitive.

SQL is the standard language of relational databases.  With small variations, it is ubiquitous.  It is easy to hire people with SQL skills.  It is easy for non-programmers to learn.  SQL can get somewhat complex for complicated queries, but that is also true of any query tool, graphical or otherwise.

Again, I understand the value of visual interfaces.  There are many useful applications that allow a user to access information with a few clicks.  There are wonderful dashboards, graphing tools, and so on.  But whenever serious data analysis or complex ETL is required, an enterprise shouldn’t rule out the most flexible and, in many cases, simplest tool for querying data.

Change isn’t always progress.  What works well tends to return.

June 19, 2010

Data Warehousing Tools

IT applications such as data warehousing and business intelligence (BI) have become very popular in recent years. New acronyms including ODS, ETL, DSS, OLAP, and others are frequently thrown around.  A mystique is created around these buzzwords, and businesses are attracted to the latest technology trends. Consequently, there is an army of vendors trying to sell products to those businesses.

Unfortunately, these vendors are persuading many that things like OLAP and BI are products—not applications, that an ODS is something you buy rather than build, and that ETL is a tool you purchase rather than a process. It is not necessarily a bad thing that there are products to help build these processes and applications, but it must be understood that products are not a substitute for knowledge. All of these processes require the stakeholders to understand the source data, what kinds of analyses they want to do, and the fastest, most economical, and most sensible way to accomplish that. There is not and never will be something you can buy to replace understanding. If a product can facilitate analytical business requirements, then it may be worth purchasing. But upon closer examination, that often turns out not to be the case.

For example, imagine a hypothetical health care insurer that wants to examine how it can reduce medical expenses for its members. At a certain level of management it is decided that a BI product will be purchased to provide analytical tools. Vendors selling BI products are called in to make their presentations. One or two technical folks are asked to attend, but they are under political pressure to agree with the bosses. Professional salespeople present well-prepared (though unrealistic) demos on sample data sets that show clever GUIs, well-formatted reports, and eye-pleasing trend graphics and dashboards. Sometimes incentives are distributed to the attendees. One vendor wins the purchase.

Now for the reality check:
  • Licensing the product is costly.
  • The product is not standalone and must run on top of an existing RDBMS. In other words, there are no savings from replacing another license.
  • Additional costs include product training for staff, ramp-up time for staff to become proficient enough to build meaningful processes, product support from the vendor, and mandatory product upgrades.
  • In addition to the work of building processes with the BI tool, considerable work has to be done up front in the RDBMS to prepare data to make it useable and available to the BI tool.
  • The new BI product requires data in a dimensional model (typically star schema). The staff doesn’t really understand dimensional modeling and requires further education (in addition to product training). If dimensional modeling is not well understood, even minor errors in data structures can cause inaccurate analyses and reporting.
  • The source data (in this example health care claims data) is complex and doesn’t fit well into a dimensional model.
  • BI tools are typically designed for groupings and aggregations (a/k/a slice-and-dice applications or cubes). More complex analyses and studies cannot be completed using the BI tool.
  • People start fudging to make deadlines and meet management expectations.
  • QC finds considerable anomalies in reports and output from the BI tools.
  • Problems are hidden and phony status reports about the implementation are presented up the management chain.
  • Staff members become fed up with the needless stress and long hours and seek employment elsewhere.
  • Yet more costs and delays.
  • After considerable expense and time, the original goal of purchasing the BI tool is not accomplished.
All of this may seem an exaggeration, but similar scenarios are being played out in many businesses and organizations in all industry sectors. Some who read this may have already experienced the pain of similar situations.  These problems can be avoided with some basic understanding.

Understand your business requirements. Although not all analytical requirements can be foreseen, key stakeholders should have a very good sense of the types of applications they will be pursuing. Many real-world analytical questions should be considered up front. Fancy presentations by sales persons may look good to those making purchase decisions, but they may be irrelevant to the actual analytical needs. Key business users such as analysts, statisticians, accountants, and others involved in research and informatics need to be involved in decision making concerning BI.

Understand your data. Sometimes the source data simply does not contain the information to fulfill certain analyses and queries. There is nothing magical about data. A thorough understanding of the content, population of variables, accuracy, and other attributes is necessary before any intelligence can be derived from it. There is no product or tool that can make data into something it is not.

Know your RDBMS. Most mainstream database systems such as Oracle or SQL Server already have powerful tools for manipulating data. In fact, that’s exactly what an RDBMS does. Just because it’s called “relational” doesn’t mean it is limited to normalized relational structures. Star schemas, snowflake schemas, cubes, and all sorts of data marts can be created with any mainstream database product. Database systems already contain powerful languages that are capable of any ETL. (In fact, expensive ETL tools are merely front-ends that generate code in the native RDBMS.) All mainstream database systems contain powerful SQL extensions to support virtually any type of query, analysis, or report. For very large data sets (e.g. billions of records), most add-on tools cannot utilize the special features of the RDBMS needed to provide reasonable performance.  Consult with your developers and DBAs before making purchase decisions.
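
As one small illustration, a typical slice-and-dice summary with subtotals can be produced directly with a standard SQL extension such as ROLLUP, with no add-on product involved (the table and column names are hypothetical):

-- Minimal sketch: subtotals and a grand total in one pass using ROLLUP.
-- Table and column names are hypothetical.
SELECT region,
       product_line,
       SUM(sales_amount) AS total_sales
FROM   sales_mart
GROUP BY ROLLUP (region, product_line);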

Consider minimalist solutions. Add-on tools always bring considerable complexity and frequently add unnecessary convolutions. Huge amounts of money and time can be invested in something that could have been accomplished much more easily. For example, a major BI implementation can sometimes be replaced with a few straightforward data marts, a little programming, and simple query tools. Less money. Less time. Faster results. More flexible. More scalable. More accurate analyses and reports. Sometimes the shortest distance between two points is a straight line.