Clarity in Data Warehousing: Partitioning Data Warehouse Data

Introduction

W.H. Inman describes partitioning in data warehousing as "the breakup of data into separate physical units that can be handled independently." [1] For very large data sets, rows of data of the same type are separated into subsets called partitions. Partitions provide advantages for both database administration and for query performance.

Partitioning and Database Administration

Database administration tasks such as moving, replicating, exporting, and archiving data can be greatly facilitated by partitioning. An obvious fact is that it is easier to work with several small units than one massive unit.

For example, if a data warehouse contains a rolling 5-year window of sales order data, each month the the oldest partition (older than current month minus 5 years) can be archived and dropped without affecting the other partitions. No massive deletes. No index rebuilds. No regeneration of system statistics. Similarly, new partitions can be created and new incoming data can be inserted into those partitions without affecting the indexing or the system statistics of the older partitions.

Other partitioning features that can facilitate database administration include the ability to easily relocate partitions to different file systems and disks. For example, Oracle allows a DBA to easily move data between partitions and individual tables and between different tablespaces.

Partitioning and Query Performance

For data warehouse developers and users, partitioning is very important for improving the performance of queries and ETL against very large data sets. Queries can be run against separate partitions, and in modern RDBMS systems such as Oracle and SQL Server, intelligence is built-in so that queries against very large data sets can select only the relevant partitions.

Before partitioning technology was introduced in Oracle 8, data warehouses frequently broke very large data sets into separate tables. For example, instead of a very large SalesOrders table containing several years of data, smaller tables each containing one month of data could be created. Tables could be named SalesOrders199901, SalesOrders199902, SalesOrders199903, and so on, with the YYYYMM portion of the name indicating which month of data the table contains. Queries could be run against each monthly table (i.e. partition) for the relevant date ranges, and the separate results could be gathered with UNION operators. For example:

SELECT * FROM SalesOrders199901 WHERE RegionCode = 5572

UNION

SELECT * FROM SalesOrders199902 WHERE RegionCode = 5572

UNION

SELECT * FROM SalesOrders199903 WHERE RegionCode = 5572 ;

Alternatively, each result could be inserted into a temporary table, and the temporary tables can then be gathered with UNION operators. That may reduce system memory management requirements and further improve performance.

Either way, results would be returned much more quickly by querying only the relevant months rather than querying a very large table containing multiple years of data. This performance advantage would be realized in complex real world queries and over many partitions as well.

Now that Oracle Enterprise Edition includes partitioning capabilities, single tables can now be created containing multiple partitions. For example, instead of using separate monthly sales order tables as in the example above, the data could be put into one large partitioned table named SalesOrder. The SalesOrder table could contain multiple partitions such as P199901, P199902, P199903, and so on based on the OrderDate column. Partition P199901 would contain only data where OrderDate is in January, 1999, P199902 would contain February 1999 data, and so on. If a query against the SalesOrder table contains OrderDate in the WHERE clause, only the relevant partitions would be queried. Oracle calls this partition pruning.

SELECT * FROM SalesOrders WHERE OrderDate BETWEEN '01/01/1999' AND '03/31/1999'

For the above query, only the P199901, P199902, and P19993 partitions would be queried. Without partition pruning, many years of data may have to be scanned to find data with the appropriate date range. For queries that return large numbers of rows in the results, partition pruning provides much better query performance than B-tree indexes.

Frequently, the column value on which a table is partitioned is not useful in the query. Perhaps an entire 5 years of data needs to be queried, or perhaps there are other selection criteria in the WHERE clause that have nothing to do with the column used for partitioning. In these cases, better performance can be gained by writing the query against each partition.

SELECT * FROM SalesOrders PARTITION P199901 WHERE RegionCode = 5572 AND OrderType = 'V'

UNION

SELECT * FROM SalesOrders PARTITION P199902 WHERE RegionCode = 5572 AND OrderType = 'V'

UNION

SELECT * FROM SalesOrders PARTITION P199903 WHERE RegionCode = 5572 AND OrderType = 'V'

Notice that this query is very much like writing against separate tables (i.e. SalesOrders199901, SalesOrders199102, and SalesOrders199903). One of the main advantages of the partitioning feature in an RDBMS is to simplify queries, but this example demonstrates that that doesn't always turn out to be the case.

In addition to partition pruning, performance is sometimes improved by the use of parallel processing. Queries written to use parallelism can query multiple partitions at the same time. Even better performance can be gained by parallel processes against partitions stored on separate disks.

Partitioning with Oracle

Oracle has rich partitioning features including multiple partition types (range, list, hash), automatic partition pruning, and subpartitioning.

Subpartitioning is simply the partitioning of partitions. A subpartition can be created using a different column than the main partition. For example, the SalesOrder table which is partitioned on OrderDate can contain a subpartition on RegionCode which would further improve performance on the example query above.

Partitions are created using the CREATE TABLE syntax when creating a table. Partitions can be altered, dropped , renamed, truncated, or moved using the ALTER TABLE syntax. Data can be exchanged between partitions and other tables also using the ALTER TABLE syntax.

Partitioning with SQL Server

SQL Server implements partitioning differently. Initially, SQL Server implemented partitioning as partitioned views. This method utilized separate tables for each partition, but those tables are brought together in a view with UNION operations. First you create the tables (i.e. partitions) with the CREATE TABLE syntax.

CREATE TABLE SalesOrders199901

(OrderID int, CustomerID int, PartID int, OrderDate datetime, OrderAmt money,

CONSTRAINT CK_OrderDate199901 CHECK (OrderDate BETWEEN '01/01/1999' AND '01/31/1999') ,

CONSTRAINT PK_SalesOrders19991 PRIMARY KEY (OrderID, OrderDate) )

CREATE TABLE SalesOrders199902

(OrderID int, CustomerID int, PartID int, OrderDate datetime, OrderAmt money,

CONSTRAINT CK_OrderDate199902 CHECK (OrderDate BETWEEN '02/01/1999' AND '02/28/1999') ,

CONSTRAINT PK_SalesOrders19992 PRIMARY KEY (OrderID, OrderDate) )

CREATE TABLE SalesOrders199903

(OrderID int, CustomerID int, PartID int, OrderDate datetime, OrderAmt money,

CONSTRAINT CK_OrderDate199903 CHECK (OrderDate BETWEEN '03/01/1999' AND '03/31/1999') ,

CONSTRAINT PK_SalesOrders19993 PRIMARY KEY (OrderID, OrderDate) )

Then the partitioned view is created on the range tables using CREATE VIEW.

CREATE VIEW SalesOrders WITH SCHEMABINDING AS

SELECT OrderID, CustomerID, PartID, OrderDate, OrderAmt FROM SalesOrders199901

UNION ALL

SELECT OrderID, CustomerID, PartID, OrderDate, OrderAmt FROM SalesOrders199902

UNION ALL

SELECT OrderID, CustomerID, PartID, OrderDate, OrderAmt FROM SalesOrders199903

The view becomes an updateable view. Inserted data will automatically go into the correct partitions, and queries using OrderDate in the WHERE clause will automatically select only the relevant partitions.
In SQL Server 2005, real table partitioning was introduced, but it is awkward and more difficult to implement than partitioning in Oracle. Table partitioning in SQL Server is done in three steps.

· Create a partition function which defines the range or list values for each partition (CREATE PARTITION FUNCTION).

· Create a partition scheme which maps partition to file groups on disk (CREATE PARTITION SCHEME).

· Create a table that specifies the partition scheme and which columns are used in the partitioning (CREATE TABLE).

The complexity of implementing table partitioning in SQL Server is unfortunate and probably deters widespread use. Partitioned views are currently easier to implement and manage in SQL Server.

Summary

Partitioning data into smaller sets of rows is useful if not essential for large data warehouses and analytical databases. Partitioning can simplify location and movement of data for database administrators. It provides significant performance gains for queries and ETL processes especially when strategically placing partitions across file systems and when used in parallel processing. Oracle provides more extensive partitioning capabilities which are relatively easiy to implement. Partitioning in SQL Server has less features and is more difficult to implement. However, partitioned views are easy to implement in SQL Server and can provide most of the advantages of partitioning.

[1] W.H. Inman, Building the Data Warehouse, Second Edition (New York: John Wiley & Sons, Inc., 1996), 55.

Clarity in Data Warehousing

March 8, 2011

Partitioning Data Warehouse Data

1 comment: