Showing posts with label parallel processing. Show all posts

November 20, 2011

Rapidly Emerging Technology Series: Massively Parallel Processing

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses massively parallel processing.

Parallel processing splits a task into multiple threads that run simultaneously.  In database systems, parallel processing allows a query to run at the same time against data distributed across different locations and then combines the results.

Database systems with massively parallel processing (MPP) can distribute processes across hundreds of nodes.  A node is a separate server with its own software and storage.  A very large table can be distributed across the nodes so that a query runs against all of the nodes at the same time.  The results are then combined and returned to the requestor.
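To make the idea concrete, here is a minimal sketch in Python of a query that runs against every node at once and then combines the partial results.  The node contents, the `node_query` and `parallel_query` names, and the use of threads to stand in for separate servers are all assumptions made for illustration; a real MPP database does this inside the engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each inner list stands in for the slice of one
# large table that is stored on a separate node.
NODES = [
    [120, 45, 300],     # amounts held by node 0
    [75, 210],          # amounts held by node 1
    [15, 60, 90, 30],   # amounts held by node 2
]

def node_query(rows, threshold):
    """Run the query locally on one node: sum the amounts above threshold."""
    return sum(amount for amount in rows if amount > threshold)

def parallel_query(nodes, threshold):
    """Send the same query to every node at once, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda rows: node_query(rows, threshold), nodes))
    return sum(partials)

total = parallel_query(NODES, 50)
print(total)
```

Each node only ever looks at its own slice, and the final step is a cheap sum of one number per node.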

MPP systems have a controller node that distributes data and processes.  A developer can create DDL for a table without considering how the data will be distributed across the system, and the controller automatically allocates storage on each of the nodes for this table.  When data is inserted, the system hashes it to create a balanced distribution.  When the data is queried, the controller converts the query into code that runs against all of the nodes simultaneously.
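The hashing step can be sketched in a few lines of Python.  This is only an illustration of the general technique, not how any particular vendor does it: the cluster size, the choice of SHA-256, and the `node_for_key` and `distribute` names are assumptions.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # assumed cluster size

def node_for_key(key, num_nodes=NUM_NODES):
    """Map a distribution key to a node number using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def distribute(keys, num_nodes=NUM_NODES):
    """Count how many rows would land on each node."""
    return Counter(node_for_key(k, num_nodes) for k in keys)

placement = distribute(range(100_000))
for node, row_count in sorted(placement.items()):
    print(node, row_count)
```

Because the hash spreads keys evenly, each node ends up with close to a quarter of the rows, and the same key always maps to the same node.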

MPP databases make the most sense for data warehousing and OLAP.  MPP offers little or no performance gain for OLTP operations such as inserting or updating individual records or querying small sets of records, because each of those operations touches only a small amount of data.
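A rough way to see the difference is to count how many nodes take part in each kind of operation.  The sketch below is a toy model: the four-node cluster, the modulo placement rule, and the `point_lookup` and `full_scan_total` names are assumptions for illustration only.

```python
NUM_NODES = 4  # assumed cluster size

# Hypothetical layout: the row with integer key k lives on node k % NUM_NODES.
nodes = [dict() for _ in range(NUM_NODES)]
for key in range(1000):
    nodes[key % NUM_NODES][key] = {"id": key, "amount": key * 10}

def point_lookup(key):
    """OLTP-style read: only the node that owns the key does any work."""
    owner = key % NUM_NODES
    return nodes[owner].get(key), 1  # (row, number of nodes involved)

def full_scan_total():
    """OLAP-style read: every node scans its own slice of the table."""
    partials = [sum(row["amount"] for row in node.values()) for node in nodes]
    return sum(partials), len(partials)  # (total, number of nodes involved)

row, nodes_used_lookup = point_lookup(42)
total, nodes_used_scan = full_scan_total()
print(nodes_used_lookup, nodes_used_scan)
```

The single-record lookup involves one node no matter how large the cluster is, while the full scan splits its work across all of them.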

Rapidly Emerging Technology Series: Database Appliances

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses database appliances.

A database appliance is an integrated, preconfigured package of RDBMS software and hardware.  Most major database vendors, including Microsoft, Oracle, IBM, and Teradata, package and sell database appliances.  Data warehouse appliances are the best-selling database appliances.

Database systems use memory, I/O, processing cores, and storage for database processes, and they need to formulate execution plans that use these resources efficiently.  Hardware configurations for database performance, particularly for data warehousing, are not necessarily the same as configurations for other purposes.  In fact, sometimes database performance isn’t even considered when purchasing and configuring hardware.  In those situations, even the most experienced DBAs and systems administrators aren't always able to optimize systems to get satisfactory performance.

A database appliance addresses this problem with a preconfigured hardware and software solution.  Most database appliances are designed for specialized applications such as OLTP or data warehousing.  The servers, storage, OS, and RDBMS software are integrated and optimized for performance.

Some database appliances use parallel processing to distribute workloads across server nodes.  Multi-node systems can be share-everything, where multiple servers share storage, or share-nothing, where each server has its own storage.  Share-everything systems tend to be more expensive, but they allow the same data to be accessed by several servers.  Share-nothing systems can distribute data from the same tables across multiple nodes so that queries can be processed in parallel.  A share-nothing system is useful for querying very large fact tables in a data warehouse.
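As a small illustration of the share-nothing case, the Python sketch below groups a fact table by region.  Each node aggregates only the rows in its own storage, and a coordinator merges the subtotals.  The sample rows and the `local_group_by` and `merge_partials` names are hypothetical.

```python
from collections import Counter

# Hypothetical fact-table slices; each node owns its rows and its own storage.
NODE_SLICES = [
    [("east", 100), ("west", 50)],
    [("east", 25), ("north", 70)],
    [("west", 10), ("north", 5), ("east", 1)],
]

def local_group_by(rows):
    """Each node totals the amounts per region for the rows it stores."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals

def merge_partials(partials):
    """The coordinator adds the per-node subtotals together."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values key by key
    return dict(merged)

result = merge_partials(local_group_by(rows) for rows in NODE_SLICES)
print(result)
```

No node needs to read another node's rows; only the small per-region subtotals are passed to the coordinator.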

Database appliances generally do not scale well beyond the initial configuration.  For example, you generally don’t add storage to a database appliance.  Data warehouse appliances are available to support from about 5 terabytes to hundreds of terabytes of data.

Database appliances can also be very costly.  In many situations, it may be possible to get satisfactory database performance with much less expensive hardware purchases.


Wikipedia article on Data Warehouse Appliances: http://en.wikipedia.org/wiki/Data_warehouse_appliance