November 20, 2011

Rapidly Emerging Technology Series: Massively Parallel Processing

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses massively parallel processing.

Parallel processing splits a piece of work into multiple units that run simultaneously.  In database systems, parallel processing allows a query to run at the same time against data stored in different locations, with the partial results then combined into a single answer.

Database systems with massively parallel processing (MPP) can distribute processes across hundreds of nodes.  A node is a separate server with its own software and storage.  A very large table can be distributed across the nodes so that a single query runs against all of them at the same time, each node working only on its own portion of the table.  The results are then combined and returned to the requestor.
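To make the idea concrete, here is a minimal sketch in Python of a query fanned out to several nodes and then combined.  It is only an illustration, not how any particular MPP product is implemented: the nodes are simulated with in-memory lists and threads, and the names NODE_SLICES, partial_aggregate and parallel_average are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the slice of one large
# table that is stored on a single node.
NODE_SLICES = [
    [10, 20, 30],
    [40, 50],
    [60, 70, 80, 90],
]


def partial_aggregate(rows):
    """Work done locally on one node: return (sum, row count) for its slice."""
    return sum(rows), len(rows)


def parallel_average(node_slices):
    """Run the partial aggregate on every slice at once, then combine."""
    # One worker per simulated node, so all slices are processed concurrently.
    with ThreadPoolExecutor(max_workers=len(node_slices)) as pool:
        partials = list(pool.map(partial_aggregate, node_slices))

    # Combine step: merge the per-node partial results into one answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(parallel_average(NODE_SLICES))
```

Note that each node returns a sum and a count rather than its own average.  Averaging the per-node averages would give the wrong answer when the slices hold different numbers of rows, which is why the combine step needs both values.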

MPP systems have a controller node that does the work of distributing data and processes.  A developer can create DDL for a table without considering how the data will be distributed across the system, and the controller automatically allocates storage for the table on each of the nodes.  When data is inserted, the system hashes each row, typically on a distribution key, to decide which node stores it, which spreads the rows roughly evenly across the nodes.  When the data is queried, the controller translates the query into code that runs against all of the nodes simultaneously.
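The hashing step can be sketched in a few lines of Python.  This is an assumption-laden illustration rather than any vendor's actual algorithm: the choice of MD5, the modulo placement rule, and the names node_for_key and distribute are all hypothetical.

```python
import hashlib
from collections import defaultdict


def node_for_key(key, node_count):
    """Map a distribution key to a node number in the range 0..node_count-1."""
    # A stable hash (MD5 here, purely as an example) means the same key
    # always lands on the same node, on every run.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribute(rows, key_column, node_count):
    """Group rows by the node that their distribution key hashes to."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for_key(row[key_column], node_count)].append(row)
    return placement


if __name__ == "__main__":
    # Hypothetical table of 10,000 rows spread over 4 nodes.
    rows = [{"customer_id": i, "amount": i * 1.5} for i in range(10_000)]
    for node, node_rows in sorted(distribute(rows, "customer_id", 4).items()):
        print(node, len(node_rows))
```

With many distinct key values, the row counts per node come out close to equal.  A key with few distinct values, or one heavily skewed toward a single value, would put most rows on a few nodes and lose much of the benefit.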

MPP databases make sense mainly for data warehousing and OLAP, where queries scan large volumes of data.  OLTP operations, such as inserting or updating individual records or querying small sets of records, touch too little data to benefit from being spread across many nodes, so MPP offers little or no performance gain for them.
