Showing posts with label emerging technology. Show all posts

November 20, 2011

Rapidly Emerging Technology Series: Cloud Databases

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses cloud databases.

A growing trend in database systems is storing data in the cloud. The cloud metaphor comes from the flowchart symbol of a cloud used to represent data movement over the Internet (hence not visible in the flowchart; it is "in the cloud").  Companies including Microsoft, Oracle, IBM, and Amazon offer cloud storage services on vast server and storage complexes. 

An advantage of storing data with a cloud service is that the provider performs all administrative services, replicates data on multiple servers for instantly available backups, and automatically scales growing databases and applications.  These services can result in considerable cost savings for some organizations.  Database development can be done without a large commitment to hardware purchases and administrative staff, and purchase and hiring decisions can be postponed for later assessment.

For example, Microsoft offers a product called SQL Azure which is a cloud-based SQL Server service.  SQL Azure makes it possible to store and access SQL Server data on cloud-based servers as well as integrate cloud and local data in a hybrid database.  SQL Azure databases can contain SQL Server tables, indexes, stored procedures, and other common database objects.

Cloud servers can be public or private.  A public cloud is a service purchased from a cloud services vendor.  A private cloud is set up by an organization with its own servers and storage.  A public cloud has the advantage of very sophisticated administrative and scaling capabilities that would be very complicated and expensive for smaller organizations to set up themselves.  A private cloud may be used when regulatory or security requirements go above and beyond what public clouds provide. 

A cloud hybrid database is one where data is stored on both cloud and local servers. This may be done for a number of reasons including replication of critical data on separate servers, separating processing of various data sources, providing separate access to various data marts, and other data distribution and access requirements.

Rapidly Emerging Technology Series: Self-Service BI

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses self-service BI.

A very rapidly evolving area of information access is called Self-service Business Intelligence, or Self-service BI.  Business users want more flexible access to information than standard reports can provide, without having to rely on analysts who can write code.

For many years, companies like Cognos and MicroStrategy have provided BI tools that work on top of relational databases.  Typically, the database team would set up a star schema data mart, and BI developers would create a browser-based front end that allowed easy grouping and aggregation (slice-and-dice) access to the data with sums and counts.  BI products also provide tools to build dashboards that give very current views of data, typically from an operational data store (ODS).  Dashboards can also include data visualization in the form of graphs and gauges.
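The slice-and-dice idea is simple at its core: the same measures can be regrouped on demand by any dimension attribute. The sketch below illustrates it with a hypothetical fact table (product names, regions, and amounts are made up for the example); a real BI tool does this against a star schema with far more machinery.

```python
from collections import defaultdict

# Hypothetical star-schema data: a small fact table already joined to its
# dimensions, as a BI front end might materialize it for a user.
sales_facts = [
    {"product": "Widget", "region": "East", "amount": 120.0},
    {"product": "Widget", "region": "West", "amount": 80.0},
    {"product": "Gadget", "region": "East", "amount": 200.0},
    {"product": "Gadget", "region": "West", "amount": 150.0},
]

def slice_and_dice(facts, group_by):
    """Aggregate the 'amount' measure, grouped by any dimension attribute."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[group_by]] += row["amount"]
    return dict(totals)

# The same data regrouped on demand -- the essence of slice-and-dice.
print(slice_and_dice(sales_facts, "product"))  # {'Widget': 200.0, 'Gadget': 350.0}
print(slice_and_dice(sales_facts, "region"))   # {'East': 320.0, 'West': 230.0}
```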

However, these tools are evolving to include much more sophisticated access for users.  Unlike older BI tools that sat on top of a pre-defined data mart, newer self-service BI products provide access to multiple data sources including enterprise databases as well as publicly accessible databases and subscription databases on the web.  The BI tool allows for integration of these diverse data sources at the user level for complex reports and graphs.

Most analyses involve more than single queries.  There is usually a discovery process that leads to more questions and subsequently more analyses.  Newer BI products support this discovery process.  Self-service BI allows an analysis to be built upon by easily changing variables, adding data sources, reusing results, and so on.

Newer BI tools are being developed that also allow mobile access from a variety of devices. Users will increasingly need to access and visualize data from their smart phones and tablets.

Microsoft has invested heavily in Self-service BI.  Their SQL Server product contains tools that allow tight integration with Excel and all of its analytical and graphing capabilities, as well as with SharePoint to promote various types of live data access throughout an organization.  SQL Server 2012, which will be released in the coming months, includes a tool called Power View, a very flexible and powerful BI front end.  It supports integration of diverse data sources at the user level, self-service alerts, sophisticated graphing, document authoring with live data updates, and many other analytical tools.

(A pre-release version of SQL Server 2012 is available for free download at http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx.)

Rapidly Emerging Technology Series: Unstructured Databases

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses unstructured databases.

When data grows to many terabytes, petabytes, and beyond, it can no longer be managed or queried practically using traditional relational databases.  Examples are the massive amounts of data managed by companies like Google and Facebook.  In health care, genomics data and data from electronic patient monitoring systems can occupy massive amounts of storage.  Because data at this scale cannot practically be structured according to relational database models, these systems are called unstructured databases.  The term big data is frequently used to describe them.

A pioneer in developing new ways of managing big data was Google.  Google created a framework for distributing and processing huge amounts of data over many nodes, including a storage system called the Google File System and a programming model called MapReduce for processing data in parallel across those nodes.
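The MapReduce model has three conceptual steps: a map phase emits key-value pairs, a shuffle groups values by key, and a reduce phase combines each group. The toy single-process sketch below (the classic word-count example) shows the shape of the model; in a real system the map and reduce calls run in parallel on many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit one ("word", 1) pair per word in the input.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the values for each key into a final result.
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big cloud", "cloud data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'cloud': 2}
```

Because each map call sees only one document and each reduce call sees only one key's values, the framework can scatter both phases across hundreds of nodes without the phases coordinating with each other.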

A popular big data solution based on Google's design is called Hadoop.  Versions of Hadoop have been adopted by Yahoo, Facebook, and Amazon.  Hadoop is being incorporated into commercial relational database products, and it is freely available as open source software.  Both Microsoft and Oracle have announced Hadoop integration with their newly released relational database products.

When large data providers Yahoo and Facebook implemented Hadoop, they added higher-level programming tools on top of MapReduce: Hive (Facebook) and Pig (Yahoo).  Hive's query language resembles SQL, while Pig provides a data-flow scripting language; both compile into jobs that run across distributed nodes.  Because big data systems use non-relational query methods, they are frequently referred to as NoSQL databases.

Unstructured databases are very good at quickly storing and accessing extremely large amounts of data.  However, they do not have the more complete functionality of relational database management systems.  For example, Hadoop does not support complex data models, complex analytical queries, referential integrity, and other RDBMS capabilities.  NoSQL databases only provide very simple data organization and simple text-based querying.

Rapidly Emerging Technology Series: Massively Parallel Processing

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses massively parallel processing.

Parallel processing splits a process into multiple threads that can be performed simultaneously.  In database systems, parallel processing allows a query to be performed at the same time on data distributed across different locations, with the results then combined.

Database systems with massively parallel processing (MPP) can distribute processes across hundreds of nodes.  A node is a separate server with its own software and storage.  A very large table could be distributed across the nodes so that a query can process against all of the nodes at the same time.  The results are then combined and returned to the requestor.

MPP systems have a controller node that does the work of distributing data and processes.  A developer can create DDL for a table without considering how the data will be distributed across the system, and the controller automatically allocates storage on each of the nodes for that table.  When inserting data, the system hashes the data to create a balanced distribution across nodes.  When querying, the controller converts the query into code that runs against all of the nodes simultaneously.
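The two controller jobs described above, hashing rows to nodes on insert and scatter-gathering a query across nodes, can be sketched in a few lines. This is a toy single-process model (plain lists stand in for separate servers, and the table and column names are invented for the example), not how any particular MPP product works.

```python
# Toy model of an MPP controller: hash-distribute rows, then run a
# scatter-gather query where each "node" works only on its own rows.

NUM_NODES = 4
nodes = [[] for _ in range(NUM_NODES)]  # each list stands in for a server

def insert_row(row, distribution_key):
    # The controller hashes the distribution key to pick a node,
    # which yields a roughly balanced spread of rows across nodes.
    node_id = hash(row[distribution_key]) % NUM_NODES
    nodes[node_id].append(row)

def parallel_count(predicate):
    # Scatter: each node evaluates the predicate over only its own rows
    # (in a real MPP system these scans run simultaneously).
    partial_counts = [sum(1 for row in node if predicate(row)) for node in nodes]
    # Gather: the controller combines the partial results.
    return sum(partial_counts)

for i in range(1000):
    insert_row({"order_id": i, "amount": i % 50}, "order_id")

print(parallel_count(lambda row: row["amount"] > 40))  # 180
```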

MPP databases only make sense for data warehousing and OLAP.  There would be no performance gains from the MPP when performing OLTP operations such as inserting or updating individual records or querying for small sets of records.

Rapidly Emerging Technology Series: Column-Oriented Databases

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses column-oriented databases.

Data in a typical relational database is organized in tables by row.  The row paradigm is used for physical storage as well as the logical organization of data. 

Column-oriented (or column-based) databases physically organize data by column while still presenting data as rows.  Organizing data by column makes it self-indexing, since each column becomes the equivalent of an index key.  Each column carries table and row identifiers so that columns can be combined to produce rows of data in a table.  Any column can be used in the WHERE clause of a query: the matching data elements can quickly be found in that column, and the columns with the same row identifiers are then assembled for the query results.
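The sketch below makes the row-assembly step concrete: each column is stored separately, positions serve as the row identifiers, and a WHERE-style filter scans only the one column it needs before reassembling full rows. The table contents are invented for illustration; real column stores add compression and much more.

```python
# Toy column-oriented layout: each column is a separate list, and the
# position within each list serves as the row identifier.

columns = {
    "product": ["Widget", "Gadget", "Widget", "Sprocket"],
    "region":  ["East",   "West",   "West",   "East"],
    "amount":  [120,      200,      80,       50],
}

def select_rows(filter_column, value):
    # Scan only the filtered column to find matching row positions...
    row_ids = [i for i, v in enumerate(columns[filter_column]) if v == value]
    # ...then assemble full rows by pulling those same positions
    # from every column.
    return [{name: col[i] for name, col in columns.items()} for i in row_ids]

print(select_rows("product", "Widget"))
# [{'product': 'Widget', 'region': 'East', 'amount': 120},
#  {'product': 'Widget', 'region': 'West', 'amount': 80}]
```

Note that the other columns are never touched during the filter scan, which is why column stores shine when queries filter or aggregate a few columns of a wide table.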

Microsoft SQL Server 2012 (to be released in 2012) has a new feature called columnstore indexes, which provide column-oriented storage for tables.  Microsoft claims that these can speed up queries by up to 10 times.  A major advantage of the new SQL Server columnstore indexes is that all columns in a large fact table can be indexed.  This enables very fast performance when joining foreign keys in very large fact tables to primary keys in dimensions for star joins.  With B-tree indexes, there are penalties for indexing too many columns.  In a typical fact table, most columns other than the primary key are foreign keys to dimension tables, and columnstore indexes allow all of those columns to be indexed using column-oriented storage.

Oracle 11g supports column-oriented storage of data but only on its proprietary Exadata storage servers or Exadata appliances.

Rapidly Emerging Technology Series: Database Appliances

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses database appliances.

A database appliance is an integrated, preconfigured package of RDBMS software and hardware.  Most major database vendors, including Microsoft, Oracle, IBM, and Teradata, package and sell database appliances.  Data warehouse appliances are the biggest-selling database appliances.

Database systems utilize memory, I/O, processing cores, and storage for database processes, and they need to formulate execution plans that will utilize these resources efficiently.  Hardware configurations for database performance, particularly data warehousing, are not necessarily the same as configurations for other purposes.  In fact, sometimes database performance isn't even considered when purchasing and configuring hardware.  In those situations, even the most experienced DBAs and system administrators aren't always able to tune the systems to get satisfactory performance.

A database appliance is a pre-configured hardware and software solution.  Most database appliances are designed for specialized applications such as OLTP or data warehousing. The servers, storage, OS, and RDBMS software are integrated and optimized for performance.

Some database appliances utilize parallel processing to distribute workloads across server nodes. Multi-node systems can be shared-everything, allowing multiple servers to share storage, or shared-nothing, where each server has its own storage.  Shared-everything systems tend to be more expensive but allow the same data to be accessed by several servers. Shared-nothing systems can distribute data from the same tables across multiple nodes so that queries can be processed in parallel.  A shared-nothing system is useful for querying very large fact tables in a data warehouse.

Database appliances generally do not scale well outside of the initial configuration.  For example, you generally don't add storage to a database appliance.  Data warehouse appliances are available to support from about 5 terabytes to hundreds of terabytes of data.

Database appliances can also be very costly.  In many situations, it may be possible to get satisfactory database performance with much less expensive hardware purchases.


Wikipedia article on Data Warehouse Appliances: http://en.wikipedia.org/wiki/Data_warehouse_appliance