November 20, 2011

Rapidly Emerging Technology Series: Unstructured Databases

The Rapidly Emerging Technology Series highlights current technologies that are relevant to data warehouse professionals.  This posting discusses unstructured databases.

When data grows to many terabytes, petabytes, and beyond, it becomes difficult to manage or query using traditional relational databases.  Examples are the massive amounts of data managed by companies like Google or Facebook.  In health care, genomics data and data from electronic patient monitoring systems can occupy similarly massive amounts of storage.  Because data at this scale often cannot be structured according to relational database models, the systems that hold it are called unstructured databases.  The term big data is frequently used to describe unstructured databases.

Google was a pioneer in developing new ways of managing big data.  It created a framework for distributing and processing huge amounts of data over many nodes, including a storage system called the Google File System and a programming model called MapReduce for processing distributed data.
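To make the MapReduce idea concrete, here is a minimal single-machine sketch of the classic word count example.  It is not Google's or Hadoop's code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real framework would run the map and reduce steps in parallel on many nodes rather than in one Python process.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big nodes"]))
```

Because each document is mapped independently and each word is reduced independently, both steps can be spread across as many nodes as are available, which is what makes the model suitable for very large data sets.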

A popular big data solution based on Google’s published designs is Hadoop.  Versions of Hadoop have been adopted by Yahoo, Facebook, and Amazon.  Hadoop is freely available as open source software, and it is also being integrated with commercial relational database products.  Both Microsoft and Oracle have announced Hadoop integration with their newly released relational database products.

When large data providers Yahoo and Facebook implemented Hadoop, they added programming tools that work at a higher level than MapReduce: Hive (Facebook) and Pig (Yahoo).  Hive’s query language closely resembles SQL, while Pig uses its own data-flow scripting language; both translate queries into jobs that run across distributed nodes.  Because big data databases use non-relational queries, they are frequently referred to as NoSQL databases.
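As a rough illustration of what distributing a query means, the sketch below answers a SQL-style question such as "how many visits did each page get?" over data that is split across several nodes.  This is not how Hive or Pig are actually implemented; the function names and the "page" field are hypothetical, and the nodes are simulated as plain Python lists.

```python
from collections import Counter


def partial_counts(rows):
    """Runs on one node: count rows per page within that node's partition."""
    return Counter(row["page"] for row in rows)


def merge_counts(partials):
    """Runs on a coordinator: add up the per-node counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def distributed_group_count(partitions):
    """Simulate the whole query: one partial count per partition, then a merge."""
    return merge_counts(partial_counts(rows) for rows in partitions)


if __name__ == "__main__":
    partitions = [
        [{"page": "home"}, {"page": "about"}],  # rows held by node 1
        [{"page": "home"}],                     # rows held by node 2
    ]
    print(distributed_group_count(partitions))
```

Each node only scans its own rows, and only the small per-node summaries travel over the network, so the query can keep up as more data and more nodes are added.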

Unstructured databases are very good at quickly storing and accessing extremely large amounts of data.  However, they do not have the more complete functionality of relational database management systems.  For example, Hadoop does not support complex data models, complex analytical queries, referential integrity, or other RDBMS capabilities.  NoSQL databases typically provide only very simple data organization and simple text-based querying.
