– Jack Norris, vice president, marketing, MapR Technologies (www.mapr.com), says:

With the Internet now touching more than two billion people daily, every call, tweet, e-mail, download, or purchase generates valuable data. There is also a wealth of machine-generated data, such as log files, sensor data, video images, and genomic data, that is growing at an even faster rate. Companies are increasingly relying on Hadoop to unlock the hidden value of this rapidly expanding data and to drive growth and profitability. A recent IDC study confirmed that data is growing faster than Moore’s Law. The implication of this growth rate is that however you process data today, you will need a larger cluster to do it tomorrow.

Put another way, the speed of data growth has shifted the bottleneck to the network: it now takes longer to move data over the network than to perform the analysis. Hadoop represents a new paradigm for effectively analyzing large amounts of data. This new computing paradigm colocates compute with the data, so that only the results are shared over the network.
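The shift can be illustrated with a minimal MapReduce-style sketch (a hypothetical simulation for illustration, not Hadoop's actual API): each "node" summarizes its own partition of the data locally, and only the small partial results cross the network to be merged.

```python
from collections import Counter

# Hypothetical partitions: each list stands for data resident on one node.
partitions = [
    ["error", "ok", "error"],   # data local to node 1
    ["ok", "ok", "error"],      # data local to node 2
]

def map_local(records):
    """Runs where the data lives; only a compact summary leaves the node."""
    return Counter(records)

# Only the small partial counts cross the "network" to be merged.
partials = [map_local(p) for p in partitions]
total = sum(partials, Counter())
print(dict(total))  # the merged result, never the raw records
```

The point of the sketch is that the raw records never move; the expensive step runs next to the data, and only the counts are shipped.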

When beginning the evaluation and selection of the various Hadoop distributions, organizations need to understand the criteria that mean the most to their business or activity. Key questions to ask include:

  •       How easy is it to use?
          – How easily does data move into and out of the cluster?
          – Can the cluster be easily shared across users, workloads, and geographies?
          – Can the cluster easily accommodate access, protection, and security while supporting large numbers of files?

  •       How dependable is the Hadoop cluster?
          – Can it be trusted for production and business-critical data?
          – How does the distribution help ensure business continuity?
          – Can the cluster recover data from user and application errors?
          – Can data be mirrored between different clusters?

  •       How does it perform?
          – Is processing limited to batch applications?
          – Does the namenode create a performance bottleneck?
          – Does the system use hardware efficiently?

In order for Hadoop to be effective for a broad group of users and workloads, it must be easy to use, provision, operate, and manage at scale. It should be easy to move data into and out of the cluster, provision cluster resources, and manage even very large Hadoop clusters with a small staff. It is advisable to look for real-time read/write data access via industry-standard file protocols such as NFS; this makes it dramatically easier to get data into and out of Hadoop without requiring special connectors. Most Hadoop distributions are also limited by the write-once Hadoop Distributed File System (HDFS). Like a conventional CD-ROM, HDFS prevents files from being modified once they have been written; new data can only be appended, and a file must be closed before appended updates can be read.
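The write-once constraint can be sketched as a toy file model (a simplification for illustration only, not the HDFS API): existing bytes can never be overwritten, data can only be appended, and appends become visible to readers only after the file is closed.

```python
class WriteOnceFile:
    """Toy model of HDFS-style write-once semantics (illustrative only):
    existing data cannot be modified, only appended to, and appended data
    is readable only after the file has been closed."""

    def __init__(self):
        self._committed = b""   # visible to readers
        self._pending = b""     # appended but not yet closed

    def append(self, data):
        self._pending += data

    def close(self):
        self._committed += self._pending
        self._pending = b""

    def read(self):
        return self._committed  # readers never see unclosed appends

    def overwrite(self, offset, data):
        raise PermissionError("write-once: existing data cannot be modified")

f = WriteOnceFile()
f.append(b"first write")
f.close()
f.append(b", appended later")
print(f.read())   # only the closed portion is visible so far
f.close()
print(f.read())   # now the appended data is readable too
```

A POSIX-style interface such as NFS, by contrast, allows files to be read and rewritten in place, which is why the text above flags it as a differentiator.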

As data analysis needs grow, so does the need to effectively manage and utilize expensive cluster resources. It is often useful for organizations to have separate data sources and applications leverage the same Hadoop cluster, and to segment a cluster by user groups, projects, or divisions. The ability to partition a physical cluster into multiple logical Hadoop clusters is very useful, and a distribution should also be designed to support multiple clusters and multi-cluster management. It is critical to look for simple installation, provisioning, and manageability.

Data processing demands are becoming increasingly critical, and these demands require the selection of a distribution that provides enterprise-class reliability and data protection. One area of concern is the single points of failure that exist, particularly in the NameNode and JobTracker functions. There is no high availability (HA) in Apache Hadoop today. While some HA capabilities are expected in the next major release of Hadoop, they cover only a single failover of the NameNode; there is no failback capability and no protection against multiple NameNode failures.
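The single-failover limitation can be sketched with a toy model (illustrative only; real NameNode HA involves shared edit logs and fencing, none of which is modeled here): one standby can be promoted when the active NameNode fails, but once it has been used there is nothing left to fail over to.

```python
class ToyNameNodeFailover:
    """Toy model of single-failover HA (illustrative only): a lone standby
    is promoted when the active NameNode fails, with no failback and no
    protection against a second failure."""

    def __init__(self):
        self.active = "namenode-1"
        self.standby = "namenode-2"

    def fail_active(self):
        if self.standby is None:
            raise RuntimeError("no standby left: the cluster is down")
        # Single failover: the standby is promoted exactly once.
        self.active, self.standby = self.standby, None

ha = ToyNameNodeFailover()
ha.fail_active()
print(ha.active)   # the former standby is now active
# A second failure finds no remaining standby to promote.
```

The gap the text describes is exactly the second call: with only one failover and no failback, a repeated NameNode failure leaves the cluster unprotected.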

Hadoop provides replication to protect against data loss, but for many applications and data sources, snapshots are required to provide point-in-time recovery against end-user and application errors. Full business continuity features, including remote mirroring, are also required in many data centers to meet recovery time objectives across data centers.
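The distinction matters because replication faithfully copies a bad write to every replica; only a point-in-time snapshot lets you roll back. A toy sketch (illustrative only; real systems use copy-on-write metadata rather than full copies):

```python
import copy

class SnapshotStore:
    """Toy key-value store with point-in-time snapshots (illustrative only).
    Replication protects against hardware loss, but only a snapshot can
    undo a bad write made by a user or a buggy application."""

    def __init__(self):
        self._data = {}
        self._snapshots = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def snapshot(self, name):
        # A real system would use copy-on-write instead of a deep copy.
        self._snapshots[name] = copy.deepcopy(self._data)

    def restore(self, name):
        self._data = copy.deepcopy(self._snapshots[name])

store = SnapshotStore()
store.put("report.csv", "good data")
store.snapshot("nightly")
store.put("report.csv", "corrupted by a buggy job")  # application error
store.restore("nightly")                             # point-in-time recovery
print(store.get("report.csv"))  # back to the snapshotted contents
```

Remote mirroring extends the same idea across data centers: a consistent copy elsewhere that can be promoted to meet a recovery time objective.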

Data center computing is going through one of the largest paradigm shifts in decades. Are you ready for the change? Are you ready for Hadoop?