– Jack Norris, vice president of marketing at MapR Technologies (www.mapr.com), says:
With the Internet now touching more than two billion people daily, every call, tweet, e-mail, download, or purchase generates valuable data. There is also a wealth of machine-generated data, such as sensor data, video images, and genomic data, that is growing at an even faster rate. Companies are increasingly relying on Hadoop to unlock the hidden value of this rapidly expanding data, and to drive increased growth and profitability. A recent IDC study confirmed that data is growing faster than Moore’s Law. The implication of this growth rate is that however you process data today, you will need a larger cluster to do it tomorrow.
Put another way, the speed of data growth has shifted the bottleneck to the network: it now takes longer to move data over the network than it takes to perform the analysis. A new computing paradigm is emerging to address this inefficiency: co-locating compute with the data so that only the results are shared over the network. The promise of Hadoop is the ability to analyze large amounts of data effectively with this new paradigm.
Organizations beginning the evaluation and selection of Hadoop need to understand the criteria that matter most to their business or activity. There are several distributions from which to choose. Key questions to ask include:
- How easy is it to use?
  - How easily does data move into and out of the cluster?
  - Can the cluster be easily shared across users, workloads, and geographies?
  - Can the cluster easily accommodate access, protection, and security while supporting large numbers of files?
- How dependable is the Hadoop cluster?
  - Can it be trusted for production and business-critical data?
  - How does the distribution help ensure business continuity?
  - Can the cluster recover data from user and application errors?
  - Can data be mirrored between different clusters?
- How does it perform?
  - Is processing limited to batch applications?
  - Does the namenode create a performance bottleneck?
  - Does the system use hardware efficiently?
For Hadoop to be effective for a broad group of users and workloads, it must be easy to use, provision, operate, and manage at scale. It should be easy to move data into and out of the cluster, provision cluster resources, and manage even very large Hadoop clusters with a small staff. It is advisable to look for real-time read/write data flows via the industry-standard Network File System (NFS) protocol. Most Hadoop distributions are also limited by the write-once Hadoop Distributed File System (HDFS): like a conventional CD-ROM, HDFS prevents files from being modified once they have been written, and files cannot be read before they are closed.
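The write-once constraint described above can be made concrete with a small Python sketch. This is a toy simulation of the semantics only, not actual HDFS or MapR code; the class name and paths are hypothetical:

```python
class WriteOnceFS:
    """Toy illustration of HDFS-style write-once semantics (a simulation,
    not real HDFS client code)."""

    def __init__(self):
        self._open = {}    # files still being written (not yet readable)
        self._closed = {}  # closed files: readable, but immutable forever

    def create(self, path):
        # A path can be written exactly once; re-creating it is an error.
        if path in self._open or path in self._closed:
            raise IOError(f"{path} already exists and cannot be rewritten")
        self._open[path] = []

    def append(self, path, data):
        # Data can only be added while the file is still open.
        if path not in self._open:
            raise IOError(f"{path} is not open for writing")
        self._open[path].append(data)

    def close(self, path):
        # Closing seals the file; from now on it is read-only.
        self._closed[path] = b"".join(self._open.pop(path))

    def read(self, path):
        # Files cannot be read before they are closed.
        if path in self._open:
            raise IOError(f"{path} is still being written")
        return self._closed[path]


fs = WriteOnceFS()
fs.create("/logs/day1")
fs.append("/logs/day1", b"event-1\n")
fs.close("/logs/day1")
print(fs.read("/logs/day1"))  # b'event-1\n'
# fs.create("/logs/day1")     # would raise: closed files cannot be rewritten
```

Under this model, "updating" a file means writing a new one, which is why workloads that need in-place modification or read-while-writing access look beyond stock HDFS.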
As data-analysis needs grow, so does the need to manage and utilize expensive cluster resources effectively. Organizations often want separate data sources and applications to leverage the same Hadoop cluster, and to segment that cluster by user group, project, or division. The ability to divide a physical cluster into multiple logical Hadoop clusters is therefore valuable, and a distribution should be designed to support multiple clusters and multi-cluster management. It is also critical to look for simple installation, provisioning, and manageability.
Data-processing demands are becoming increasingly critical, and they require a distribution that provides enterprise-class reliability and data protection. Hadoop provides replication to protect against data loss, but many applications and data sources also require snapshots for point-in-time recovery from end-user and application errors. Full business-continuity features, including remote mirroring, are also required in many data centers to meet recovery-time objectives across data centers.
Data center computing is going through one of the largest paradigm shifts in decades. Are you ready for the change? Are you ready for Hadoop?
About Jack Norris, vice president of marketing, MapR.
Jack has over 20 years of enterprise software marketing experience. He has demonstrated success from defining new markets for small companies to increasing sales of new products for large public companies. Jack’s broad experience includes launching and establishing Aster Data, driving Rainfinity (EMC) to a market leadership position, and leading marketing and business development for an early-stage cloud storage software provider.
MapR delivers on the promise of Hadoop, making managing and analyzing Big Data a reality for more business users. The award-winning MapR Distribution brings unprecedented dependability, speed, and ease of use to Hadoop, combined with data protection and business continuity, enabling customers to harness the power of Big Data analytics.