– Peter Quirk, director of product management, Sepaton, Inc., says:
Data is the essence of information technology and, increasingly, a resource that can help yield more value from an enterprise. It is also something that must be maintained flawlessly to meet regulatory and compliance requirements and SLAs.
Unfortunately, data is growing far faster than was predicted even a few years ago – and the challenge of data growth has been compounded by the expanding range of data types and data sources. The costs and complexities of meeting data protection, retention and compliance requirements – while trying to hold IT budgets steady and still extract value from the data – have become almost unbearable.
Deduplication, which has done much to tame the data explosion and has become an established part of IT operations, no longer meets the needs – and expectations – of IT departments, costing more than expected and reducing storage requirements by less than expected.
Modern technologies like object-based file systems are designed to move massive volumes of data in parallel, but too many dedupe engines are built on single-stream, single-node architectures that use older, traditional file systems.
Traditional deduplication architectures work by calculating a hash value for each chunk of data and looking it up in an index to see whether the chunk is unique or can be replaced by a pointer. Even when held in memory or flash, the index quickly becomes a performance and scalability bottleneck if its size is not controlled. Smaller chunk sizes are good for deduplication, but they make the index very large as data is loaded – and subsequent lookups slow. For example, breaking a 1TB file into 1MB chunks yields a million chunks, but 1KB chunks yield a billion, with obvious search and insert time penalties. Existing inline systems address this index size problem by enforcing larger chunk sizes to keep the index manageable – but when vendors reduce granularity by raising the minimum chunk size, they reduce deduplication ratios, often dramatically.
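The hash-index approach and the chunk-size tradeoff described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the data pattern is contrived so that small chunks find the shared content while large chunks miss it.

```python
import hashlib

def dedupe_chunks(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks; keep one copy of each unique chunk."""
    index = {}    # hash -> slot in the unique-chunk store (the index described above)
    store = []    # unique chunks actually written to disk
    layout = []   # the file, expressed as pointers into the store
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:          # unique chunk: store it
            index[digest] = len(store)
            store.append(chunk)
        layout.append(index[digest])     # duplicate: just a pointer
    return index, store, layout

# 64 KiB of data: eight 8 KiB blocks that share 7 KiB of content each.
data = b"".join(b"A" * 7168 + bytes([i]) * 1024 for i in range(8))
for size in (1024, 8192):
    index, store, layout = dedupe_chunks(data, size)
    stored = sum(len(c) for c in store)
    print(f"chunk={size}: index entries={len(index)}, stored {stored} of {len(data)} bytes")
```

With 1KB chunks, the sketch stores just nine unique chunks; with 8KB chunks, every chunk is unique and nothing is saved – but each cut in chunk size multiplies the number of index entries the system must search and insert.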
What’s needed is a modern, federated, distributed, object-based file system. An object-oriented federated file system is a per-application global file naming facility that the application can use to access files across the cluster in a location-independent manner. This delivers tremendous flexibility: data can be written in any way, including traditional methods, while dedupe is injected into the process. And using federated, multilayered fast-lookup objects instead of an old-fashioned centralized index means far greater speed and scalability.
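The article does not spell out how these federated lookup objects work internally. As a rough illustration of the general idea – partitioning the hash index so that no single centralized structure must hold, and search, every entry – here is a toy sketch (the class and routing scheme are hypothetical, not Sepaton's design):

```python
import hashlib

class ShardedIndex:
    """Toy partitioned hash index: each shard (imagine one per node) holds
    only a slice of the entries, so no single index grows without bound."""
    def __init__(self, n_shards: int):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard(self, digest: str) -> dict:
        # Route by the leading bits of the hash itself, so lookups go
        # straight to the one shard that could contain the entry.
        return self.shards[int(digest[:4], 16) % len(self.shards)]

    def insert(self, digest: str, location: int) -> None:
        self._shard(digest)[digest] = location

    def lookup(self, digest: str):
        return self._shard(digest).get(digest)

idx = ShardedIndex(4)
d = hashlib.sha256(b"some chunk").hexdigest()
idx.insert(d, 17)
print(idx.lookup(d))  # 17
```

Because each node searches only its own partition, index lookups can proceed in parallel across the cluster instead of serializing on one structure.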
Better deduplication technology is just the start. A modern GUI built on the latest web tools, allowing access from anywhere (including mobile devices) and providing fast, easy integration with external management environments and tools, is also required. For example, interfaces built on a REST (Representational State Transfer) API – the leading web API design model – let systems integrate with reporting tools, spreadsheets, scripts and other tools in the management environment, allowing users to track the status of jobs, performance, dedupe rates and similar metrics.
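A script consuming such a REST interface might look like the following. The endpoint path and field names here are purely illustrative (no specific product's API is being quoted); the JSON is inlined so the sketch is self-contained, where a real client would fetch it over HTTP.

```python
import json

# Hypothetical body returned by a monitoring endpoint such as
# GET /api/v1/jobs/42 – the path and fields are invented for illustration.
response_body = '''{
  "job_id": 42,
  "status": "complete",
  "bytes_ingested": 1099511627776,
  "bytes_stored": 109951162777
}'''

job = json.loads(response_body)
ratio = job["bytes_ingested"] / job["bytes_stored"]
print(f"job {job['job_id']}: {job['status']}, dedupe ratio {ratio:.1f}:1")
```

Because the payload is plain JSON over HTTP, the same few lines work from a spreadsheet macro, a cron job or a monitoring dashboard – the integration point the paragraph above describes.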
Pay-as-you-grow grid scalability is another key criterion of a modern data protection system: a grid that runs optimized even as a single system, and that continuously rebalances workloads as nodes and capacity are added, delivering seamless scalability in both capacity and performance.
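The article does not describe the rebalancing mechanism itself. One widely used technique for this kind of incremental rebalancing is consistent hashing, in which adding a node relocates only a fraction of the data rather than reshuffling everything. A minimal sketch, assuming nothing about the vendor's actual implementation:

```python
import bisect
import hashlib

def point(key: str) -> int:
    """Map a string onto the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each node owns many small arcs of the
    hash space, so adding a node takes over only slivers of each arc."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((point(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First ring position at or after the key's hash, wrapping around.
        i = bisect.bisect(self.points, point(key)) % len(self.ring)
        return self.ring[i][1]

chunks = [f"chunk-{i}" for i in range(10000)]
before = HashRing(["node1", "node2", "node3"])
after = HashRing(["node1", "node2", "node3", "node4"])
moved = sum(before.node_for(c) != after.node_for(c) for c in chunks)
print(f"{moved / len(chunks):.0%} of chunks relocate")
```

Going from three nodes to four moves only about a quarter of the chunks – the new node's share – which is what makes capacity additions seamless rather than disruptive.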
Further architectural flexibility comes from broad protocol support designed to work as well in web and virtualization environments as it does in traditional environments where NAS protocols dominate. Support for legacy NAS protocols – including NFS, CIFS and NDMP, along with OST – as well as popular web-based protocols such as REST provides very fast, parallel, dedupe-enabled ingest and egress. This flexibility helps remove storage bottlenecks and increase scalability.
Data centers face the unremitting challenge of storage growth, increased complexity and budget pressure. Effective deduplication is a critical tool for meeting these challenges. But existing deduplication products haven’t kept pace and are increasingly outmoded – wasting IT staff time, threatening SLAs and leading to expensive appliance sprawl and complexity.
Peter Quirk, director of product management, Sepaton, Inc., has more than 20 years of product management experience, overseeing the development of products that address the information management needs of large enterprises with emphasis on storage, archiving, classification, HSM and data protection solutions.