Random bit flips are far more common than most people, even IT professionals, think. Surprisingly, the problem isn’t widely discussed, even though it silently causes data corruption that can directly impact our jobs, our businesses, and our security. It’s really scary knowing that such corruption is happening in the memory of our computers and servers – before the data even reaches the network and storage portions of the stack. Google’s in-depth study of bit-level DRAM errors showed that uncorrectable errors are a fact of life. And do you remember the time when Amazon had to globally reboot their entire S3 service because of a single bit error?

The Error-Prone Data Trail

Let’s assume for a moment that your data survives its many passes through a system’s DRAM and emerges intact. That data must then be safely transported over a network to the storage system, where it is written to disk. How do you ensure the data remains unaltered along the way? Well, if you’re using one of the storage protocols that lack end-to-end checksums (e.g. NFSv2, NFSv3, SMBv2), your data remains susceptible to random bit flips and data corruption. Even NFSv4 plus Kerberos 5 with integrity checking (krb5i) doesn’t offer true end-to-end checksums: once the data is extracted from the RPC, it is unprotected again. Moreover, NFSv4 has never seen widespread adoption, and even fewer deployments use krb5i.
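Until such protocols are in place, about the only recourse an application has is to checksum its own data before handing it off and to re-verify after reading it back. The short Python sketch below illustrates that pattern; the function names, the choice of SHA-256, and the mount path are illustrative assumptions, not part of any protocol discussed here.

```python
import hashlib
from pathlib import Path

def write_with_digest(path: Path, payload: bytes) -> str:
    """Checksum the data *before* it leaves the application, then write it.

    Nothing below this layer (NFS client, network, server caches, disk) is
    trusted to preserve the bytes unchanged.
    """
    digest = hashlib.sha256(payload).hexdigest()
    path.write_bytes(payload)
    return digest

def read_and_verify(path: Path, expected_digest: str) -> bytes:
    """Read the data back and confirm it still matches the original digest."""
    payload = path.read_bytes()
    if hashlib.sha256(payload).hexdigest() != expected_digest:
        raise IOError(f"end-to-end verification failed for {path}")
    return payload

if __name__ == "__main__":
    target = Path("/mnt/nfs/report.bin")   # hypothetical NFS-mounted path
    digest = write_with_digest(target, b"important payload")
    read_and_verify(target, digest)        # raises if anything changed along the way
```

Few applications actually do this, of course, which is exactly why corruption introduced in transit so often goes unnoticed.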

Over a decade ago, the folks at CERN urged that “checksum mechanisms (…) be implemented and deployed everywhere.” That appeal carries even more weight today, given the storage sizes and daily data-transfer rates we now deal with. Data corruption can no longer be dismissed as a merely “theoretical” issue. And if you think modern applications protect against this problem, I’ve got bad news for you: in 2017, researchers at the University of Wisconsin uncovered serious problems in several well-known, widely used storage systems simply by injecting bit errors into their stored data.
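The Wisconsin experiments boil down to a simple principle: deliberately corrupt stored data, then watch whether the software detects the damage, crashes, or silently serves bad results. Their tooling was far more systematic than this, but the toy Python sketch below conveys the idea (the file path in the comment is purely hypothetical).

```python
import os
import random

def flip_random_bit(path: str, seed=None) -> int:
    """Flip a single random bit of a file in place and return its byte offset.

    A minimal fault-injection helper: corrupt on-disk state on purpose,
    then observe how the application that owns the file reacts.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    offset = rng.randrange(size)
    bit = rng.randrange(8)
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)[0]
        f.seek(offset)
        f.write(bytes([original ^ (1 << bit)]))
    return offset

# Example: flip_random_bit("/var/lib/someapp/data.db"), then restart the
# application and check whether the corruption is reported anywhere.
```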

Checksums Came at a Cost That Is Worth Paying Today

When NFS was designed, file writes and overall data volumes were relatively small, and checksum computation was expensive. Hence, the decision to rely on TCP checksums for data protection seemed reasonable. Unfortunately, the 16-bit TCP checksum proved too weak: on average, roughly one in 65,536 corrupted segments will still pass the check, and certain error patterns evade it every time – which matters when you transfer gigabytes per second. What about Ethernet checksums, you ask? They are indeed stronger. However, they don’t provide end-to-end protection, and the opportunities for data corruption are manifold: cut-through switches that forward frames before the checksum can even be verified and buggy kernel drivers for NICs are just two examples of where things can go horribly wrong.
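To make the weakness concrete, here is a small Python sketch (my own illustrative implementation of the RFC 1071 ones’-complement checksum, not code from any system discussed here): two single-bit flips that cancel each other out in the 16-bit sum slip past the TCP-style checksum, while the CRC-32 used by Ethernet catches them.

```python
import zlib

def internet_checksum(data: bytes) -> int:
    """RFC 1071-style 16-bit ones'-complement checksum, as used by TCP/IP."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

original  = bytes.fromhex("dead beef cafe babe")
# Two single-bit flips in different bytes that exactly cancel out in the
# 16-bit sum: 0xdead -> 0xdeaf (+2) and 0xcafe -> 0xcafc (-2).
corrupted = bytes.fromhex("deaf beef cafc babe")

assert internet_checksum(original) == internet_checksum(corrupted)  # missed!
assert zlib.crc32(original) != zlib.crc32(corrupted)                # caught
```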

Checksums and the End of Silent Data Corruption

Experts have seen such silent data corruption happen, even in mid-sized installations. In one instance, enterprise administrators learned that their data was being corrupted in transit and began investigating the network stack. The culprit turned out to be a driver issue: a kernel update had broken the TCP offload feature of their NICs. Tracking down the problem was both difficult and time-consuming.

That’s where end-to-end checksums come in. In one such design, as soon as the storage system receives data from the operating system, each block (usually 4 KB, but that can be adjusted in the volume configuration) is checksummed. Because this checksum stays with the data block forever, the data is protected – even against software bugs – as it travels through the software stack. The checksum is validated along the path and throughout the life of the data, even at rest when the data isn’t being accessed (via periodic disk scrubbing). All of this is possible because the design does not rely on dated legacy protocols like NFS; instead, it uses an RPC protocol in which each data block, and the message itself, is checksum-protected. And since modern CPUs have built-in CRC32 instructions, there is no longer a performance penalty for using CRCs.
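As a rough sketch of the per-block scheme described above (not the actual implementation of any particular product), the Python snippet below attaches a CRC-32 to every 4 KB block at ingest and re-validates it later, for example on read or during a scrub. A production system would more likely use the hardware-accelerated CRC32C variant; zlib’s plain CRC-32 stands in for it here.

```python
import zlib

BLOCK_SIZE = 4096  # illustrative default; the block size is configurable per volume

def checksum_blocks(payload: bytes) -> list:
    """Split the payload into blocks and attach a CRC-32 to each one.

    The checksum is computed once, at ingest, and then accompanies the
    block everywhere it goes.
    """
    blocks = []
    for off in range(0, len(payload), BLOCK_SIZE):
        block = payload[off:off + BLOCK_SIZE]
        blocks.append((block, zlib.crc32(block)))
    return blocks

def verify_blocks(blocks: list) -> None:
    """Re-validate every block, e.g. on receive, on read, or during a scrub."""
    for i, (block, stored_crc) in enumerate(blocks):
        if zlib.crc32(block) != stored_crc:
            raise IOError(f"silent corruption detected in block {i}")

if __name__ == "__main__":
    protected = checksum_blocks(b"x" * 10_000)
    verify_blocks(protected)                     # passes: data is intact
    block, crc = protected[1]
    tampered = block[:100] + bytes([block[100] ^ 0x01]) + block[101:]
    protected[1] = (tampered, crc)               # simulate a single bit flip
    try:
        verify_blocks(protected)
    except IOError as err:
        print(err)                               # the flipped bit is caught
```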

About the Author:

Björn Kolbeck is the co-founder and CEO of Quobyte. Before taking over the helm at Quobyte, Björn spent time at Google working as tech lead for the hotel finder project (2011–2013) and he was the lead developer for the open-source file system XtreemFS (2006–2011). Björn’s PhD thesis dealt with fault-tolerant replication.