By JB Baker, Vice President of Products at ScaleFlux
The onslaught of demand for artificial intelligence (AI) systems brings with it a tremendous challenge for data center operations teams, whether they manage a handful of racks or many rows of them: supplying the power to run GPU-based systems. In surveys and discussions with data center operations teams about the challenges they face, 53% of respondents cited “improving power efficiency” as a problem they consistently encounter.
The Power Supply Struggle
At the macro level, data centers are significant global electricity consumers, with regions like Northern Virginia alone requiring the equivalent output of nearly four nuclear reactors. With electricity demand projected to double in the coming years, the strain on power supplies and the environmental impact of such consumption cannot be overstated.
Drilling down to the individual data center, where an infrastructure and operations team or an IT procurement team fights its daily battles, does not improve the picture much. Power supply capacity ranges from a few kilowatts (kW) to single-digit megawatts (MW) for smaller data centers, up to 100MW+ for large and hyperscale facilities. Regardless of scale, many data centers already run close to maximum power capacity, leaving little room for expanding services and adding new hardware.
The Power Density Surge
Replacing general-purpose servers with the GPU-based systems typically deployed for AI workloads is not as simple as swapping them out one for one. GPU-based systems are significantly more power-dense, measured as watts of processor power per server or per rack unit. A rack of traditional x86-based systems may need 10-16kW of power; a full rack of GPU-based systems could draw 65-100kW, up to 10x the power density! That assumes one could (a) supply that much power to the rack and (b) find a means to cool it. Suffice it to say, deploying AI systems and services within established data centers requires tradeoffs and sacrifices to existing services.
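A quick back-of-the-envelope check makes the comparison concrete, using only the rack-level figures cited above (illustrative round ranges, not measurements from any specific facility):

```python
# Rack power density comparison using the ranges cited above.
# These are illustrative figures, not measurements of a specific facility.

X86_RACK_KW = (10, 16)    # traditional x86 rack draw, kW (low, high)
GPU_RACK_KW = (65, 100)   # GPU-based rack draw, kW (low, high)

# Best and worst case from the facility's point of view.
min_ratio = GPU_RACK_KW[0] / X86_RACK_KW[1]   # 65 / 16  ~= 4.1x
max_ratio = GPU_RACK_KW[1] / X86_RACK_KW[0]   # 100 / 10  = 10.0x

print(f"Power density increase: {min_ratio:.1f}x to {max_ratio:.1f}x")
```

Even the low end of that range is several times the power a typical existing rack was provisioned to supply and cool.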
The Efficiency Imperative
Finding enhancements to efficiency, measured as “work per watt of energy consumed,” has gone from a goal to the new gold. Each new generation of processors and electronic components brings some improvement in work per watt. Even the AI infrastructure itself brings improvements to compute capability per watt, which helps meet the demands of AI workloads. However, these evolutionary improvements are not enough, and other innovations must be deployed to further enhance efficiency.
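To illustrate the metric itself, here is a minimal sketch; the throughput and power numbers below are invented for illustration, not benchmarks of any real processor:

```python
# "Work per watt" for two hypothetical processor generations.
# All throughput and power numbers are invented for illustration.

def work_per_watt(ops_per_sec: float, watts: float) -> float:
    """Efficiency: useful operations completed per second, per watt drawn."""
    return ops_per_sec / watts

gen_a = work_per_watt(ops_per_sec=1.0e12, watts=300)  # older generation
gen_b = work_per_watt(ops_per_sec=1.8e12, watts=400)  # newer generation

print(f"Generational efficiency gain: {gen_b / gen_a:.2f}x")  # ~1.35x
```

A gain of roughly 1.35x per generation is real progress, but it cannot keep pace on its own when demand is projected to double.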
These other innovations include new cooling methods and new, data-centric processing architectures. New cooling methods, such as liquid and immersion cooling, both manage the heat that comes with increasing power density and improve cooling efficiency. But they also present challenges: immersion cooling, for example, can require significant retrofitting, which may not be economically or physically viable for all facilities. Data-centric processing and domain-specific compute, which focus on moving computational tasks to the data (instead of moving the data to a central processor) and on using purpose-built compute engines instead of general-purpose cores, both contribute to reducing the energy needed to complete a task.
Computational Storage: A Solution
Computational storage drives (CSDs) have emerged as a promising application of domain-specific compute and data-centric processing to address these challenges. By integrating compute functions directly within storage devices, CSDs allow processing tasks to be distributed more efficiently, reducing the energy required for data movement and enhancing overall system performance. CSDs also integrate dedicated hardware compute engines (HCEs), which consume orders of magnitude less energy to perform their tasks than general-purpose processors. This architectural innovation improves power efficiency and sustainability not only at the device level but across the entire data center infrastructure.
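A minimal, runnable sketch of the data-centric idea follows. The SimulatedCSD class is a toy stand-in invented for illustration; real CSD interfaces are vendor-specific, and this does not represent any actual product API:

```python
# Toy sketch of data-centric processing: push the filter to where the
# data lives instead of moving every byte to the host CPU.

class SimulatedCSD:
    """Toy stand-in for a computational storage drive."""

    def __init__(self, records):
        self._records = records  # data resident on the "drive"

    def read_all(self):
        # Conventional path: every record crosses the bus to the host.
        return list(self._records)

    def filter(self, predicate):
        # Data-centric path: the drive's compute engine applies the
        # predicate locally, so only matching records cross the bus.
        return [r for r in self._records if predicate(r)]


drive = SimulatedCSD([{"id": i, "status": "error" if i % 100 == 0 else "ok"}
                      for i in range(10_000)])

# Host-side filtering moves 10,000 records; device-side moves ~100.
host_result = [r for r in drive.read_all() if r["status"] == "error"]
dev_result = drive.filter(lambda r: r["status"] == "error")
assert host_result == dev_result
print(f"Records moved: host path 10000, device path {len(dev_result)}")
```

In the toy example, the device-side path moves 100x fewer records across the bus; on a real CSD, that reduction in data movement is where much of the energy saving comes from.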
Navigating Expansion within Limits
Data center operators striving to expand their services face significant hurdles, including limited physical space, power supply, and cooling systems. Computational storage is a tool to help navigate these barriers, offering improvements in efficiency that can alleviate the need for extensive physical expansion. By reducing the hardware footprint and electricity usage, CSDs enable data centers to scale their operations more sustainably within existing constraints.
Conclusion
As data centers continue to battle to find the power for their AI initiatives while expanding current services, innovative ways to manage energy consumption and improve operational efficiency are of paramount concern. Computational storage stands out as a key technology in this regard, helping data centers meet their expansion goals while navigating the constraints of power supply and physical space.
About JB Baker and ScaleFlux
JB Baker, Vice President of Products at ScaleFlux, brings over 20 years of experience and insight in data storage to the forefront of computational storage technology. ScaleFlux is redefining the IT infrastructure landscape, championing innovation across the compute, memory, and storage pipeline to unlock unprecedented performance, efficiency, and scalability.
Follow JB on LinkedIn for more insights on data storage technologies and trends.
JB Baker’s X Handle: https://twitter.com/JBBakerSSD
ScaleFlux’s X handle: @ScaleFlux