Originally posted on VMBlog

By Bob Shine, vice president of marketing and product management at Telescent

In the technology world, 2023 was dominated by headlines about machine learning (ML) programs such as ChatGPT and DALL-E. Large language models (LLMs) and generative AI fascinated people with their ability to generate text from almost any prompt, and an image created by the generative AI program Midjourney even won an art contest. This interest in ML has upended hyperscalers' data center growth plans, forcing them to find ways to scale even faster than they have in the past. SemiAnalysis put this in perspective by stating that Microsoft is currently conducting the largest infrastructure buildout humanity has ever seen, with a planned $50 billion investment in AI-centric data centers in 2024.

However, deploying the Graphics Processing Units (GPUs) used for machine learning is unlike deploying the traditional Central Processing Units (CPUs) used in data centers. As an example of the challenges, Meta froze development of a $1.5 billion data center in Alabama to redesign it to handle new AI workloads. While each new generation of hardware offers efficiency improvements over the prior one, the rapid growth of ML and the power demands of new GPU chips are forcing data center operators to adopt new technologies that can deploy these workloads quickly while greatly improving efficiency.

New Automated Optical Switches Will Improve Ability to Scale Data Centers Quickly

Deploying equipment in data centers is done in stages, allowing individual data halls to be brought online and generate revenue as soon as they are completed. However, as additional data halls are built, they need to be connected to the existing halls in a process called re-striping. In the past, this reconnection of all the equipment was done by hand and could involve disconnecting and reconnecting thousands of fiber optic interfaces. The process was slow, and even the best technician could have an error rate of 5%, leading to rework that slowed things down even further.
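To put that error rate in perspective, here is a minimal back-of-envelope sketch in Python. Only the 5% error rate comes from the paragraph above; the fiber count and per-connection times are illustrative assumptions.

    # Back-of-envelope estimate of manual re-striping effort.
    # Only the 5% error rate is taken from the article; the fiber count and
    # per-connection times are illustrative assumptions.
    fibers_to_restripe = 5_000     # assumed fiber connections touched in one re-stripe
    error_rate = 0.05              # best-case manual error rate cited above
    minutes_per_connection = 2     # assumed time to move and verify one fiber
    minutes_per_rework = 15        # assumed time to trace and fix one mis-connection

    base_hours = fibers_to_restripe * minutes_per_connection / 60
    expected_errors = fibers_to_restripe * error_rate
    rework_hours = expected_errors * minutes_per_rework / 60

    print(f"Base effort:     {base_hours:.0f} technician-hours")
    print(f"Expected errors: {expected_errors:.0f} mis-connections")
    print(f"Rework effort:   {rework_hours:.0f} additional technician-hours")

Even under these fairly generous assumptions, a single manual re-stripe costs hundreds of technician-hours and leaves hundreds of mis-connections to chase down, which is the gap that automated optical switching is meant to close.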

Google recently announced the use of optical circuit switches (OCS) to replace electrical spine switches in their network architecture. According to Google, the use of OCS reduced power consumption by over 40% while improving throughput by 30%, incurring 30% less cost and delivering 50x less downtime than the best alternative.

Machine learning clusters will continue to grow to handle ever-larger data sets, with cluster sizes exceeding 10,000 GPUs, and will require new interconnection technologies. In 2024, other hyperscalers will deploy novel OCS, including high-radix robotic OCS that can handle not only over 1,000 ports per system but also connections with 8 or 16 fibers per port, leading to systems managing 10,000 fibers or more.
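The fiber counts follow from simple multiplication; here is a quick sketch using the port and fiber-per-port figures quoted above:

    # Fiber counts for a high-radix robotic OCS, using the figures quoted above.
    ports_per_system = 1_000            # "over 1,000 ports per system"
    for fibers_per_port in (8, 16):
        total_fibers = ports_per_system * fibers_per_port
        print(f"{ports_per_system} ports x {fibers_per_port} fibers/port = {total_fibers:,} fibers")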

High Power Consumption of GPUs Will Drive the Transition to Liquid Cooling

With the newest GPU chips consuming almost an order of magnitude more power than a traditional CPU, removing this heat efficiently requires new technology. While liquid cooling has been discussed for years, the prediction here is that the growing deployment of kilowatt-scale GPUs will make 2024 the year when liquid cooling is deployed at scale.
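A minimal sketch of the rack-level arithmetic behind that prediction; every figure below is an illustrative assumption rather than a measured or vendor-specified value:

    # Rough rack-power arithmetic behind the liquid-cooling prediction.
    # All figures are illustrative assumptions, not vendor specifications.
    gpu_power_w = 1_000        # assumed draw of a kilowatt-scale ML GPU
    cpu_power_w = 150          # assumed draw of a typical server CPU
    gpus_per_server = 8        # assumed GPUs per ML server
    servers_per_rack = 4       # assumed ML servers per rack
    overhead = 1.3             # assumed factor for CPUs, memory, fans, networking

    rack_kw = gpus_per_server * servers_per_rack * gpu_power_w * overhead / 1000
    print(f"GPU vs. CPU power: ~{gpu_power_w / cpu_power_w:.0f}x")
    print(f"Approximate rack power: {rack_kw:.0f} kW")

At roughly 40 kW per rack under these assumptions, well above the 20 kW or so often treated as the practical ceiling for air cooling, the case for liquid cooling becomes straightforward.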

The Need for Power Efficiency to Run GPUs

With GPUs requiring significantly more power than CPUs, the power available to a data center can become a limitation. This increased demand has even caused construction delays and restrictions in locations such as Ashburn, VA; Dublin, Ireland; and Singapore, where the available power infrastructure cannot keep up.
