Machine Learning’s Monster Appetite: How AI Growth Is Pushing Optical Limits and Creating Fiber Management Challenges



Originally posted on Telescent

The ever-growing complexity of machine learning (ML) models is fueling a data deluge. These models, used in everything from recommendation engines to large language models (LLMs), require massive amounts of data for training. While Moore’s Law has dominated computer hardware development for more than four decades, the growth in data for training is outpacing it. Rather than doubling every two years, the size of the data sets used in large language models has grown 10-fold every year. Since the size of these data sets drives data exchange during ML training iterations, this translates to a growing need for optical bandwidth to minimize the idle time of the GPUs.
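To make the gap concrete, the short Python sketch below compares the two growth rates quoted above: a doubling every two years versus a 10-fold increase every year. The function names and the five-year horizon are illustrative choices, not figures from the original post.

```python
def moores_law_growth(years: float) -> float:
    """Growth factor for a quantity that doubles every two years."""
    return 2.0 ** (years / 2.0)


def dataset_growth(years: float) -> float:
    """Growth factor for a quantity that grows 10-fold every year."""
    return 10.0 ** years


if __name__ == "__main__":
    # Illustrative horizon of five years; any horizon shows the same divergence.
    for year in range(1, 6):
        hardware = moores_law_growth(year)
        data = dataset_growth(year)
        print(
            f"year {year}: hardware x{hardware:,.2f}, "
            f"data x{data:,.0f}, gap x{data / hardware:,.0f}"
        )
```

Under these two rates, five years of doubling every two years yields a factor of about 5.7, while five years of 10-fold annual growth yields a factor of 100,000.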

However, our current optical networks are struggling to keep pace. Traditionally, the solution has been to increase optical line rates, the speed at which data travels through a single fiber. But pushing these rates ever higher becomes increasingly challenging beyond 200 Gbps because of physical limitations. Wavelength division multiplexing (WDM) packs more data onto a single fiber by transmitting on several wavelengths at once, and it has been used extensively in long-haul optical networks. But WDM increases cost because of its complexity and the additional optical components it requires.

This is where parallel optics comes in. Imagine a highway with multiple lanes: parallel optics uses multiple fibers to transmit data simultaneously, effectively creating a multi-lane information superhighway. This approach increases bandwidth while keeping individual line rates manageable. The trend is evident in the standards for 800G and 1.6T optical transceivers. While there are options to meet the various use cases for transmission distance and power requirements, the standards bodies are defining pluggable form factors with 8, 16 and even 32 fibers per module to meet the transmission speed requirements of machine learning applications.
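The lane arithmetic behind parallel optics can be sketched in a few lines of Python. This is a simplified illustration, assuming one transmit fiber and one receive fiber per lane and no WDM inside the module. Real transceiver variants differ, and the function names are hypothetical.

```python
import math


def lanes_needed(module_gbps: int, lane_gbps: int) -> int:
    """Number of parallel lanes needed to reach the module's aggregate rate."""
    return math.ceil(module_gbps / lane_gbps)


def fibers_per_module(module_gbps: int, lane_gbps: int) -> int:
    """Fiber count, ASSUMING one transmit and one receive fiber per lane (no WDM)."""
    return 2 * lanes_needed(module_gbps, lane_gbps)


if __name__ == "__main__":
    # (aggregate module rate, per-lane line rate), both in Gbps
    for module_gbps, lane_gbps in [(800, 100), (1600, 200), (1600, 100)]:
        lanes = lanes_needed(module_gbps, lane_gbps)
        fibers = fibers_per_module(module_gbps, lane_gbps)
        print(f"{module_gbps}G at {lane_gbps}G per lane: {lanes} lanes, {fibers} fibers")
```

Under this assumption, a 1.6T module built from 100 Gbps lanes needs 32 fibers, while the same module built from 200 Gbps lanes needs 16. This is the trade-off between line rate and fiber count that the paragraph above describes.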

But with more fibers comes a new challenge: fiber management. The sheer number of fibers needed to support large GPU clusters training massive machine learning models can quickly become overwhelming for data center operators. Here’s where robotic automation steps up as a game changer.

Robot Revolution: Automating Fiber Management for the AI Era

As discussed, parallel optics using multiple fibers is key to handling the bandwidth demands of ever-larger machine learning GPU clusters. But managing tens of thousands of fibers in an ML cluster can become a logistical nightmare. This is where robotic automation comes in, not just for managing a multitude of fibers, but also for improving performance of the optical network.

Robotic fiber management systems utilize robotic arms equipped with specialized grippers designed for handling fiber optic cables and connectors. These robots can efficiently connect, disconnect, and route individual fibers within high-density panels that pack hundreds to thousands of connectors into a single rack unit. A key advancement made by Telescent was the development of a routing algorithm that allows the robot to weave a fiber around all the existing connections without entangling or knotting, no matter what prior connections were made.

Another benefit of a robotic system is that it doesn’t care what type of connector it is moving. It can just as easily reconfigure 16 fibers in an MPO (Multi-Fiber Push On) connector as a simplex fiber in an LC connector, which makes it a very good match for the new parallel optic transceivers. This is very different from other technologies such as MEMS or piezo, which can’t easily scale to systems with thousands of fibers using multi-fiber connectors, since each port in these technologies can only manage a single fiber.

Robotic systems can be programmed to precisely connect and disconnect these MPO connectors, ensuring secure and reliable connections at a density of over 10,000 fibers per rack, which makes efficient use of precious data center real estate. Additionally, a robotic system can include a cleaning step during reconfigurations, maintaining low-loss performance even with high-density MPO connectors. This low-loss performance is even more critical for new high-speed optics because of their reduced link budget.
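A rough way to see why connector loss matters more as link budgets shrink is to subtract the losses from the budget and look at the remaining margin. The Python sketch below does this. Every number in it (budget, fiber length, attenuation, per-connection loss) is a hypothetical placeholder chosen for illustration, not a value from the original post or from any transceiver specification.

```python
def link_margin_db(
    budget_db: float,
    fiber_km: float,
    fiber_loss_db_per_km: float,
    connections: int,
    loss_per_connection_db: float,
) -> float:
    """Margin left after subtracting fiber attenuation and connector losses."""
    total_loss_db = fiber_km * fiber_loss_db_per_km + connections * loss_per_connection_db
    return budget_db - total_loss_db


if __name__ == "__main__":
    # All values below are hypothetical and purely illustrative.
    common = dict(fiber_km=0.5, fiber_loss_db_per_km=0.4, connections=4)
    for budget_db in (4.0, 3.0):
        clean = link_margin_db(budget_db, loss_per_connection_db=0.35, **common)
        dirty = link_margin_db(budget_db, loss_per_connection_db=0.75, **common)
        print(f"budget {budget_db} dB: clean margin {clean:.2f} dB, dirty margin {dirty:.2f} dB")
```

With these placeholder numbers, the link with contaminated connectors still closes on the larger budget but runs out of margin on the smaller one. This is the situation a cleaning step during reconfiguration is meant to avoid.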

But the benefits of robotic automation extend beyond just physical manipulation. Here are more reasons why automation is a game-changer for high-density fiber networks:

Remote Diagnostics: Data centers are often sprawling facilities, and manually troubleshooting fiber connectivity issues can be time-consuming and error-prone. Robotic systems can be integrated with network monitoring hardware such as power monitors and OTDRs, enabling remote diagnostics and allowing technicians to quickly diagnose the issue without physically needing to access the equipment. This not only saves valuable time but can also minimize disruption to critical network operations.
Avoid Human Errors: Even with the best technicians, managing thousands of fibers manually will inevitably lead to errors. Robotic control avoids these errors, and every change is tracked in a database.
Remote Reconfigurability: Robotic control of the fiber connectivity allows for automated restriping of the data center network during expansion or bandwidth optimization during machine learning training. Prior work has demonstrated a 3.4x reduction in training time through network optimization.
Provides Optical Transparency Across Network: According to results published by Google, replacing an electrical switch layer with an optical circuit switch resulted in a 41% reduction in power usage and a 30% reduction in capital expense. Another benefit of an optical circuit switch is that it creates optical transparency across the network (avoiding optical-electrical-optical conversion points), which allows interoperability even when different generations of transceivers share the network.
Offers Low Latency Operation: Pure optical interconnects offer the lowest-latency connection, which is critical for optimal performance of AI/ML GPU clusters. Low latency also matters for RDMA, which depends on high-throughput, low-latency transfer of information between compute nodes.
Maintains Cleanliness of Fibers: As mentioned earlier, a cleaning step can be included in the reconfiguration process, maintaining low loss performance during the life of the equipment. This is especially important as the link budgets are reduced at higher bandwidths.
Accurate Inventory of All Fibers: By using a robotic patch panel, the state of all connections is known with machine accuracy at all times.
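The inventory and change-tracking points above can be pictured with a small data structure. The Python sketch below is a hypothetical in-memory model, not Telescent's software or API. The class name, port labels, and log format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PatchPanel:
    """Hypothetical model of a robotic patch panel's connection state."""

    connections: dict = field(default_factory=dict)  # input port -> output port
    audit_log: list = field(default_factory=list)    # (timestamp, action, src, dst)

    def _record(self, action: str, src: str, dst: str) -> None:
        timestamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((timestamp, action, src, dst))

    def connect(self, src: str, dst: str) -> None:
        """Create a connection, refusing to reuse a port that is already occupied."""
        if src in self.connections:
            raise ValueError(f"{src} is already connected to {self.connections[src]}")
        if dst in self.connections.values():
            raise ValueError(f"{dst} is already in use")
        self.connections[src] = dst
        self._record("connect", src, dst)

    def disconnect(self, src: str) -> None:
        """Remove a connection; raises KeyError if src is not connected."""
        dst = self.connections.pop(src)
        self._record("disconnect", src, dst)


if __name__ == "__main__":
    panel = PatchPanel()
    panel.connect("in-1", "out-7")
    panel.disconnect("in-1")
    for entry in panel.audit_log:
        print(entry)
```

Because every change goes through `connect` or `disconnect`, the current state and the full history stay consistent with each other, which is the property a manually patched panel cannot guarantee.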

While automated fiber management offers a range of benefits, new technology will only be adopted if it also meets economic requirements. In this case, those requirements include both the cost per fiber and the cost of the data center real estate needed for the fiber management systems. Other technologies used for optical circuit switches include MEMS- and piezo-based systems. However, these systems are very expensive per port and are limited to a few hundred fibers per system, or fewer than 1,000 fibers per rack. In contrast, the robotic system from Telescent using MPO connectors can scale to over 12,500 fibers per rack. At this scale, the cost is dominated by the fiber rather than the robotics, so adding robotics brings minimal additional expense compared to a manual patch panel system.
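The real estate side of this comparison is simple arithmetic. The Python sketch below uses the per-rack densities quoted above (12,500 and 1,000 fibers per rack) together with a hypothetical cluster of 50,000 fibers. The cluster size and the function name are assumptions for illustration only.

```python
import math


def racks_required(total_fibers: int, fibers_per_rack: int) -> int:
    """Whole racks needed to terminate a given number of fibers."""
    return math.ceil(total_fibers / fibers_per_rack)


if __name__ == "__main__":
    cluster_fibers = 50_000  # hypothetical cluster size, for illustration only

    robotic = racks_required(cluster_fibers, 12_500)
    # 1,000 fibers per rack is the upper bound quoted for MEMS and piezo systems,
    # so the resulting rack count is a lower bound for those technologies.
    mems_or_piezo = racks_required(cluster_fibers, 1_000)

    print(f"robotic: {robotic} racks, MEMS/piezo: at least {mems_or_piezo} racks")
```

For this hypothetical cluster the difference is 4 racks versus at least 50, which is why density per rack appears alongside cost per fiber as an economic requirement.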

