December 13, 2024
In the summer of 2023, a handful of high-tech companies started discussing the shortcomings of current networks. In particular, large-scale vendors of processors, GPUs and networking hardware were concerned about the scale and speed of interconnects needed for modern workloads.
These concerns become particularly acute with the increasing density of processor cores, GPUs and PCIe lanes on the latest-generation hardware. Ultra Ethernet was the result.
Ultra Ethernet was conceived to improve Ethernet for large-scale AI and High Performance Computing (HPC) workloads, including facilitating RDMA. Generally, this involves drastically increasing the number of connected endpoints, to as many as a million. Getting there requires improving reliability without the overhead of retransmissions, relieving congestion, and providing better end-to-end signaling between the connected devices.
These goals can be addressed at the physical link, transport and software layers in a variety of ways. An additional goal is a converged network: a single data center network used for memory interconnects, storage protocols and what is currently considered normal network traffic.
At the physical link layer, Ultra Ethernet is interested in sane redundancy. This could mean multipathing that is not possible in today's spanning-tree topologies. It would also include error recovery methods, such as detecting and handling losses caused by bit errors. Ultra Ethernet recognizes these shortcomings in both Data Center Bridging (DCB), which is baked into current Ethernet standards, and NVMe.
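To make the multipathing point concrete, here is a minimal sketch that contrasts a spanning-tree style topology, where all frames between two endpoints share one active path, with spreading frames across several equal paths. The round-robin spraying policy and the function names are assumptions chosen for illustration; the text above does not say how Ultra Ethernet selects paths.

```python
from collections import Counter
from itertools import cycle


def single_path_load(num_frames, paths):
    """Spanning-tree style: only the first path is active, so it carries everything."""
    return Counter({paths[0]: num_frames})


def sprayed_load(num_frames, paths):
    """Hypothetical multipath policy: hand frames to the paths in round-robin order."""
    load = Counter()
    for _, path in zip(range(num_frames), cycle(paths)):
        load[path] += 1
    return load


if __name__ == "__main__":
    paths = ["A", "B", "C", "D"]
    print("single path:", dict(single_path_load(8, paths)))
    print("sprayed:    ", dict(sprayed_load(8, paths)))
```

In this toy model the single active path carries all eight frames, while spraying leaves each of the four paths with two. A real fabric would also have to cope with the reordering that spraying introduces.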
Ultra Ethernet envisions more robust transport services, taking over many of the features now provided by TCP. Ultra Ethernet aims to be lossless, provide reliable delivery, and allow frames to be delivered either in order or out of order. It also envisions single-frame redelivery, doing away with TCP's need to retransmit a sequence to fill gaps. The plan also includes handling congestion management, obviating TCP windows.
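The difference between resending a whole sequence and redelivering single frames can be sketched with a toy model. It assumes a sender that has already sent frames 0 through total_frames - 1 and then learns which ones were lost. The go-back-N function is a simplification standing in for "retransmit the sequence to fill the gap"; it is not a faithful model of TCP, and neither function is taken from the Ultra Ethernet specification. All names are hypothetical.

```python
from typing import List, Set


def go_back_n_resend(total_frames: int, lost: Set[int]) -> List[int]:
    """Resend everything from the first missing frame to the end of the burst."""
    if not lost:
        return []
    first_gap = min(lost)
    return list(range(first_gap, total_frames))


def single_frame_resend(total_frames: int, lost: Set[int]) -> List[int]:
    """Resend only the frames that were actually lost."""
    return sorted(f for f in lost if 0 <= f < total_frames)


if __name__ == "__main__":
    lost = {3, 7}
    print("go-back-N resends:   ", go_back_n_resend(10, lost))
    print("single-frame resends:", single_frame_resend(10, lost))
```

With frames 3 and 7 lost out of ten, the sequence-based approach resends seven frames (3 through 9), while single-frame redelivery resends two. The receiver in the second case has to accept and hold frames that arrive out of order.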
Finally, Ultra Ethernet is looking at changing the software bindings of today's Ethernet and TCP/IP stacks. The open-source libfabric project provides an API compatible with Ultra Ethernet. It also provides API access for a smorgasbord of applications, including the ability to use offload engines. The thinking is that there are many ways of doing the same thing, and Ultra Ethernet aims to accommodate them.
So where are all of these ideas, and the work toward Ultra Ethernet, coming from? Ethernet has long had some well-understood deficiencies. But Ethernet switching is ubiquitous, well supported and comparatively very inexpensive per port.
The Ultra Ethernet Consortium (UEC) was formed in July 2023 by a handful of companies interested in moving networking forward. Since then, the consortium has grown to nearly 60 members, divided into three tiers depending on their level of involvement. Though not every member may know the end goal, or even the intermediate steps along the way, many of them have an interest in what Ultra Ethernet promises.
Many of the members could have individual features in mind that suit their purposes or align with their product development. In some cases, members are likely working on different implementations of similar ideas, or on pieces that directly complement other members' work. The consortium works toward a common vision of the next generation of networking, but isn't necessarily working in lockstep on design or development.
This means that while the goals enumerated above all loosely fall under the purview of the consortium, they are being pursued independently, and quite possibly redundantly, by the members in furtherance of their own interests.
Manifold challenges face Ultra Ethernet. Today's Ethernet and networking advances continue to be a juggernaut that might prove difficult to take on. Ethernet speeds continue to increase, and new offload capabilities crop up with each new protocol.
CXL and NVMe are supported by consortia and standards that look to tackle some of the same difficulties that UEC sees. While UEC is admirably going after the issues holistically, its efforts run side by side with other industry efforts.
Ultra Ethernet needs to maintain its essence as general-purpose networking. This has been a stumbling block for many other memory- and storage-focused networking efforts. Endeavors to build a channel-like or bus-like connection between endpoints failed to appropriately multiplex connections, handle congestion, and tolerate a lack of end-to-end signaling. FCoE was a notable example of this.
Finally, as large as UEC has grown, its members need to be able to agree upon approaches. Sixty or more members can provide competing solutions to the same problems, and corporate interests can spark infighting in an effort that large. The tiered hierarchy of UEC probably helps in facilitating collaboration. The ability to compare and prove the capabilities of differing solutions would also prove valuable.
Magnition provides a system design toolset for building large-scale, complex behavioral discrete models of software and system configurations. This is accomplished through interconnected, modular simulations of the individual components.
Interconnections and configurations can be varied quickly, exposing opportunities to optimize the design. Additionally, Magnition simulations are designed to operate on real workloads, simulating the way the software or systems would be used in application and production environments.
Ultra Ethernet would be applicable to a variety of intensive workloads, such as LLM training and inference, electronic design automation, genetic modeling, and large-dataset tasks.
Magnition is able to provide a framework for designing and testing different Ultra Ethernet capabilities and comparing them on a variety of workloads. Competing designs or technical differences can be proven and optimized in design analysis, well before a standard or proposal is put forward.
Magnition has demonstrated experience in providing valuable design-phase tools and expertise for complex system configurations.
There is a lot more to learn about Ultra Ethernet and Magnition. Check it out.