Compute Express Link (CXL) is an emerging technology that provides a new, standardized communication channel for CPUs. Layered on top of the PCIe physical interface, it provides cache-coherent CPU-to-device and CPU-to-memory capabilities.
Most commonly, its intended use is to connect additional memory to a CPU. I’ve previously written about it being one of the most recent steps in composable infrastructure, breaking the fixed relationship between CPU and memory that’s been dictated by the motherboard for the past two decades.
This article summarizes several presentations Magnition has done for various SNIA conferences, such as this one.
CXL conceptually allows a server to access racks of memory, keeping pace with the high core counts of modern CPUs and the demands of memory-intensive workloads. It’s even been touted as a way of recycling memory from systems that have otherwise outlived their usefulness.
For all the positives of CXL, it introduces a new complexity: numerous banks of memory with widely varying access speeds. The CXL links themselves add latency that varies with their distance and topology. The end result is memory with a wide range of response times.
Tools to Choose Tiers
CXL memory pools take advantage of the non-uniform memory access (NUMA) capabilities built into modern CPUs. NUMA was originally designed for multi-processor systems in which some memory is attached directly to the local CPU (that is, the CPU a process is running on) while other memory is attached to a partner CPU, yet both remain directly accessible. Because memory attached to the local CPU can be accessed faster than memory attached to a partner, the access latencies differ, or are non-uniform.
CXL latched onto the concept of NUMA: each pool of memory controlled by CXL becomes a new NUMA node for the processor. A NUMA node is a collection of memory connected to a particular device. On x86 architectures, NUMA nodes are enumerated at boot time, and current Linux implementations report “distance” values that weight the relative latency of each node.
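For example, on a hypothetical system (the node count, sizes, and distance values below are invented for illustration, not taken from real hardware), the numactl --hardware command reports each node, its memory, and the distance matrix the kernel has recorded:

```
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64230 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 64412 MB
node 2 cpus:
node 2 size: 131072 MB
node 3 cpus:
node 3 size: 131072 MB
node distances:
node   0   1   2   3
  0:  10  21  24  42
  1:  21  10  24  42
  2:  24  24  10  42
  3:  42  42  42  10
```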
This example also shows nodes 0 and 1 are processor nodes with associated CPU cores. By contrast, nodes 2 and 3 have no cores and represent memory-only CXL nodes.
Linux and VMware ESXi provide methods to provision applications and virtual machines to run in particular NUMA nodes. By default, both will typically allocate memory on the local, or least distant, NUMA node first, then continue to other NUMA nodes as memory is exhausted.
With Linux, the numactl command is used to restrict, prefer, or interleave allocations across NUMA nodes, constraining how particular processes use the memory attached to a system.
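For example, invocations along these lines (the node numbers and the program name my_app are placeholders) bind, prefer, or interleave a process's memory, and numastat can confirm where its pages actually landed:

```
# Bind both CPUs and memory to node 0 (local DDR only)
numactl --cpunodebind=0 --membind=0 ./my_app

# Prefer CXL node 2, spilling to other nodes only if it fills up
numactl --preferred=2 ./my_app

# Interleave allocations across local and CXL memory
numactl --interleave=0,2 ./my_app

# Show which nodes the running process's pages actually landed on
numastat -p $(pidof my_app)
```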
VMware ESXi provides a variety of methods for dealing with different NUMA nodes. Individual VMs can be assigned to particular nodes or to an interleave of nodes, and VMs can expose multiple vNUMA nodes of their own that reflect the NUMA topology of the underlying hardware.
Unpredictability
The primary challenge of CXL with multiple NUMA nodes is getting predictable performance. The latency of the different nodes affects the runtime performance of applications. If applications start in arbitrary order, each grabbing memory from the best NUMA node available at the time, their performance will never be predictable.
The example illustrated here shows six individual NUMA nodes (in purple) for each CPU, each likely to have a different “distance” and latency associated with it. Modern CPUs have hundreds of cores, capable of running many large applications or VMs.
These applications will never run deterministically unless their memory is allocated consistently.
Consistency is the hobgoblin of performance engineers. The previous section discussed how to assign particular NUMA nodes for particular purposes. But you still need to figure out how to correctly make those assignments.
Load Optimization
Different applications and workloads have different memory demands. Some, like AI training and inference, have a large working set in memory and need fast access to all of it. Others, like in-memory databases or cache controllers, have a large working set but only part of it is accessed at any given time.
Figuring out how to allocate memory for these workloads becomes complex. It might even make sense for allocations to move from one NUMA node to another, as with a process whose demands change across stages of execution, or a workload migrated from one processor to another, such as with a VMware vMotion. Dynamically determining need adds to the complexity.
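As a rough sketch of what such a move looks like from user space (the PID and node numbers are placeholders), the migratepages utility that ships with numactl relocates a running process's resident pages between nodes:

```
# Move the resident pages of process 12345 from local node 0 to CXL node 2
migratepages 12345 0 2
```

This relies on the kernel's page-migration support; deciding when such a move pays off is the harder problem.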
There’s currently a SNIA technical working group, the Smart Data Accelerator Interface (SDXI), pursuing assisted migration of memory blocks between CXL tiers. Magnition is a member of this technical working group. SDXI defines interfaces to offload moving memory segments from one NUMA node to another.
Simulating Workloads
Magnition’s core competency is analyzing workload characteristics and providing modeling to drive optimizations. As an experiment, Magnition built a behavioral model of a compute system with multiple cores and three NUMA nodes. The NUMA nodes represented local memory, CXL-attached “near” memory (inside the server), and CXL-attached “far” memory (attached via external PCIe), each with distinct latencies. Three large synthetic applications were built, each using roughly one third of the total available memory on the system.
First, as a baseline, the three applications were run on a model with no CXL and a swap disk.
We then ran simulations, starting each application separately and adding in the other applications at later intervals. The following is representative of the performance results we saw.
We ran three experiments, all with similar results. With a single application (workload), everything resides in memory and latency is negligible compared with the multi-application runs, where swapping is required.
We then modeled the system with the two banks of CXL memory. The near CXL bank had a modeled latency of 100 ns and the far CXL bank a latency of 200 ns. Each bank was equal in size to the local node, and roughly the size of each application’s working set. In this configuration there should be no need for swap at all, because the applications can run entirely in memory.
We started each application after a delay, and used different starting orders in different experiments. In these simulations, memory was allocated on demand following Linux defaults; that is, the nearest memory is allocated first (local DDR, then near CXL, then far CXL). The results varied to some extent based on delay and order, but this is representative: both latency and overall runtime increased as applications were added.
This graph shows one application (workload) running alone, then two applications at the same time, then all three. The latency of each application is broken out to show how inconsistent it is from run to run. This is the type of variation that drives performance engineers nuts. Memory is available to the applications, but latency and runtime vary based on running order.
Using numactl, each application is pinned to a particular NUMA node. Various simulations can determine the best placement, but it’s easy to demonstrate that we now get consistent latency and runtimes.
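A placement along those lines might look like the following sketch; the application names and node assignments are illustrative rather than the ones used in the simulation:

```
# Give each application its own memory tier
numactl --membind=0 ./app_a   # local DDR
numactl --membind=2 ./app_b   # near CXL
numactl --membind=3 ./app_c   # far CXL
```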
Each application has the same memory latency, as shown by the Memory Latency bars. More importantly, the applications run to completion in the same amount of time. Notice from the legend that all three applications complete faster than when CXL memory was allocated at the whim of the OS.
Managing CXL performance and predictability
With explicit placement, latency is consistent and overall completion is faster than when the OS allocated CXL memory dynamically. Deterministic performance depends on smart, workload-aware memory assignment.
That’s the challenge — and the opportunity — of CXL tiering.
CXL enables organizations to right-size memory and break the rigid CPU-memory pairing of traditional architectures. But with flexibility comes complexity: applications now compete for memory pools with radically different latencies. If you don’t explicitly control for this, your system becomes unpredictable, and performance tuning turns into a guessing game.
What’s Next?
If you’re exploring CXL or already testing CXL-enabled architectures, the real differentiator is how well your system adapts to dynamic workloads. Magnition can help you model these environments, simulate the impact of memory allocation strategies, and optimize placement before you go to production.
- Want consistent performance across workloads?
- Curious how your workloads behave across different memory tiers?
- Need to reduce engineering guesswork?
Magnition’s modeling framework is built to do just that. Get in touch to see how you can design and simulate predictable performance from day one.