NVIDIA "Vera" CPUs: Hardware Compatibility Challenges with Third-Party Accelerators
NVIDIA has introduced its "Vera" CPUs as standalone system-on-chips (SoCs) for the data center, positioning them as direct competitors to Intel Xeon and AMD EPYC processors. Unlike general-purpose server CPUs, which are designed to work with a wide range of third-party accelerators, the "Vera" generation is optimized for pairing with NVIDIA GPUs. That focus has surfaced a significant hardware compatibility issue when "Vera" CPUs are paired with non-NVIDIA graphics cards or accelerators.
Understanding the Hardware Bug in NVIDIA "Vera" CPUs
The core of the issue lies in the way "Vera" CPUs handle PCI Express (PCIe) memory addressing. Under certain conditions, the PCIe controllers within these CPUs generate invalid memory addresses during Memory-Mapped I/O (MMIO) write operations. The problem is most pronounced when the CPU issues writes with partial byte enables (writes that touch only some of the bytes in a 4-byte PCIe doubleword) to MMIO regions mapped with Arm's "Normal Non-Cacheable" (MT_NORMAL_NC) memory attribute.
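To make "partial byte enable" concrete: a PCIe write request carries two 4-bit byte-enable fields, one for the first doubleword (DW) of the payload and one for the last, and a write is partial when either field masks out some bytes. The Python sketch below computes those fields for a contiguous write. The function names are hypothetical and chosen for illustration. The sketch models only the standard byte-enable convention, not the specific conditions that trigger the "Vera" fault, which are not detailed here.

```python
def dword_byte_enables(addr: int, length: int) -> tuple[int, int]:
    """Return (first_dw_be, last_dw_be) for a contiguous write of `length` bytes.

    Bit i of each 4-bit mask enables byte i of that doubleword.
    For a write contained in a single doubleword, last_dw_be is 0.
    """
    if length <= 0:
        raise ValueError("length must be positive")

    last_byte = addr + length - 1
    start = addr & 3        # byte offset of the first byte within its DW
    end = last_byte & 3     # byte offset of the last byte within its DW

    if (addr >> 2) == (last_byte >> 2):
        # Single-DW write: enable bytes start..end, no last-DW field.
        first_be = ((1 << (end + 1)) - 1) & (0xF << start)
        return first_be, 0x0

    first_be = (0xF << start) & 0xF   # bytes start..3 of the first DW
    last_be = (1 << (end + 1)) - 1    # bytes 0..end of the last DW
    return first_be, last_be


def is_partial_write(addr: int, length: int) -> bool:
    """True if any doubleword of the write has some bytes masked out."""
    first_be, last_be = dword_byte_enables(addr, length)
    single_dw = (addr >> 2) == ((addr + length - 1) >> 2)
    return first_be != 0xF or (not single_dw and last_be != 0xF)
```

Under this model, a 2-byte store at address 0x1001 gives a first-DW mask of 0b0110 and counts as partial, while an aligned 8-byte store at 0x1000 enables every byte in both doublewords and does not.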
The relaxed ordering that Arm permits for Normal Non-Cacheable memory, where accesses may be merged and reordered, can exacerbate the issue, leading to erroneous address generation, data corruption, and even PCIe device failures. These failures are most likely to occur during workloads that make heavy use of Direct Memory Access (DMA), such as artificial intelligence (AI) training or high-performance computing (HPC) simulations. While NVIDIA GPUs are engineered to work with "Vera" CPUs and their specific memory ordering requirements, third-party accelerators such as AMD GPUs may encounter reliability problems, including system installation failures and unstable operation.
Software Workarounds and Performance Implications
To address these compatibility challenges, NVIDIA has implemented hardware-specific workarounds within its custom Linux kernels. In particular, NVIDIA's NV-Kernel repository includes a Linux kernel patch that converts the MT_NORMAL_NC memory attribute to Device-nGnRE (non-Gathering, non-Reordering, Early Write Acknowledgement). This conversion enforces stricter memory ordering for DMA-coherent mappings: writes are no longer merged or reordered on their way to the device, which reduces the risk of data corruption and device failure. The trade-off is that the stricter attribute can add latency in I/O-sensitive workloads, potentially reducing performance relative to the default Normal Non-Cacheable configuration.
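The effect of that conversion can be modelled in a few lines. The Python sketch below is not the NV-Kernel patch; it is a toy model, with hypothetical names (`MemType`, `effective_type`, `quirk_active`), of remapping one memory type to another when a workaround is enabled. The 8-bit values are the architectural MAIR_ELx attribute encodings for each type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemType:
    name: str
    mair_attr: int    # architectural MAIR_ELx attribute encoding
    gathering: bool   # may the system merge several accesses into one?
    reordering: bool  # may accesses to the region be reordered?


# Normal Non-Cacheable: merging and reordering are permitted.
NORMAL_NC = MemType("Normal-NC", 0x44, gathering=True, reordering=True)
# Device-nGnRE: no gathering, no reordering, early write ack allowed.
DEVICE_nGnRE = MemType("Device-nGnRE", 0x04, gathering=False, reordering=False)
# Device-nGnRnE: the strictest Device type, shown for comparison.
DEVICE_nGnRnE = MemType("Device-nGnRnE", 0x00, gathering=False, reordering=False)


def effective_type(requested: MemType, quirk_active: bool) -> MemType:
    """Hypothetical model of the workaround: when the quirk is active,
    a request for Normal-NC is served as Device-nGnRE instead.
    Every other memory type is passed through unchanged."""
    if quirk_active and requested is NORMAL_NC:
        return DEVICE_nGnRE
    return requested
```

With `quirk_active=True`, `effective_type(NORMAL_NC, True)` returns `DEVICE_nGnRE`, so the mapping loses the gathering and reordering permissions that the Normal-NC path relied on for throughput. That loss is where the latency cost described above comes from.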
Industry Context: Similar Issues with Other Arm-Based CPUs
NVIDIA is not alone in facing these hardware compatibility challenges. Ampere Computing's Altra Arm-based CPUs have exhibited similar behavior, with their PCIe controllers generating invalid addresses during MMIO writes under specific workloads. Ampere has also relied on Linux kernel-based runtime modifications to resolve these issues. The recurrence of this problem across different vendors suggests that the root cause may be linked to Arm's memory handling for external devices, particularly in the context of relaxed memory ordering.
Notably, no performance degradation has been reported for Ampere's software-based solutions, which suggests that such kernel-level workarounds can maintain system reliability without significant trade-offs, at least for the workloads reported on so far.
Conclusion
The release of NVIDIA's "Vera" CPUs marks a significant step in the evolution of data center hardware, but it also highlights the complexities of ensuring broad compatibility in heterogeneous computing environments. As more vendors adopt Arm-based architectures and pursue tight integration between CPUs and accelerators, addressing memory ordering and PCIe compatibility issues will remain a critical focus for both hardware and software developers.