xAI Shows How Hard It Is to Use a Lot of GPUs at Once

Art by Mike Sullivan

AI developers have been desperately scrambling to get ahold of Nvidia server chips lately, as we wrote last week. When developers do get the graphics processing units, they’re under a lot of pressure to wring as much performance as possible from that expensive hardware.

That’s easier said than done. Training AI models can be “bursty,” meaning that there can be sudden spikes in GPU usage followed by periods of lower activity when researchers analyze the results and decide what to do next. This leads to what researchers refer to as a low utilization rate, meaning they aren’t getting the most bang for their GPU buck. (This is less of a problem in AI inference involving finished models, which developers can run in more predictable or consistent ways.)

Even the biggest AI firms have problems in this regard. Elon Musk’s xAI, for instance, has around 500,000 Nvidia GPUs, one of the largest collections among AI developers based on what they’ve publicly disclosed. But xAI’s Model FLOPs Utilization (MFU), a measure of how much of those chips’ theoretical peak computing power its training runs actually use, was around 11% in recent weeks, according to a person who saw the data in an internal memo. (Business Insider earlier reported on the memo.) MFU indicates how effectively a developer is using its chips: a rate of 100% would imply full utilization.
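For readers who want to see the arithmetic, MFU is commonly computed as the useful model FLOPs a training run achieves per second divided by the combined theoretical peak of the chips it runs on. The sketch below is illustrative only: the function names are hypothetical, the "6 x parameters x tokens" estimate is a widely used rule of thumb for dense transformer training rather than anything from the memo, and every number in the example is made up, not xAI's.

```python
def training_flops_per_sec(num_params: float, tokens_per_sec: float) -> float:
    """Estimate useful training FLOPs per second.

    Assumption: the common rule of thumb of roughly 6 FLOPs per parameter
    per token (forward plus backward pass) for a dense transformer.
    """
    return 6.0 * num_params * tokens_per_sec


def model_flops_utilization(
    achieved_flops_per_sec: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Return MFU as a fraction: achieved FLOPs/s over cluster peak FLOPs/s."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    return achieved_flops_per_sec / (num_gpus * peak_flops_per_gpu)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
achieved = training_flops_per_sec(num_params=1e11, tokens_per_sec=2e5)
mfu = model_flops_utilization(
    achieved_flops_per_sec=achieved,
    num_gpus=1000,
    peak_flops_per_gpu=1e15,  # assumed peak per chip, not a real spec
)
print(f"MFU: {mfu:.1%}")
```

In this made-up case the run achieves about 1.2e17 FLOPs per second against a cluster peak of 1e18, for an MFU of roughly 12%. Time the chips spend idle between bursts, or waiting on data and communication, lowers the numerator without changing the denominator.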

To be fair to xAI, everyone struggles with GPU utilization, and a researcher at a rival firm said cracking 40% was difficult for most of xAI’s competitors. But a rate of 11% is appallingly low, the researcher said. And it’s especially surprising given that xAI has a reputation for setting up GPUs the way Nvidia recommends.