The AI Memory Crisis: WEKA’s CEO on the Industry’s Hidden Bottleneck
Analysts worried about an artificial intelligence bubble often point to the staggering cost of the computing power needed to run the latest models. Yet even the most powerful graphics processing units, such as Nvidia's Blackwell Ultra with roughly 300GB of onboard memory, can't keep up with inference demands for models like Meta's Llama, which can require nearly 500GB of memory for every running instance.
While training consumes vast amounts of compute, inference—the work of actually serving users—runs up against memory limitations. In a recent conversation with The Information Executive Editor Amir Efrati, Liran Zvibel, CEO of WEKA, an AI storage company powering many of the world’s leading frontier labs and AI clouds, said this “memory wall” is quickly becoming AI’s hidden bottleneck.
A Waste of GPUs
According to Zvibel, much of today’s GPU horsepower is being squandered. Infrastructure built for training is being repurposed for inference.
“When you’re training a model, you’re compute bound,” noted Zvibel. Inference is the opposite: It’s memory bound. Even the newest GPUs only have a few hundred gigabytes of very high-performance memory.
“When you’re looking at a 100,000-token window, which is not ridiculous for any of these modern models, it’s 50 gigabytes,” he said. Scale that to even a handful of concurrent users, and an AI agent can quickly exhaust the available memory.
“We are limiting the number of concurrent users we can run by the amount of memory we have on these machines. This is what we’ve nicknamed the AI memory wall,” Zvibel added.
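To make the arithmetic concrete, here is a rough back-of-the-envelope sketch, in Python, of how context length turns into key-value-cache memory. The model dimensions are illustrative assumptions rather than the specs of any particular model, but depending on layer count, attention layout and numeric precision, a 100,000-token window can easily land in the tens of gigabytes Zvibel describes.

```python
# Back-of-the-envelope estimate of KV-cache memory for a long context window.
# The model configurations below are illustrative assumptions, not real specs.

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens.

    Each token stores one key and one value vector (hence the factor of 2)
    per layer, per KV head, at `head_dim` elements each.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


CONTEXT = 100_000  # the window size Zvibel uses as an example

# Hypothetical large model with grouped-query attention: 80 layers, 8 KV heads.
gqa = kv_cache_bytes(CONTEXT, layers=80, kv_heads=8, head_dim=128)

# Hypothetical large model with full multi-head attention: 80 layers, 64 KV heads.
mha = kv_cache_bytes(CONTEXT, layers=80, kv_heads=64, head_dim=128)

for name, size in [("grouped-query attention", gqa), ("full attention", mha)]:
    print(f"{name}: {size / 1e9:.0f} GB of KV cache for {CONTEXT:,} tokens")
```

Set against a GPU with only a few hundred gigabytes of high-bandwidth memory, even the smaller of those figures leaves room for just a handful of concurrent long-context users.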
The result is familiar to anyone who’s waited for ChatGPT to respond, only to experience lagging outputs and rate limits. When AI model providers hit the memory wall, customers suffer the downstream effects.
“They’re not only wasting GPUs, they’re giving bad service to their end users,” he said.
The Coming Inference Crunch
If today’s models are already pushing memory to its limits, tomorrow’s will push even harder.
“Agentic AI is going to make it worse,” Zvibel said, warning that smarter models will demand longer context windows, more reasoning ability and more memory for verification. “Before the number of agents blows up, we have to rein in that problem.”
Increasing AI Infrastructure Efficiency Through Memory
Zvibel drew a sharp line between training and inference economics: Training spend is discretionary, but inference eventually has to pay for itself.
“With training, there’s no amount of spend that doesn’t make sense,” he said. “But inference has to correlate between the world’s population, which is their addressable market, and the resources you have.”
Some labs are pioneering efficiency strategies. Zvibel pointed to DeepSeek as one of the first to show “that you can win efficiency through memory” with optimizations like key-value caching and disaggregated prefill. Cohere, a WEKA customer running on CoreWeave, has demonstrated that the time it takes to warm up GPU servers for inference can be cut from 15 minutes to seconds.
“You can reduce time to first token by half and increase the number of concurrent tokens by four to five times,” Zvibel said.
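As a loose illustration of what winning efficiency through memory can look like, the sketch below reuses a cached key-value prefix across requests that share the same system prompt, so only each request's new suffix has to be prefilled. It is a conceptual toy under assumed names (`run_prefill`, `KVCache`), not WEKA's, DeepSeek's or Cohere's actual implementation.

```python
# Conceptual sketch of prefix (KV) caching: requests that share a prompt prefix
# reuse the cached keys/values instead of recomputing them. Names and the cost
# model are hypothetical; this is not any vendor's real serving stack.
from dataclasses import dataclass


@dataclass
class KVCache:
    prefix: str      # the prompt text this cache covers
    size_gb: float   # rough memory footprint (made-up cost model)


_prefix_store: dict[str, KVCache] = {}


def run_prefill(text: str) -> KVCache:
    """Stand-in for the expensive prefill pass of a real inference engine."""
    return KVCache(prefix=text, size_gb=len(text) * 1e-6)


def prefill_with_reuse(system_prompt: str, user_prompt: str):
    """Prefill only the user-specific suffix when the shared prefix is cached."""
    if system_prompt in _prefix_store:
        prefix_kv = _prefix_store[system_prompt]      # reuse: no recompute
    else:
        prefix_kv = run_prefill(system_prompt)        # pay the full cost once
        _prefix_store[system_prompt] = prefix_kv
    suffix_kv = run_prefill(user_prompt)              # only the new part is computed
    return prefix_kv, suffix_kv


if __name__ == "__main__":
    shared = "You are a helpful assistant. " * 100    # long shared system prompt
    prefill_with_reuse(shared, "Summarize my meeting notes.")
    prefill_with_reuse(shared, "Draft a follow-up email.")  # prefix cache hit
    print(f"cached shared prefixes: {len(_prefix_store)}")
```

The open engineering question is where that cache lives once it no longer fits in GPU memory, which is exactly the problem the memory wall framing points at.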
That efficiency could ease the burden on labs—and their balance sheets. Earlier this year, The Information reported that inference consumed nearly 60% of OpenAI’s revenue.
AI Infrastructure for the Inference Era
Given the staggering cost of building GPU-based clusters, a looming question is how long these chips remain useful. Zvibel believes older hardware will find a second life in inference workloads, provided infrastructure can be tuned to disaggregate compute tasks.
“The big labs will always want access to the latest and greatest GPUs so they can run training and win the race,” he said. “But they will keep the older stuff, which they can use for different parts of the flow.”
The most demanding part of inference, prefill, where attention is calculated over the full prompt, belongs on the strongest GPUs, while decoding can be offloaded to older-generation hardware.
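Below is a minimal sketch of what that disaggregated flow might look like in a scheduler: compute-heavy prefill requests go to a pool of the newest GPUs, while the memory-bound decode phase is handed off to a pool of older cards. The pool contents and request fields are hypothetical; a real disaggregated-serving system also has to move the KV cache between the two pools, which is where fast external memory and storage come in.

```python
# Minimal sketch of disaggregated inference scheduling: prefill on the newest
# GPUs, decode on older ones. GPU names and fields are illustrative only.
from collections import deque

PREFILL_POOL = deque(["new-gpu-0", "new-gpu-1"])                 # compute-heavy work
DECODE_POOL = deque(["older-gpu-0", "older-gpu-1", "older-gpu-2"])  # memory-bound work


def schedule(phase: str) -> str:
    """Round-robin a request phase onto the appropriate GPU pool."""
    pool = PREFILL_POOL if phase == "prefill" else DECODE_POOL
    gpu = pool.popleft()
    pool.append(gpu)
    return gpu


def serve(prompt: str) -> None:
    prefill_gpu = schedule("prefill")   # attention over the whole prompt
    decode_gpu = schedule("decode")     # token-by-token generation
    # In a real system the KV cache built during prefill would be shipped
    # from prefill_gpu to decode_gpu (or to shared external memory) here.
    print(f"{prompt!r}: prefill on {prefill_gpu}, decode on {decode_gpu}")


if __name__ == "__main__":
    for p in ["Summarize this contract", "Plan a product launch", "Explain the memory wall"]:
        serve(p)
```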
Scaling the memory wall isn’t just about better performance today—it’s about making AI infrastructure more economical, scalable and sustainable for the long haul.
As agentic adoption grows, the memory crisis will only become more acute without efficient infrastructure that can make AI truly cost-effective.
As Zvibel put it: “Unlike training, where you need to win on the outcomes, inference must win on economics.”