Partner Content

The AI Memory Crisis: WEKA’s CEO on the Industry’s Hidden Bottleneck

Photo by Craig Warga and Jamie Watts
By The Information Partnerships

Analysts worried about an artificial intelligence bubble often point to the staggering cost of the computing power needed to run the latest models. Even the most powerful graphics processing units, like Nvidia’s Blackwell Ultra with roughly 300GB of onboard memory, can’t keep up with the inference demands of models like Meta’s Llama, which can require nearly 500GB of memory for every running instance.

While training consumes vast amounts of compute, inference—the work of actually serving users—runs up against memory limitations. In a recent conversation with The Information Executive Editor Amir Efrati, Liran Zvibel, CEO of WEKA, an AI storage company powering many of the world’s leading frontier labs and AI clouds, said this “memory wall” is quickly becoming AI’s hidden bottleneck.

A Waste of GPUs

According to Zvibel, much of today’s GPU horsepower is being squandered. Infrastructure built for training is being repurposed for inference.

“When you’re training a model, you’re compute bound,” noted Zvibel. Inference is the opposite: It’s memory bound. Even the newest GPUs only have a few hundred gigabytes of very high-performance memory.

“When you’re looking at a 100,000-token window, which is not ridiculous for any of these modern models, it’s 50 gigabytes,” he said. With each user’s context consuming tens of gigabytes, an AI agent serving even a handful of concurrent users can quickly exhaust a GPU’s available memory.

“We are limiting the number of concurrent users we can run by the amount of memory we have on these machines. This is what we’ve nicknamed the AI memory wall,” Zvibel added.
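For a sense of where that 50-gigabyte figure comes from, here is a minimal back-of-envelope sketch of key-value (KV) cache sizing. The model configuration is an assumption made for illustration (it roughly matches a published 405B-class architecture) and is not taken from the article.

```python
# Back-of-envelope KV-cache sizing behind the "memory wall."
# All model parameters below are illustrative assumptions, not figures from the article.

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one request's context."""
    # Factor of 2: both a key and a value are cached per layer, per head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed frontier-scale config: 126 layers, 8 grouped-query KV heads,
# 128-dimensional heads, 16-bit values.
per_request = kv_cache_bytes(tokens=100_000, layers=126, kv_heads=8, head_dim=128)
print(f"KV cache for one 100,000-token context: {per_request / 1e9:.0f} GB")  # ~52 GB
```

Because the cache grows linearly with both context length and the number of concurrent users, a GPU with a few hundred gigabytes of memory, much of it already occupied by the model weights, can hold only a handful of such contexts at once.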

The result is familiar to anyone who’s waited for ChatGPT to respond, only to experience lagging outputs and rate limits. When AI model providers hit the memory wall, customers suffer the downstream effects.

“They’re not only wasting GPUs, they’re giving bad service to their end users,” he said.

The Coming Inference Crunch

If today’s models are already pushing memory to its limits, tomorrow’s will push even harder.

“Agentic AI is going to make it worse,” Zvibel said, warning that smarter models will demand longer context windows, more reasoning ability and more memory for verification. “Before the number of agents blows up, we have to rein in that problem.”

Increasing AI Infrastructure Efficiency Through Memory

Zvibel drew a sharp line between training and inference economics: Training spend is discretionary, but inference eventually has to pay for itself.

“With training, there’s no amount of spend that doesn’t make sense,” he said. “But inference has to correlate between the world’s population, which is their addressable market, and the resources you have.”

Some labs are pioneering efficiency strategies. Zvibel pointed to DeepSeek as one of the first to show “that you can win efficiency through memory” with optimizations like key-value caching and disaggregated prefill. Cohere, a WEKA customer running on CoreWeave, has demonstrated that the time needed to warm up GPU servers for inference can be cut from 15 minutes to seconds.

“You can reduce time to first token by half and increase the number of concurrent tokens by four to five times,” Zvibel said.
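As a rough illustration of the key-value caching idea named above, the toy sketch below reuses cached keys and values for a shared prompt prefix instead of recomputing prefill. Everything in it is a simplified stand-in, not any lab’s actual implementation.

```python
import hashlib

# Toy illustration of prompt-level key-value (KV) caching.
# Real systems cache tensors in GPU memory or fast storage, not strings in a dict.

kv_store: dict[str, str] = {}   # prompt-prefix hash -> cached "KV tensors"

def prefill(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in kv_store:
        # Cache hit: skip recomputing attention over the shared prefix,
        # which is what cuts time to first token.
        print("cache hit: reusing prefill work")
        return kv_store[key]
    print("cache miss: running full prefill")
    kv_store[key] = f"<KV tensors for {len(prompt)}-char prefix>"
    return kv_store[key]

shared_prefix = "You are a helpful assistant. " * 100   # long shared system prompt
prefill(shared_prefix)   # the first request pays the full prefill cost
prefill(shared_prefix)   # later requests reuse the cached keys and values
```

Disaggregated prefill, the other optimization Zvibel cites, takes the complementary approach of running prefill and decoding on separate hardware, as sketched in the next section.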

That efficiency could ease the burden on labs—and their balance sheets. Earlier this year, The Information reported that inference consumed nearly 60% of OpenAI’s revenue.

AI Infrastructure for the Inference Era

Given the staggering cost of building GPU-based clusters, a looming question is how long these chips remain useful. Zvibel believes older hardware will find a second life in inference workloads, provided infrastructure can be tuned to disaggregate compute tasks.

“The big labs will always want access to the latest and greatest GPUs so they can run training and win the race,” he said. “But they will keep the older stuff, which they can use for different parts of the flow.”

The hardest part of inference, prefill, in which attention is calculated over the entire prompt, belongs on the strongest GPUs, while decoding can be offloaded to older hardware.
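Here is a minimal sketch of what that split could look like in a serving scheduler, assuming a simple two-pool setup. The pool names, request fields and per-token cache size are hypothetical and purely illustrative, not WEKA’s or any lab’s actual system.

```python
from dataclasses import dataclass, field
from collections import deque

# Illustrative-only sketch of disaggregated prefill/decode scheduling.
# Pool names, request fields and sizes are assumptions, not from the article.

@dataclass
class Request:
    request_id: str
    prompt_tokens: int          # long prompts make prefill compute-heavy
    kv_cache_gb: float = 0.0    # filled in once prefill has built the cache

@dataclass
class GpuPool:
    name: str
    queue: deque = field(default_factory=deque)

    def submit(self, req: Request) -> None:
        self.queue.append(req)
        print(f"{req.request_id}: routed to {self.name} pool")

# Prefill (attention over the whole prompt) goes to the newest accelerators;
# decode (one token at a time, memory-bound) goes to older ones.
prefill_pool = GpuPool("latest-generation")
decode_pool = GpuPool("older-generation")

def serve(req: Request) -> None:
    prefill_pool.submit(req)                        # 1. build the KV cache on a fast GPU
    req.kv_cache_gb = req.prompt_tokens * 0.0005    # assumed ~0.5 MB per token
    decode_pool.submit(req)                         # 2. stream tokens from cheaper hardware

serve(Request("chat-001", prompt_tokens=100_000))
```

In a real deployment, the KV cache built during prefill has to be handed off to the decode pool, which is exactly where the memory capacity and bandwidth constraints Zvibel describes come back into play.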

Scaling the memory wall isn’t just about better performance today—it’s about making AI infrastructure more economical, scalable and sustainable for the long haul.

As agentic adoption grows, the memory crisis will only become more acute without efficient infrastructure that can make AI truly cost-effective.

As Zvibel put it: “Unlike training, where you need to win on the outcomes, inference must win on economics.”
