OpenAI Discovers New Way to Cut Inference Costs in Half

Art by Clark Miller.

We closely track efforts by Anthropic, Google and OpenAI to get access to more server chips to run their models. But we don’t talk enough about the work these companies are doing to get more juice from the servers they already have.

In one previously unreported example, OpenAI engineers earlier this month told some colleagues they had figured out a way to more than halve the cost of inference, or running existing models, thanks to some newly-discovered optimizations, according to a person with knowledge of those discussions.

When the engineers applied the new techniques to power ChatGPT for visitors who didn’t have a free or paid account, it reduced the number of Nvidia graphics processing units needed at one point to just a couple hundred—a shockingly small number. (That said, OpenAI likely doesn’t get much ChatGPT usage from such users, as the company limits how much they can use the chatbot that way.)

It isn’t clear what OpenAI did to get its latest efficiency gains, which might include techniques such as quantization; key value-caching, or helping the model remember information from prior calculations it made so it doesn’t need to repeat the work; sending queries to be answered in batches rather than one by one; and routing some queries to models or parts of models that require less power to answer them.