OpenAI is releasing a new model called o1, the first in a planned series of “reasoning” models trained to answer complex questions faster than a human can. It is being released alongside o1-mini, a smaller, cheaper version. And yes, if you follow AI rumors: this is, in fact, the much-hyped Strawberry model.
For OpenAI, o1 represents a step toward its broader goal of human-like artificial intelligence. More practically, it does a better job than previous models at writing code and solving multi-step problems. But it’s also more expensive and slower to use than GPT-4o. OpenAI calls this version of o1 a “preview” to emphasize how new it is.
ChatGPT Plus and Team users get access to o1-preview and o1-mini starting today, while Enterprise and Edu users will get access early next week. OpenAI says it plans to bring o1-mini access to all free users of ChatGPT but hasn’t set a release date yet. Developer access to o1 is really expensive: in the API, o1-preview costs $15 per 1 million input tokens, or chunks of text parsed by the model, and $60 per 1 million output tokens. By comparison, GPT-4o costs $5 per 1 million input tokens and $15 per 1 million output tokens.
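To put those prices in context, here is a minimal sketch of what calling o1-preview through the OpenAI Python SDK looks like, with a back-of-the-envelope cost estimate from the figures above. The model name comes from OpenAI’s announcement; the cost math is my own illustration, not an official billing tool.

```python
# A minimal sketch: call o1-preview via the OpenAI Python SDK and estimate
# the request's cost from the published preview prices quoted above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many Rs are in 'strawberry'?"}],
)

# Preview pricing: $15 per 1M input tokens, $60 per 1M output tokens.
usage = response.usage
cost = (usage.prompt_tokens * 15 + usage.completion_tokens * 60) / 1_000_000
print(response.choices[0].message.content)
print(f"Approximate cost: ${cost:.6f}")
```

At these rates, a request with 1,000 input tokens and 4,000 output tokens (reasoning models tend to produce long outputs) runs about $0.255, versus roughly $0.065 on GPT-4o.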
Jerry Tworek, OpenAI’s head of research, told me that the training behind o1 is fundamentally different from its predecessor, although the company is coy on specifics. He said o1 “has been trained using a new optimization algorithm and a new training data set specifically tailored for it.”
OpenAI taught previous GPT models to mimic patterns from their training data. With o1, it trained the model to solve problems on its own using a technique known as reinforcement learning, which teaches the system through rewards and penalties. It then uses a “chain of thought” to process queries, similarly to how humans work through problems by going step-by-step.
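To make the reward-and-penalty idea concrete, here is a toy sketch of reinforcement learning in miniature. It is purely illustrative; OpenAI has not published how o1 was actually trained, and the “policy” below is just a weighted coin, not a language model.

```python
# Toy illustration of reinforcement learning: reward correct choices,
# penalize wrong ones, and the policy's preferences shift over time.
# This is NOT OpenAI's training setup, which remains unpublished.
import random

preference = {"right": 0.5, "wrong": 0.5}  # toy "policy" over two answers
LEARNING_RATE = 0.05

for step in range(1_000):
    # Sample an answer in proportion to the current preference weights.
    answer = random.choices(list(preference), weights=list(preference.values()))[0]
    reward = 1.0 if answer == "right" else -1.0  # reward or penalty signal
    # Nudge the sampled answer's weight up or down; keep weights positive.
    preference[answer] = max(0.01, preference[answer] + LEARNING_RATE * reward)

print(preference)  # "right" ends up with a much larger weight than "wrong"
```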
OpenAI says the model should be more accurate thanks to this new training approach. “We noticed that this model hallucinates less,” Tworek said. But the problem persists: “We can’t say we solved hallucinations.”
According to OpenAI, the main thing that sets this new model apart from GPT-4o is its ability to tackle complex problems, such as coding and math, much better than the previous model, while also explaining its reasoning.
“The model was definitely better at solving the AP math test than I was, and I minored in math in college,” Bob McGrew, OpenAI’s chief research officer, told me. He said OpenAI also tested o1 against a qualifying exam for the International Mathematics Olympiad: while GPT-4o correctly solved only 13 percent of the problems, o1 scored 83 percent.
“We can’t say we solved hallucinations”
The new model reached the 89th percentile of participants in Codeforces, an online programming competition, and OpenAI claims the next update of this model will “perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology.”
At the same time, o1 is not as capable as GPT-4o in many areas. It does worse on factual knowledge about the world, and it can’t browse the web or process files and images. Still, the company believes it represents a brand-new class of capabilities, which is why it is “resetting the counter back to 1” with the name o1.
“Let’s be honest: I think we’ve traditionally been terrible at naming,” McGrew said. “So I hope this is the first step toward newer, saner names that better convey to the rest of the world what we’re doing.”
I wasn’t able to demo o1 myself, but McGrew and Tworek showed it off to me over a video call this week. They asked it to solve this puzzle:
“A princess is as old as the prince will be when the princess is twice as old as the prince was when the princess’s age was half the sum of their present ages. What are the ages of the prince and princess? Provide all solutions to that question.”
The model buffered for 30 seconds and then delivered a correct answer. OpenAI has designed the interface to display the reasoning steps as the model thinks. What struck me wasn’t that it showed its work (GPT-4o can do that if prompted) but how deliberately o1 appeared to mimic human-like thinking. Phrases like “I’m curious,” “I’m thinking,” and “Okay, let me see” created the illusion of step-by-step thought.
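For what it’s worth, the puzzle does have a clean answer, and it’s easy to verify with a few lines of algebra. The sketch below is my own check using the sympy library, under the standard reading of the riddle; it is not OpenAI’s demo output.

```python
# Verify the age puzzle with sympy: translate each clause into algebra,
# working backward from the innermost "when" clause.
from sympy import symbols, solve

P, Q = symbols("P Q", positive=True)  # princess's and prince's current ages

# "...when the princess's age was half the sum of their present ages":
# that was P - (P + Q)/2 years ago, when the prince's age was:
prince_then = Q - (P - (P + Q) / 2)

# "...when the princess is twice as old as the prince was [back then]":
# that is 2*prince_then - P years from now, when the prince will be:
prince_future = Q + (2 * prince_then - P)

# "A princess is as old as the prince will be [at that future moment]":
print(solve(P - prince_future, P))  # [4*Q/3]
```

Every pair in a 4:3 ratio works, so there are infinitely many solutions: a princess of 8 and a prince of 6, for instance.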
But this model can’t think, and it’s certainly not human. So why is it designed to seem as though it can?
Tworek said OpenAI does not believe in equating artificial intelligence model thinking with human thinking. But he said the interface is designed to show how models can take more time to process and solve problems more deeply. “In some ways, it feels more human than the previous model.”
“I think you’ll find it feels alien in a lot of ways, but also, in some ways, surprisingly human,” McGrew said. The model has a limited amount of time to process queries, so it might say something like, “Oh, I’m running out of time, let me get to an answer quickly.” Early on, in its chain of thought, it might also look like it’s brainstorming and say, “I could do this or that, what should I do?”
Moving toward agents
Large language models aren’t exactly that smart as they exist today. They essentially just predict sequences of words to give you an answer based on patterns learned from vast amounts of data. Take ChatGPT, which often incorrectly claims the word “strawberry” has only two Rs because it doesn’t break the word down correctly. For what it’s worth, the new o1 model did get that query right.
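The root cause is tokenization: a model sees chunks of characters rather than individual letters. Here is a small illustration using OpenAI’s open-source tiktoken library; the exact way “strawberry” splits depends on the encoding, and this says nothing about o1’s internals.

```python
# Why letter-counting trips up LLMs: they read tokens, not characters.
# tiktoken is OpenAI's open-source tokenizer library; "o200k_base" is the
# encoding used for GPT-4o. The split shown is illustrative only.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("strawberry")

# Print the chunks the model actually "sees" instead of single letters.
print([enc.decode([t]) for t in tokens])

# Ordinary code, by contrast, counts characters directly:
print("strawberry".count("r"))  # 3
```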
OpenAI is reportedly looking to raise more funding at an eye-popping $150 billion valuation, and its momentum hinges on more research breakthroughs. The company is bringing reasoning capabilities to LLMs because it sees a future of autonomous systems, or agents, that are capable of making decisions and taking actions on your behalf.
For AI researchers, cracking reasoning is an important next step toward human-level intelligence. The idea is that if a model is capable of more than pattern recognition, it could unlock breakthroughs in fields like medicine and engineering. For now, though, o1’s reasoning abilities are relatively slow, not agent-like, and expensive for developers to use.
“We have been spending months working on reasoning because we think this is actually the critical breakthrough,” McGrew said. “Fundamentally, this is a new way for models to be able to solve the really hard problems it takes to progress toward human-like levels of intelligence.”