SDS 864: OpenAI’s o3-mini: SOTA reasoning and exponentially cheaper

Podcast Guest: Jon Krohn

February 21, 2025

Subscribe on Apple PodcastsSpotifyStitcher Radio or TuneIn

Jon Krohn investigates OpenAI’s latest release, o3-mini, in this five-minute Friday, where he walks through the reasoning model’s capabilities and performance, cross-examining them against other major-league players, DeepSeek-R1, GPT-4o and Claude 3.5 Sonnet. 

Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.
 
Reasoning models have become generative AI’s newest darlings, with OpenAI’s o1 and DeepSeek-R1 showing how much deeper models can work through queries. While a generative AI tool like Claude 3.5 Sonnet is well capable of returning answers to simple queries, more complex problems that would usually require several steps to solve do fare better with the new reasoning models. 
Nevertheless, we are also starting to see performance differences between these reasoning models. The model has been favourably compared against several benchmarks, including AIME Math, Codeforces, and SWE-Bench Verified. o3-mini is reported to outperform DeepSeek-R1, and its small size makes it relatively cheap to run, again undercutting DeepSeek-R1.
 
Listen to this five-minute Friday to get the details on how to access the proprietary o3-mini and start implementing it into your workflows today, as well as what to watch out for with OpenAI’s forthcoming release, o3. 

ITEMS MENTIONED IN THIS EPISODE

Podcast Transcript

Jon Krohn: 00:02

This is episode number 864 on OpenAI’s o3-mini. 
00:19
Welcome back to the SuperDataScience podcast. I am your host, Jon Krohn. At the time of recording, I’ve been completely crushed by a stomach infection. I’m heavily medicated right now to get through recording this episode, and so the show must go on. And so for today’s five-minute Friday style episode, I’m skipping the preamble and jumping straight to the meat of the episode. Today’s episode will fill you in on everything you need to know about an important model OpenAI recently released to the public called o3-mini. OpenAI’s o3-mini is a reasoning model like DeepSeek’s-R-1 model, which I detailed two weeks ago in episode number 860. And also, it’s a reasoning model like the original, super famous reasoning model o1, which made a huge splash when it was released by OpenAI back in September, in which I covered back in episode number 820. 
01:11
As a quick recap, reasoning models like o1, R1 and now o3 work through problems step-by-step in the background before outputting a response to your query. Compared to models like GPT-4.0 and Claude 3.5 Sonnet that immediately begin streaming their outputs, reasoning models are far more effective at the same kind of tasks that you might tackle step-by-step with pencil and papers such as math problems or challenging coding problems. There are two reasons why this new o3-mini reasoning model is such an important release. First, when left thinking, “thinking” long enough, o3-mini has three modes. So it has a low mode, a medium, and a high mode where high carries out the most inference, time compute. And when it’s left thinking long enough in that o3-mini high mode, o3-mini achieves state-of-the-art performance relative to any other publicly available model on a number of key challenging benchmarks, including the AIME Math Benchmark, the code forces coding benchmark, and the SWE-bench Verified benchmark that consists of challenging real world software engineering problems.
02:13
To be more explicit, this means that o3-mini high outperforms not only o1-mini, but also DeepSeek-R1 and even OpenAI’s much more expensive to run full size o1 model. Which brings me to the second reason why o3-mini is such an important release. Because o3-mini is relatively small, it’s way cheaper than o1 to run. While o1 costs $15 per million input tokens and $60 per million output tokens, o3-mini costs just 7% of that on both input and output. So you’re getting comparable or even better performance on challenging benchmarks with o3-mini relative to o1 and much lower cost. And note that o3-mini is about twice the cost to run relative DeepSeek-R1 on DeepSeek’s cloud infrastructure in China. But if you want to run R1 with a US cloud provider, o3-mini actually costs about half as much to run. 
03:09
So to recap all that, the key points are that o3-mini provides state-of-the-art performance on complex tasks that require step-by-step reasoning, all at bargain prices compared to the first generation of reasoning models. So how can you access this powerful new o3-mini model? Free tier users of ChatGPT can get a taste of o3-mini by selecting reason in the chat box when you make your query. And if you have a paid ChatGPT plan such as ChatGPT Plus, Team or Pro, you can access the o3-mini high model that spends the most time doing inference time computation, but also provides the state-of-the-art capabilities I’ve been touting throughout this episode. You can also use the ChatGPT API to embed o3-mini’s reasoning capabilities into any application your heart desires. I’ve got a link to instructions on how to use the API in the show notes. 
03:52
Depending on your exact application, you can experiment to determine whether o3-mini low, medium, or high is ideal for your use case, noting that of course, your compute time and financial costs will both go up if you opt for o3-mini medium and even more so if you go for o3-mini high. Ultimately, this o3-mini release isn’t as earth-shattering for me as the DeepSeek-R1 release was a few weeks ago because of how R1 is provided open source while o3-mini is completely proprietary. This means that you have even more flexibility with R1 to adapt it to your heart’s content and to use it on whatever infrastructure you desire. But OpenAI does have another card up their sleeve that will presumably be released to the public soon, and that’s quite exciting indeed, that’s o3. 
04:38
So this whole episode, I’ve been talking about o3-mini, but they are about to release… Presumably OpenAI, are about to release the full size o3 model, and that one, o3 has performance that absolutely crushes all other models available today, including DeepSeek-R1, and of course its predecessor, the full size OpenAI o1 model on all the complex and important reasoning benchmarks. So if you actually watch the video version of today’s episode, I’ve got some charts to show this big Delta for o3 relative to all other existing models today. This includes on the math on the AIME benchmark, which I mentioned earlier in this episode, but to go in a bit more detail, it stands for American Invitational Mathematics Examination, A-I-M-E. So yeah, on that benchmark, OpenAI o3 gets a score of 96.7, which is far better than DeepSeek-R1, which came in at 79.8 and was the next closest model other than actually o3-mini high, which came in at 87.3.

05:41
On coding like the CodeForces benchmark, again, o3 absolutely crushes all other existing models. It gets an ELO rating of over 2700 while the next closest models are o3-mini high with about 2100 and DeepSeek-R1 at about 2000. On the SWE-verified benchmark… So the SWE-bench Verified benchmark, Software Engineering Benchmark, I’ve got a link to more details on that benchmark in the show notes. Complex real world software engineering problems are handled in that benchmark. And again, the OpenAI o3 model absolutely crushes all other existing models. According, all of these things are not independently verified yet. These are all stats from OpenAI themselves, so maybe potentially a grain of salt there, but I think they’ve been pretty reliable with their historical releases on this kind of thing. Yeah, so on the SWE-bench Verified benchmark, again, OpenAI o3 gets almost 72… A score of almost 72, whereas the next best model is o3-mini high and DeepSeek-R1 and o1 come in at 49.

06:45
It’s a huge delta that would be very noticeable in a real world application. And then finally, there’s a fourth and final benchmark here, which is related to just being able to answer natural language questions in English. So this is the graduate level Google-Proof Q&A Benchmark, GPQA. And on this GPQA benchmark, it’s not as stark as on the math and programming benchmarks, but again, OpenAI o3 comes top with a score of almost 88, whereas the next best model is OpenAI o3-mini high at 80. So still, a big delta, especially as you get closer and closer to 100. So yeah, exciting things to come. Another week, another major breakthrough in AI capabilities. I hope your brain is tingling with ideas for how you can streamline your own activities as well as potentially build world-changing applications with increasingly powerful and exponentially less expensive AI models at your fingertips.

07:46
If not, try chatting with an LLM to get some ideas. All right, that’s it for today’s episode. If you enjoyed it or know someone who might, consider sharing this episode with them, leave a review of the show on your favorite podcasting platform. Tag me in a LinkedIn post with your thoughts, and if you aren’t already, be sure to subscribe to the show. But most importantly, I just hope you’ll keep on listening. Until next time, keep on rocking it out there, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon. 

Show All

Share on

Related Podcasts