27 minutes
SDS 820: OpenAI’s o1 “Strawberry” Models
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
Jon Krohn takes OpenAI’s new models for a spin in this Five-Minute Friday, learning their key strengths and limitations, and how the o1 series may represent yet another landmark for generative AI.
Codenamed “strawberry”, OpenAI’s o1-preview and o1-mini approximate the mode of thinking expounded on by the late Nobel laureate Daniel Kahneman in his book, Thinking, Fast and Slow. Through reinforcement learning, OpenAI’s o1 models are trained to ‘think’ and apply a ‘chain of thought’ when solving a given problem. This multi-step approach involves the models exploring, experimenting, reasoning and redrafting, all of which take time and therefore represent a move away from current open-access LLMs, which often sacrifice accuracy for speed and intuitiveness.
Among other things, the o1 series presents us with a conundrum of productivity. The OpenAI research team is looking into future improvements that think for much longer – hours, days, even weeks – before generating a response. In a commercially minded industry that aims for speed and accuracy to improve the bottom line, these models postulate a very different thesis: whether human or machine, time, and lots of it, is still absolutely critical to obtaining the results we need.
Of course, the benefits of thinking slowly will be most evident for problems that have already demanded considerable human brain power; o1’s slow-thinking approach is better applied to finding a cure for cancer than to writing that awkward email to your boss. For the latter, you still have LLMs like Claude, Gemini and ChatGPT to hold your hand. But the emergence of o1 will nevertheless come as good news to those of us who work to answer some of science’s toughest questions.
Listen to the mechanics and scaling capabilities behind OpenAI’s development of its o1 models, and Jon Krohn’s personal experience of the model in this Five-Minute Friday.
DID YOU ENJOY THE PODCAST?
- What task would you be curious to see OpenAI’s o1 models complete?
- Download The Transcript
Podcast Transcript
(00:02):
This is Episode Number 820 on OpenAI's o1 model codenamed Strawberry.
(00:19):
Welcome back to the Super Data Science podcast. I'm your host, Jon Krohn. Let's start off with a couple of Apple Podcasts reviews. Our first one today is a 5-star review from Apple Podcasts and it's by Dr. Wade Ashby. He is the Department Chair of Computer Information Systems at Howard Payne University in Texas. Dr. Ashby says that this podcast has allowed him to keep current in many areas and also reminds him to provide the practical to his students. He had a lot of specific comments about Julia Silge's recent episode on the show. And he says, "Again, thank you for making the complex simple and I hope I can do the same for my students." Thank you, Dr. Ashby. And I hope so too. Glad to provide some inspiration.
(00:59):
Our second Apple Podcasts review today comes from someone with the username, D321P. They say that they are a loyal Last Week in AI listener and that they heard me as a guest on that show and figured they should check out some Super Data Science episodes and that they're glad they did. Because apparently, we have great content on the show and now form part of D321P's listening on a weekly basis. Cool. Glad you found us and I do love Jeremie and Andrey over at the Last Week in AI podcast. Hopefully, I'll be back on there again soon.
(01:35):
Thanks for all the recent ratings and feedback on Apple Podcasts, Spotify, and all the other podcasting platforms out there as well as for the likes and comments on our YouTube videos. Apple Podcasts reviews are especially helpful to us because they allow you to leave written feedback if you want to and I keep a close eye on those. So if you leave one, I'll be sure to read it on air like I did today.
(01:54):
All right. Onto the meat of today's episode which is a big one. I mean, given the gravity of the event, today's episode could be on nothing other than OpenAI's new o1 series of models which represent a tremendous leap forward in AI capabilities. So far, OpenAI have released o1-preview and o1-mini from a series of o1 models that they've been developing. Unless otherwise stated in this episode, I'm going to be talking about o1-preview which is now, in my view, unquestionably, the state of the art in terms of any publicly available AI model. o1-mini, on the other hand, is a smaller and, therefore, 80% cheaper to run model that was trained on the same o1 protocols. The same training methods.
(02:39):
In a nutshell, and as detailed last year in Episode Number 740 on OpenAI's Q* Project, which was later renamed Strawberry, the o1 large language model was trained with reinforcement learning to "think before responding". And it does this "thinking" using a private "chain of thought". Working through problems slowly and carefully like this is analogous to the slow System 2 thinking popularized by the psychologist and Nobel Prize winner Daniel Kahneman in his book, Thinking, Fast and Slow.
(03:13):
Previously, all of the top public LLMs executed solely in a mode more like human System 1 thinking, which is faster, but more like intuition, as when you speak without careful consideration. Slow System 2 thinking, like when you work through a challenging math problem step by step with pencil and paper, allows OpenAI's new o1 model to iteratively refine its outputs, try out different strategies, and even recognize and correct its own mistakes. So in a nutshell, again, all the previous LLMs that we had were like our intuitive, fast System 1 thinking, where you just blurt answers out. And now, with o1, for the first time, we have a big publicly available model that carefully works through problems and iterates, more like when you're thinking through a math problem.
(04:05):
What's crazy about this is that the longer that the o1 model "thinks"... I'll just call it thinking without the quote-unquote from now on. So the longer that the o1 model thinks, the better it does on complex tasks. As pointed out by Dr. Noam Brown, one of OpenAI's o1 researchers and our guest in Episode Number 569 of this show, in an excellent tweet thread that I've got for you in the show notes, this provides a whole new dimension for scaling AI models.
(04:35):
Previously, we could scale by increasing the amount of high quality training data, increasing the number of training parameters in the model, or increasing training time. Now, thinking time during inference can be scaled up too. So not just data, not just training parameters in your model, not just training time. But now, your inference time compute, how long you spend thinking about a problem, how long the algorithm thinks about the problem can be scaled up.
(05:02):
This means that while the o1 model currently available thinks only on the scale of seconds when it's preparing a response for you, the research team at OpenAI is aiming for future versions that think for hours, days, or even weeks before generating a response. This dimension of scaling up inference time will, of course, scale up the cost of inference. But for high impact outcomes, a new cancer drug, a breakthrough on nuclear fusion, mathematical proofs that humans haven't been able to crack, that higher inference cost would be well worth it.
(05:37):
With implications for the singularity, these longer thinking times will also surely lead to AI models like o1 contributing to the development of even better AI models, creating a positive feedback loop that could accelerate shockingly rapidly toward the singularity. All right. So that's thinking ahead a little bit, though maybe not too far into the future, and it gives us a glimpse into a possibly not-so-distant singularity.
(06:04):
But in terms of o1's capabilities this very day, it's critical to note that o1 doesn't always perform better than the other leading LLMs like OpenAI's GPT-4, Anthropic's Claude 3.5 Sonnet, or Google Gemini. This is because for tasks like chat conversations, composing an email, or editing a paper, those kinds of tasks don't require slow System 2 thinking. They can be done with fast, intuitive System 1 thinking.
(06:31):
Instead, where o1 excels is on the same kinds of tasks that you need to spend time deliberating on before blurting out a response. Such as when you're writing intricate computer code, performing data analysis, or solving math problems. So it's on these kinds of complex tasks where o1 is comparatively unreal, like way better than anything else out there. It's a really big deal.
(06:54):
On the usual benchmarks that we've seen in recent years, like MMLU and high school advanced placement exams, o1 offers improvements, for sure. But on MMLU categories like math and logic as well as on AP exam subjects like physics and calculus, the improvement of o1 over GPT-4o is huge. Perhaps, most mind-blowingly of all, according to OpenAI's own evaluations, o1 performs comparably to PhD level students on specific questions in physics, chemistry, and biology.
(07:24):
As a striking demonstration of what's to come through further scaling, OpenAI also teased us with preliminary results from an o1 model that is still in development. In competitive programming, this in-development o1 model ranked in the 89th percentile on Codeforces questions. Codeforces is a popular competitive programming platform.
(07:54):
The o1-preview model, the one you have access to today, ranked in the 62nd percentile. So 89th versus 62nd, that's a pretty big difference. And both of those are way better than GPT-4o, which was the state of the art, or close to it, until this release: GPT-4o ranked in just the 11th percentile. So, 89th percentile for the in-development o1 model, 62nd for the preview model we have today, and just 11th for GPT-4o.
(08:20):
Similarly, on a qualifying exam for the International Mathematics Olympiad, OpenAI reports that the forthcoming o1 model scored 83%. The publicly available o1-preview model scored 62%. And the once mighty, now suddenly humble-looking GPT-4o scored only 13%. Now, as always, it is, of course, important to approach these claims with a healthy dose of skepticism, because AI benchmarks, as I say on the show all the time, can be unreliable and pretty easy to game.
(08:52):
In this particular case, however, on many complex tasks that I tested personally, the delta between o1 and any other text generating model available today is so vast, that I am confident evangelizing to you that o1 is a serious game changer. The difference in capabilities is night and day, just like the jump from GPT-3.5 to GPT-4 was last year.
(09:18):
One interesting demonstration of o1's abilities, for example, is its capacity to count the number of R's in the word strawberry, which I suspect is the namesake of this whole Strawberry o1 project. That task might sound really easy, but it has actually stumped previous language models because of the way words are tokenized into subwords before they go into a model, meaning character-level information is typically not available to large language models.
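That character-counting task, by the way, is trivial for ordinary code, precisely because code operates on raw characters rather than on the subword tokens an LLM sees. A minimal sketch (the tokenization shown is illustrative, not any specific model's actual segmentation):

```python
# Counting characters is easy when you can see the raw string.
word = "strawberry"
r_count = word.count("r")
print(r_count)  # 3

# A rough sketch of why this stumps subword-tokenized LLMs: a tokenizer
# might split the word into pieces like ["straw", "berry"] (a purely
# illustrative segmentation), so the model receives two opaque token IDs
# rather than ten individual characters.
pieces = ["straw", "berry"]
assert "".join(pieces) == word  # the pieces reassemble the word,
                                # but character counts are hidden per piece
```
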
(09:55):
And so, this might seem trivial, but it actually represents a significant advancement in the model's ability to process and understand language. And if you are watching the YouTube version now of this podcast episode, then you can actually now see my screen. I'm going to have a few examples of different kinds of tasks that I tested with the ChatGPT o1-preview model.
(10:24):
So the very first one is the test I just mentioned: I asked how many letter R's there are in the word strawberry. The algorithm thought for six seconds and had a couple of intermediate steps. That's a cool thing about o1-preview: it has a model that generates summaries of what the algorithm is doing while it's "thinking".
(10:52):
So for this one, the how-many-letter-R's-in-strawberry query, there were only two steps that got summarized, and in total the whole process took six seconds. In the end, it correctly outputted that there are three letter R's in the word strawberry. As a second example, one where I didn't expect the model to necessarily perform much better than GPT-4o or Claude 3.5 Sonnet from Anthropic, I asked a plain reasoning question: "Is a hot dog a sandwich?" There's lots of information on the web about whether a hot dog is a sandwich or not, and all of that can be stored in the weights of a more traditional LLM that doesn't have the reinforcement learning training and slow System 2 thinking that o1 has.
(11:56):
But nevertheless, it did an interesting thing where it thought for seven seconds and broke down its consideration of the debate into five steps over those seven seconds. It analyzed the debate. It examined the matter in detail. It assessed the definitions. It navigated tax-regulation issues. And then, it looked at cultural distinctions. Ultimately, when it did generate an output, it provided a lot of information, taking into account all of this analysis in a bunch of different respects. In total, it took seven seconds. So I thought that was interesting, even if that kind of output to "Is a hot dog a sandwich?" probably isn't going to be that mind-blowing of a result relative to, say, GPT-4o.
(12:46):
Now, where things did start to get interesting, and my mind was seriously blown, was when I started copying and pasting questions from my teaching materials. Specifically, I chose exercises from my Calculus for Machine Learning curriculum, which I have online and people can check out if they want to. The point is that there are a lot of exercises in there and I wanted to give it some relatively challenging ones.
(13:17):
So from my section on higher-order partial derivatives, I copied and pasted a specific exercise into the ChatGPT interface with the o1-preview model. The specific question was to find all the second-order partial derivatives of z = x³ + 2xy. Not necessarily the most sophisticated partial derivative question, but nevertheless a non-trivial one. I did convert the mathematical equation into LaTeX to guarantee that o1-preview would interpret it correctly as opposed to having to guess. Writing it in LaTeX format like that makes it easy for me to describe the cube, for example, in a way that I know the algorithm will understand.
(14:14):
And in this case, o1-preview thought for 12 seconds. And yeah, again, it broke down that process into four steps for me. And within each of the four steps, it even provided an explanation with mathematical notation. And it's beautiful, the way that it printed it all out. And then, it gave me exactly the correct answers in exactly the same way that I do it in my solutions to the exercises in the materials. It even went on to explain some related theorems in its explanation and why some of the results for some of the partial derivatives are the same as others. It was really well-presented. Really enjoyed it. It was outstanding. A+ work. If a student had done this, I would've been absolutely blown away.
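For reference, the full solution to that exercise is short enough to write out by hand. Differentiating z = x³ + 2xy once in each variable, and then once more, gives:

```latex
\frac{\partial z}{\partial x} = 3x^2 + 2y, \qquad
\frac{\partial z}{\partial y} = 2x
\\[6pt]
\frac{\partial^2 z}{\partial x^2} = 6x, \qquad
\frac{\partial^2 z}{\partial y^2} = 0, \qquad
\frac{\partial^2 z}{\partial y\,\partial x}
  = \frac{\partial^2 z}{\partial x\,\partial y} = 2
```

The equality of the two mixed partials is an instance of the symmetry of second derivatives (Clairaut's/Schwarz's theorem), which is presumably the kind of related theorem the model cited when explaining why some of the results match.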
(15:05):
I grabbed another question from my materials, this time from the definite integrals section of my Calculus for Machine Learning curriculum. In this case, the exercise, again not super complicated, was calculating the definite integral of a very simple function, just 2x, over the range of three to four. But nevertheless, when I asked o1 to respond, it did an excellent job of breaking down the steps, and it absolutely got the correct answer after it had done that "thinking" to work through the problem. I was super impressed. And again, if a student had provided this answer to me, I would've been absolutely stunned. It would've been an A+ answer for sure.
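For anyone sanity-checking that exercise themselves: the antiderivative of 2x is x², so the definite integral from 3 to 4 is 16 - 9 = 7. A quick numerical check (a minimal sketch, not code from the episode) agrees:

```python
# Numerically verify the definite integral of f(x) = 2x over [3, 4].
# Analytic answer: x^2 evaluated from 3 to 4, i.e. 16 - 9 = 7.
def riemann_sum(f, a, b, n=100_000):
    """Midpoint Riemann sum of f over [a, b] with n subintervals."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

approx = riemann_sum(lambda x: 2 * x, 3, 4)
print(approx)  # ~7.0
assert abs(approx - 7) < 1e-6
```

The midpoint rule happens to be exact for linear integrands, so the numerical result matches the analytic answer to within floating-point error.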
(15:55):
Now, in contrast to those relatively simple, introductory calculus questions, the kind you'd see in a first- or second-year university calculus course, I also asked o1-preview a very complex question about calculating a minimum sphere radius. I'm not going to read the whole question, because there's a lot of mathematical notation, but it concerns a set of rectangular boxes with a certain surface area and volume, where we let r be the radius of the smallest sphere that can contain each of the rectangular boxes.
(16:34):
The question then asks the algorithm to find specific variables p and q, which are prime, positive integers. So there are a lot of constraints in this problem; it's very complex. Frankly, it's not a mathematical problem that I would have any clue how to solve. For this one, o1-preview took 95 seconds to respond, the longest of any question I asked it. It broke the problem down into many steps, tried exploring different avenues, and kept going back and trying other approaches. Eventually, after several dozen separate steps over those 95 seconds, it generated an output that was very easy to read and quite compelling. And as far as I can tell, it is the correct answer. I'm not 100% sure, but it does look correct; the math seems to work out.
(17:39):
And actually, by showing me the solution, it helped me understand what the question was even asking. That was stunning, really mind-blowing for me. Then, I also wanted to stump it with a challenging programming question. With other large language models like GPT-4o, I might say something like, "Create a template for a convolutional neural network," or "a transformer architecture in PyTorch," and I know that even for GPT-4, that's going to be a really easy question. It saves me a lot of time as a human, but it isn't a very complex question to ask one of these algorithms.
(18:14):
So what I did in this case was ask the algorithm to write HTML that provides an interactive visualization. And I made this pretty complicated. It's a graph representation of a dense neural network with four layers of neurons, where I specify how many neurons should be in each layer. I indicate that arrows should show the flow of information from the input layer to the output layer, and that it should generate random near-zero values, with precision of just a couple of digits after the decimal point, to show up as the network's weights and biases.
(18:52):
This HTML code is interactive. When I hover over a node in the graph, it should reveal the bias value that was randomly generated for that neuron. When I hover over an edge, it should reveal the weight for that connection. So, pretty complicated. The algorithm thought for 39 seconds, broke the task down into a couple dozen steps over those 39 seconds, and then ultimately output some HTML code that I copied and pasted into my browser, and it absolutely worked.
(19:21):
So when I hover over the edges, I see weights. When I hover over the nodes, I see biases. And the network had exactly the structure I asked for: the dense neural network connections and the exact number of neurons per layer that I specified. The only issue was that it didn't have actual arrows; the edges weren't arrows flowing from left to right. So I went back into that same chat session and said, "This is great. However, the edges aren't directed arrows like I requested." o1 thought for 12 seconds, again over a number of steps, trying to figure out what was wrong. There was some kind of issue related to elements being generated through CSS as opposed to HTML. It tried to correct it and explained why it should now work.
(20:14):
I copied and pasted the corrected code. I don't understand HTML or CSS very well at all, but it didn't seem to actually fix the issue. Even after that, there still weren't arrows, even though it thought there should be. I probably could have continued to press, but it was a satisfying enough exercise for me. So, of course, this o1 model isn't perfect, but it stunned me. It really impressed me in a lot of ways. And as I said earlier, I really do believe that this is the state of the art available today.
(20:50):
Given these terrific capabilities, there's also the risk of o1 being terrifying if it were in the wrong hands. So on the safety front, OpenAI claims to have developed a new training approach that leverages the model's reasoning capabilities to better adhere to safety and alignment guidelines. Actually, that makes a lot of sense to me. Because if you think about how well it's able to review its own intermediate steps for errors, you could also leverage that sensibly, it seems to me, for safety and alignment.
(21:22):
So OpenAI reports significant improvements as a result in the model's ability to resist, say, jailbreaking attempts. For example, while GPT-4o scored 22 out of 100 on a particularly stringent jailbreaking test that OpenAI says they have, o1, in contrast, scored 84 out of 100. So that's a big difference, one that ostensibly flips these tricky jailbreaks from being usually possible to being relatively rare.
(21:51):
So yeah, I think it's great that labs like this are at least taking safety into consideration. I know there's a lot of controversy around whether OpenAI could be more careful. People have left to form Anthropic, and more recently, Ilya Sutskever left to create Safe Superintelligence, because they think OpenAI isn't putting enough safeguards in place. Anyway, I won't wade into that too deeply. This is more of a capabilities episode today.
(22:23):
But one final note, one caveat here, on the topic of machine consciousness. Consciousness always seems to become a point of discussion whenever a markedly more capable AI model is released, like this o1 release, and I think it's crucial to maintain some perspective here. While I used anthropomorphizing language in this episode, and lots of people in the industry do, don't forget that these AI systems don't actually think or reason.
(22:53):
The underlying computational mechanisms are the same as those in your calculator or the spell checker on your computer. We're simply figuring out how to leverage these non-conscious computational processes in increasingly nuanced and powerful ways. All right. So if you want to access o1-preview today, you can: go into ChatGPT Plus like I did when preparing for this episode. And if you're watching the video version, you actually even saw my screen share.
(23:26):
And so, I was just in ChatGPT conversing with it to create the kinds of examples that I shared today. If you use ChatGPT Plus or ChatGPT Team, which are the point-and-click user interfaces that OpenAI provides, access was relatively limited at first. But just a couple of days before this episode was published, OpenAI expanded access to 50 messages per week for o1-preview and 50 messages per day for o1-mini.
(23:56):
50 messages per day is quite a few, but 50 messages per week, you could burn through those pretty quickly with o1-preview. Still, that's a big increase from just a few days earlier, so you can expect that capacity for this model will continue to become more and more available. And in addition to the user interface, if you're a Tier 5 or trusted developer with OpenAI, you can use the OpenAI API much more frequently for developing applications that leverage this technology.
(24:29):
So in that case, whereas with the user interface you only get 50 messages per week with o1-preview, with the OpenAI API you can call the o1-preview model at up to 100 requests per minute. That's many orders of magnitude more. And o1-mini you can call even more frequently, at up to 250 requests per minute.
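When building against per-minute limits like these, it's common to pace requests on the client side rather than waiting to be rejected. Below is a minimal, hypothetical pacing helper (not part of any OpenAI SDK, and real applications should also honor 429 responses and Retry-After headers); the clock and sleep functions are injectable so the demonstration runs instantly:

```python
import time
from collections import deque

class RateLimiter:
    """Client-side pacing: allow at most max_calls per period seconds.

    A sketch assuming limits like the 100 requests/minute mentioned
    for o1-preview; not an official client, just sliding-window pacing.
    """

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls   # e.g. 100
        self.period = period         # e.g. 60.0 seconds
        self.clock = clock           # injectable for testing
        self.sleep = sleep
        self.calls = deque()         # timestamps of recent requests

    def acquire(self):
        """Block until another request may be sent, then record it."""
        while True:
            now = self.clock()
            # Forget timestamps that have aged out of the window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return
            # Window is full: wait until the oldest request expires.
            self.sleep(self.period - (now - self.calls[0]))

# Demonstration with a fake clock so no real waiting happens.
t = [0.0]
waits = []

def fake_sleep(seconds):
    waits.append(seconds)
    t[0] += seconds

limiter = RateLimiter(max_calls=2, period=60.0,
                      clock=lambda: t[0], sleep=fake_sleep)
limiter.acquire()
limiter.acquire()        # two requests fit in the window
limiter.acquire()        # the third must wait out the full window
assert waits == [60.0]   # slept exactly once, for the whole period
```

You would call `limiter.acquire()` immediately before each API request; the deque keeps only timestamps inside the current window, so memory stays bounded by `max_calls`.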
(24:54):
All right. So how does this shift my own use of text-generating AI models? Well, I'm probably still going to use Claude 3.5 Sonnet for most everyday or creative tasks. But I will now be using o1 for any tasks that require slower, more detailed, thoughtful System 2-style thinking, like math and complex programming questions. And I'll still be using Google Gemini for anything that requires a huge context window, like passing in large audio or video files. So there you have it. Those are my top three text-outputting AI models: Claude 3.5 Sonnet, o1 from OpenAI, and Google Gemini.
(25:36):
So in conclusion, OpenAI's o1 model is a really big deal. It is, in my view, unquestionably the state of the art in AI capabilities today. And thanks to the potential of scaling "thinking time" at inference, we are not far off from even more staggering and world-changing AI models. Today, o1 demonstrates that we could have PhD-level thinkers in specific fields working 24/7 on tricky problems. And while they're relatively expensive today, these costs will come down very fast.
(26:08):
Scaled to longer thinking, these cheap, abundant sources of intelligence may soon have reasoning capabilities far beyond those of PhD students. Humming away 24/7 all around the world, working on our most pressing problems, these very intelligent, relatively inexpensive technological processes may be accelerating us towards artificial general intelligence, AGI, and the singularity surprisingly soon. Especially because one of the pressing problems they could be applied to, as I mentioned earlier in the episode, is building better AI systems. That could create a positive feedback loop that feeds on itself very, very rapidly indeed.
(26:51):
All right. That's it for today's episode. If you enjoyed today's episode or know someone who might, consider sharing this episode with them. Leave a review of the show on your favorite podcasting platform. Tag me in a LinkedIn or Twitter post with your thoughts. I'll respond to those. And if you aren't already, make sure to subscribe to the show. Most importantly, however, we hope you'll just keep on listening. Until next time, keep on rocking it out there and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.