(00:02)
This is episode number 666, releasing the demon… I mean on the release of GPT-4.
(00:19)
So episode number 666 is here. The number of the beast. Are we unleashing something demonic? Well, to some, GPT-4 does portend something sinister. OpenAI, however, spent six months trying to make the algorithm as safe as possible, so it should do much more good than harm. Indeed, their investigations suggest that GPT-4 is 82% less likely than its predecessor GPT-3.5 to respond to requests for disallowed content. One of the big issues with these models has been hallucinations: confidently stating made-up content as fact. This issue has also been attenuated with GPT-4, which is 40% more likely to produce factually correct responses than GPT-3.5. Despite these ongoing, albeit now considerably diminished, issues with unwanted sociodemographic biases, harmful content and hallucinations, the GPT-4 model is undeniably incredible.
(01:26)
I reran queries that had impressed me with GPT-3.5, and the GPT-4 responses absolutely blew them out of the water, with improved reasoning and consistency over long stretches. For example, GPT-4 summarized in simple language the abstract of an academic article I wrote on genetics years ago. It did it perfectly, and it seemed so human-like, so accurate, that I copied and pasted the text into GPTZero, an AI detection tool that's supposed to be able to identify when you've used a GPT model instead of your own creativity to write some text. And this GPTZero model predicted that 0% of that GPT-4 content was written by AI. It said it was a hundred percent human. But you don't have to take my qualitative experience as gospel.
(02:20)
OpenAI compared performance on a couple of dozen academic and professional exams, and GPT-4's ability to retain context over long stretches of natural language, plus its additional nuance, allowed it to drastically outperform GPT-3.5 on some of them. For example, on the Uniform Bar Exam, a legal exam in the US, GPT-3.5 scored at the 10th percentile, whereas GPT-4 scored at the 90th percentile. That's crazy. It means that nine in 10 examinees used to be above GPT-3.5, and with GPT-4, only one in 10 examinees is above it. In addition to being just way better at generating natural language, GPT-4 is also multimodal. This is something new for these OpenAI GPT models, anyway, and this multimodality means that it can handle more than just language.
(03:16)
So it also takes visual inputs. You can take a picture of the contents of your fridge and give GPT-4 the prompt "What can I make with this?", and GPT-4 will suggest lots of different recipes and prepared foods you could make with whatever you have in your fridge. There's also a video, linked in the show notes, showing how drawings can be turned into a functional website in minutes with GPT-4. That's crazy. And this visual capability also allowed it to perform much better on some of the standardized tests it undertook. I already mentioned the Uniform Bar Exam, which could be done with the natural language model alone, but for the Biology Olympiad its visual capabilities are what allowed it to really excel.
(04:07)
So GPT-3.5 scored in the 31st percentile on the Biology Olympiad, but GPT-4, with its visual capabilities, scored in the 99th percentile. Couldn't be topped. On top of all that, the much improved natural language performance and the multimodality, GPT-4 is also capable of performing in many languages, including the ability to translate between them. More specifically, GPT-4 beats GPT-3.5 and other large language models, such as DeepMind's Chinchilla and Google's PaLM, in 24 out of 26 languages studied, and that includes low-resource languages like Latvian, Welsh and Swahili. So how did OpenAI do it? Well, model parameter increases were likely a big factor. OpenAI has not released official numbers, at least not yet, but it's very likely that model parameters were scaled up orders of magnitude beyond the 175 billion parameters of GPT-3.
(05:12)
In addition to that, more data were fed into GPT-4 to train it, and those data were better curated than ever before. And critically, GPT-4 handles much more context. There's an 8k version that handles 8,000 tokens. If you want to understand more about tokens and how they relate to words (essentially, a few sub-word tokens can make up a single word), you can hear more about that in episode 626 of this podcast. But yeah, so there's an 8,000-token version of GPT-4, and there's also a 32,000-token version. That latter one, 32,000 tokens, corresponds to about 25,000 words, which is about 50 pages. Previously, the GPT-3.5 algorithm could only consider about 4,000 tokens. So we're talking about nearly an order of magnitude more context that the 32k version of GPT-4 can handle.
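To make those context-window numbers concrete, here's a small sketch of the token-to-word arithmetic. The 0.75-words-per-token ratio and the 500-words-per-page figure are common rules of thumb, not exact values; the true ratio depends on the tokenizer and the text.

```python
# Rule of thumb for GPT-style tokenizers: 1 token is roughly 0.75 English words.
# These are rough estimates, not exact conversions.
TOKENS_PER_WORD = 4 / 3  # i.e., ~0.75 words per token

def tokens_to_words(n_tokens: int) -> int:
    """Estimate how many English words fit in a context of n_tokens."""
    return round(n_tokens / TOKENS_PER_WORD)

def words_to_pages(n_words: int, words_per_page: int = 500) -> float:
    """Estimate page count at a typical ~500 words per page."""
    return n_words / words_per_page

for context in (4_000, 8_000, 32_000):
    words = tokens_to_words(context)
    print(f"{context:>6} tokens ~ {words:>6} words ~ {words_to_pages(words):.0f} pages")
```

Running this puts the 32k context at roughly 24,000 words, around 48 pages, in line with the "about 25,000 words, about 50 pages" figure mentioned above.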
(06:16)
On top of all that (the much-increased model parameters, the more plentiful, better-curated data and the ability to handle much more context), there's also a really nifty trick that's been applied to GPT-4 called Reinforcement Learning from Human Feedback. This can be abbreviated to RLHF, and it refers to fine-tuning a model with something called reinforcement learning. You can check out my YouTube channel; I've got an introduction to reinforcement learning there. But reinforcement learning is a kind of machine learning approach that allows models to adapt to a changing environment. Specifically in this case, if you've ever used ChatGPT, it gives you the ability to give a thumbs up or thumbs down on any output ChatGPT gives you. Those thumbs-up and thumbs-down signals provide data that, via this reinforcement learning procedure, refine the GPT-4 model's ability to output results more aligned with the kinds of results you were hoping for.
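The feedback loop just described can be sketched as a toy simulation. This is not OpenAI's actual RLHF training code (which fine-tunes the language model itself via a learned reward model); it's a minimal illustration of the idea that thumbs-up/down signals act as rewards that shift a policy toward the kinds of responses users approve of. The response "styles" and the simulated user are made up for illustration.

```python
import random

random.seed(0)

# Toy policy: an unnormalized preference score per response style.
styles = ["terse", "helpful", "rambling"]
scores = {s: 1.0 for s in styles}

def sample_style() -> str:
    """Sample a response style proportionally to its current preference score."""
    total = sum(scores.values())
    r = random.uniform(0, total)
    for s in styles:
        r -= scores[s]
        if r <= 0:
            return s
    return styles[-1]

def human_feedback(style: str) -> int:
    """Simulated user: thumbs up (+1) for helpful answers, thumbs down (-1) otherwise."""
    return 1 if style == "helpful" else -1

LEARNING_RATE = 0.1
for _ in range(500):
    style = sample_style()          # model produces a response
    reward = human_feedback(style)  # user clicks thumbs up or down
    # Reinforce styles that earn a thumbs up; suppress the rest.
    scores[style] = max(0.01, scores[style] + LEARNING_RATE * reward)

print(max(scores, key=scores.get))  # prints "helpful": the preferred style wins
```

After a few hundred rounds of feedback, the policy concentrates on the style users reward, which is the basic intuition behind using human preferences to align model outputs.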
(07:22)
And so this was initially used to great positive effect in a model from OpenAI called InstructGPT, which was better able to follow instructions. And now we see that same reinforcement learning from human feedback, RLHF, making a big difference in the GPT-4 model's alignment with the kinds of responses we'd like to have as well. So how do you access GPT-4? Well, you can get it with the ChatGPT Plus user interface. This has a monthly charge, about $20 a month in the US, and you can access GPT-4 right there in a point-and-click user interface. I've had a ton of fun playing with it in there. As I mentioned earlier in this episode, it blew out of the water things I was doing in GPT-3.5, and I was impressed with what I was doing with GPT-3.5.
(08:09)
So this GPT-4 really is incredible. If you haven't used it, I highly encourage you to. And if you want to use it beyond a user interface, say for serious software development or serious data science, you can apply for API access with a simple form. We've got a link to that form in the show notes. Several organizations with early access already have innovative new products that take advantage of GPT-4's unparalleled capabilities. For example, the language-learning app Duolingo has used GPT-4 to make learning languages more interactive and fun. There's an accessibility app for the visually impaired called Be My Eyes that's taken advantage of GPT-4, and the government of Iceland has used GPT-4 for an Icelandic language preservation program.
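For a sense of what using the API looks like once you have access, here's a minimal sketch assuming the `openai` Python package of that era. The helper below only assembles the request payload, so it runs without an API key; the commented-out lines show where the actual network call would go. The system prompt and the example user prompt are placeholders of my own.

```python
# A minimal sketch of a GPT-4 chat request, assuming the `openai` Python
# package. Only payload construction runs here; no API key or network needed.

def build_chat_request(user_prompt: str, model: str = "gpt-4") -> dict:
    """Assemble the messages payload for a chat-completion request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_chat_request("Summarize this abstract in simple language: ...")
print(request["model"], len(request["messages"]))

# To actually send the request once you have API access:
# import openai
# openai.api_key = "YOUR_API_KEY"  # from your OpenAI account
# response = openai.ChatCompletion.create(**request)
# print(response["choices"][0]["message"]["content"])
```

The same payload shape works from the point-and-click ChatGPT Plus interface's perspective too: a system message setting behavior, followed by your user prompt.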
(08:56)
So if you want to learn more about ways you can be using GPT-4 effectively within, say, the ChatGPT interface, you can refer back to episode number 660 for five data-scientist-specific tips and episode number 646 for more general tips that anyone can make use of. Now, with support for more programming languages than ever in GPT-4, and of course better nuance and accuracy, coding is easier than ever, so people without programming experience are creating, in minutes, simple games like Pong and simple apps like a drawing app or a film recommender. With all of these enhanced capabilities in GPT-4, those kinds of ChatGPT tips that I gave in those preceding episodes are more powerful than ever. GPT-4 has already enabled enormous leaps in product functionality at my machine learning company, Nebula, leaps that I wouldn't have imagined possible at all before GPT-4 was released.
(09:57)
So it's been, oh man, recent weeks have been amazing for my ability to feel creative and empowered with what I can be doing with data science and bringing it to life for users within my machine learning company. Coming up on Tuesday, in episode number 667, the very next episode of this show, we'll have Vin Vashishta back on the show; he was on a couple of years ago. This time he'll be coming on specifically to cover how you can use GPT-4 to build better products and to create commercial opportunities for yourself. And in the episode after that, number 668, we'll have Jeremie Harris on the show, and he'll be talking about how GPT-4 presents unique AI policy risks. Jeremie may have more info on why, even if GPT-4 itself isn't a danger, similarly powerful models that didn't go through OpenAI's meticulous AI safety research could be. Models like GPT-4 might be available open source in months, and if they haven't gone through the same kind of AI alignment and safety process, those six months that OpenAI put into trying to make GPT-4 safe, those open-source versions could be used for gravely malicious means on a previously unimaginable scale.
(11:22)
So perhaps the number 666 was fitting for this episode after all; we'll have to wait and see. All right, I hope you enjoyed this Five-Minute Friday on GPT-4, whether you're scared or excited or both. And until next time, keep on rocking it out there, folks, and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.