SDS 676: The Chinchilla Scaling Laws

Podcast Guest: Jon Krohn

May 5, 2023

DeepMind, Chinchilla AI, and fine-tuning large language models on proprietary tasks: On this week’s Five-Minute Friday, host Jon Krohn explores this powerful large language model family, which outperforms Gopher while using just a quarter of its parameters.

 
One of the drawbacks to using large language models is their sheer size – and therefore cost. A model like Gopher, a DeepMind creation, has 280 billion model parameters. So, how can users run such models on a budget more modest than those deployed in Silicon Valley? Chinchilla, another DeepMind large language model family, managed to outperform Gopher across a series of natural language tasks while using just a quarter of the model parameters.
Its success can be explained by the Chinchilla Scaling Laws, the principles of which host Jon Krohn details in this episode, and which were used by Cerebras in their newly released family of models called Cerebras-GPT. The key with Cerebras-GPT is that it is fully open source and has a commercial use-friendly license, meaning you can take those model architectures and fine-tune them on your own proprietary data if you want to get near-ChatGPT-quality performance.
Listen in to hear more about how these Chinchilla Scaling Laws enable incredible performance, what tokenization has to do with improving natural language model performance, and how Chinchilla might circumvent potential limitations. 


Podcast Transcript

(00:06):
This is Five-Minute Friday on the Chinchilla Scaling Laws. 

(00:19):
Back in episode number 670, on a model architecture called LLaMA, I talked about the Chinchilla Scaling Laws for the first time. To give you some more context on this really important concept: the Chinchilla Scaling Laws come from research by DeepMind that was published in March 2022. So this research is about a year old, but you’ll see by the end of the episode why it’s super relevant to everything in machine learning today, particularly with respect to large language models like the kinds of models we’re seeing behind ChatGPT, GPT-4, and so on.
(01:00):
So the idea with this Chinchilla paper was that these DeepMind researchers wanted to find out how they could optimally pre-train a large language model for a broad range of natural language generation tasks given a fixed compute budget. To run this experiment, and it’s a massive experiment, they trained 400 different transformer architectures. Transformers are the deep learning model structure that we build up into a large language model, so effectively they trained 400 large language models. And they varied the size of these models quite a bit: they had models as small as 70 million parameters, which is still pretty large, ranging up to 16 billion parameters. And in addition to varying the parameter count from 70 million up to 16 billion, they also varied the amount of training data, measured in tokens. Tokens are parts of words; you can hear all about this concept of tokenization from words into subwords in episode number 626, but for our purposes today you can think of a token as roughly a word.
(02:25):
And so these DeepMind authors worked with dataset sizes ranging from 5 billion tokens up to 500 billion tokens. Across these 400 different training runs, they determined that the compute-optimal ratio of training tokens to model parameters is 20:1, that is, 20 tokens for every parameter in your model. This means that if you want to train a large language model with a billion parameters, you’re going to want about 20 billion tokens in your natural language training dataset. And because that 20:1 ratio holds at any model size, if we double our model size, we need to double our dataset size: going from 1 billion parameters to 2 billion parameters means growing the ideal training dataset from 20 billion tokens to 40 billion tokens. So hopefully that is crystal clear.
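To make that arithmetic concrete, here is a minimal Python sketch of the 20:1 rule of thumb described above; the function name and the example sizes are purely illustrative, not taken from the paper’s tables.

```python
# Illustrative sketch of the 20:1 tokens-to-parameters rule of thumb.
TOKENS_PER_PARAMETER = 20

def chinchilla_optimal_tokens(n_parameters: int) -> int:
    """Approximate compute-optimal training set size, in tokens."""
    return TOKENS_PER_PARAMETER * n_parameters

for params in (1_000_000_000, 2_000_000_000, 70_000_000_000):
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B parameters -> ~{tokens / 1e9:,.0f}B tokens")
# 1B parameters  -> ~20B tokens
# 2B parameters  -> ~40B tokens
# 70B parameters -> ~1,400B tokens (roughly Chinchilla's 1.4 trillion training tokens)
```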
(03:45):
So the reason why these are called the Chinchilla Scaling Laws is that, based on the laws I just expounded for you, the DeepMind researchers in the same paper, which of course I’ve linked to in the show notes, created a model that they called Chinchilla, with 70 billion model parameters. And the reason it was called Chinchilla is that they were comparing it against an existing model called Gopher that had 280 billion model parameters. So at 280 billion parameters, Gopher was four times Chinchilla’s size, but it was trained on only about a quarter of the data that Chinchilla was. The Chinchilla authors took what they learned from their enormous experiment over those 400 training runs and scaled the law they had determined up to a really big large language model, a 70-billion-parameter model. And when they trained it according to their scaling laws, it saw roughly four times the data of Gopher while being only a quarter of the model size.
(04:52):
So this was compute-optimal training. And the important thing about this compute-optimal training is that, for roughly the same cost as training Gopher, Chinchilla was able to uniformly defeat it across every task they evaluated. There are these benchmark natural language tasks, which you can read more about in the paper, but across all of them Chinchilla crushed Gopher despite being a quarter of the size and costing about the same to train. And remember that this paper is just a little over a year old. At the time, they also compared Chinchilla to other state-of-the-art models like GPT-3 and Megatron-Turing NLG, and those models are also many times larger than Chinchilla, but Chinchilla still performed better because of this key thing of having way more training data relative to model size. So again, that 20:1 token-to-parameter ratio defines the Chinchilla Scaling Laws in a nutshell.
(05:58):
So this has important real-world implications as well, because not only do you now have a rule of thumb for how much training data you’ll want to optimally train a model of a given size, it also means that models don’t need to be as big as we might have thought. A model like Chinchilla, with only 70 billion parameters compared to Gopher’s 280 billion or GPT-3’s 175 billion, is cheaper to fine-tune on your own proprietary task and cheaper to use at inference time. That broadens the range of viable applications, including commercial applications, for these models.
(06:49):
Now, getting to the new news, because everything I’ve just told you is old news, although it’s really important today: on March 28th, a company called Cerebras released a family of models called Cerebras-GPT. Like Dolly 2.0 and GPT4All-J, which I covered in episode number 672 a fortnight ago, and no doubt like many more open-source model architectures coming out right now for natural language generation tasks, Cerebras-GPT has an open-source model architecture, open-source training data, open-source model weights, and a commercial use-friendly Apache 2.0 license.
(07:43):
So that’s really key, especially when you compare it with other well-known, relatively small, single-GPU LLMs like LLaMA and Alpaca, which I also talked about in that episode two weeks ago, number 672, and of course in the LLaMA episode, number 670. The key thing here is that with these newly released, fully open-source models and their permissive, commercial use-friendly licenses, you can now take those model architectures and fine-tune them on your own proprietary data, which could be specific to a particular natural language generation task you’d like to be able to perform, perhaps for yourself or for users of a software platform you’ve developed.
(08:33):
The key thing that’s new about the Cerebras-GPT release is that the models they released follow these Chinchilla Scaling Laws, meaning they scaled the dataset size along with the model size. Cerebras released seven models, all of which are available on Hugging Face, and I’ve provided a link so you can easily import them into your Python code in just a line or two (a minimal sketch follows below). These seven models range from 111 million parameters up to 13 billion parameters. That 13 billion size is starting to be about as big a model as you can fit on a single large GPU, and it’s comparable to the size of LLaMA and Alpaca, which I’ve already talked about. So that kind of size, around 13 billion, is going to be your best shot at fine-tuning a model to have the breadth of capabilities of something like ChatGPT. And if you’re looking for a relatively inexpensive way to do that kind of training efficiently, you can check out my episode last week on parameter-efficient fine-tuning, episode number 674.
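As a quick illustration of how easily these checkpoints can be pulled into Python, here is a minimal sketch using the Hugging Face transformers library; the model ID shown assumes Cerebras’ naming on the Hub, and the prompt is just an example.

```python
# Minimal sketch: loading the smallest Cerebras-GPT checkpoint from Hugging Face
# and generating a short continuation. Requires `pip install transformers torch`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cerebras/Cerebras-GPT-111M"  # smallest of the seven sizes (assumed Hub ID)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The Chinchilla Scaling Laws tell us that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping in one of the larger checkpoints is just a matter of changing the model ID, with memory requirements growing accordingly.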
(09:41):
So if you want to be getting near ChatGPT-quality performance, you’re going to want to use one of these big new Cerebras-GPT models. If you want to experiment with smaller large language models, which you could very easily fit onto a single GPU for training and then deploy very efficiently into production, maybe even onto edge devices like phones, you can, because these models go down to as small as just 111 million parameters. With these smaller Cerebras-GPT models, you can experiment with having a model perform a narrower range of domain-specific natural language generation tasks that might be important to you. So you have this trade-off that you can now play with across these seven models from Cerebras-GPT, varying in size from 111 million model parameters up to 13 billion model parameters, and you can have the peace of mind that because they were trained following these scaling laws, you’re getting a compute-optimal model to fine-tune.
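If you do go down the fine-tuning route, here is a minimal sketch of the parameter-efficient approach discussed in episode 674, using the Hugging Face peft library’s LoRA implementation; the model ID, target module name, and hyperparameters are illustrative assumptions, not a recipe from Cerebras.

```python
# Minimal LoRA sketch with Hugging Face peft; hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("cerebras/Cerebras-GPT-1.3B")
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank adapter matrices
    lora_alpha=16,              # scaling factor applied to the adapter updates
    target_modules=["c_attn"],  # attention projection in GPT-2-style blocks (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights will be updated
```

From there you would train the adapted model on your own proprietary dataset with your usual training loop, updating only the small adapter matrices rather than all of the base model’s weights.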
(10:47):
So yeah, a really clever new thing from Cerebras: varying the dataset size at that 20:1 ratio we learned from the Chinchilla Scaling Laws across their seven different models. And it sounds like they might, in the future, also release models much bigger than 13 billion parameters; those could be in the works. So this has an interesting implication, which I read about in a blog post called AI Brick Wall that I’ll also provide in the show notes. The idea is that if we follow these Chinchilla Scaling Laws as we keep trying to scale up our current dense transformer-based large language model approaches, there’s a practical limit because of cost. If we wanted to train a 10-trillion-parameter model according to the Chinchilla Scaling Laws, there would be something like a $30 billion cost associated with training it, which is wild. So there’s probably a prohibitive cost associated with training a model that large. And in addition, because we want that 20:1 ratio of training data to model size, it’s probably completely impractical for there to be that much training data out there; I think it’s far beyond the amount of non-synthetic training data that would be available for training such an enormous model compute-efficiently according to these Chinchilla Scaling Laws.
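To get a feel for where a figure like that comes from, here is a rough back-of-the-envelope sketch; the 6 × N × D FLOPs approximation for dense transformer training and the dollars-per-exaFLOP price are my assumptions, not numbers taken from the AI Brick Wall post.

```python
# Rough back-of-the-envelope training cost under the Chinchilla 20:1 ratio.
# Both the 6 * N * D FLOPs estimate and the price per exaFLOP are assumptions.
def chinchilla_training_cost_usd(n_parameters: float, usd_per_exaflop: float = 3.0) -> float:
    tokens = 20 * n_parameters              # Chinchilla-optimal dataset size
    flops = 6 * n_parameters * tokens       # common estimate for dense transformer training
    return flops / 1e18 * usd_per_exaflop   # convert total FLOPs to dollars

print(f"${chinchilla_training_cost_usd(10e12) / 1e9:.0f} billion")  # ~36 billion for a 10T-parameter model
```

With these assumptions, you land in the same tens-of-billions-of-dollars ballpark as the $30 billion figure mentioned above.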
(12:25):
So there you go. I hope you enjoyed another Five-Minute Friday digging into large language models and how you can use small, open-source versions of them. In today’s episode we talked about even smaller, compute-efficient models than ever before, and last week I provided you with approaches for parameter-efficient fine-tuning, for getting these relatively small LLMs tuned to the task that’s of most interest to you. So hopefully this is super valuable for you. Get your brain cells flowing on how you can apply these new techniques and these amazing new open-source models to a huge range of natural language generation and generative AI tasks for you and for your users. All right, that’s it for today. Until next time, keep on rocking it out there, folks. And I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.