SDS 678: StableLM: Open-source “ChatGPT”-like LLMs you can fit on one GPU

Podcast Host: Jon Krohn

May 12, 2023

Another week, another LLM! This episode, Jon explores StableLM, the new family of open-source language models from the creators of Stable Diffusion. Small, powerful, and trained on an unprecedented amount of data for single-GPU LLMs, these models punch well above their weight. Tune in to learn more about the mechanics behind this groundbreaking family of LLMs.

 
The team behind the AI image generator Stable Diffusion recently released their first family of open-source language models, StableLM, and as Jon explains, these models are small but mighty! So far, Stability AI has released two model sizes, a 7 billion parameter model and a 3 billion parameter model, both trained on an unprecedented amount of data for single-GPU LLMs, which makes a big difference to their quality. They are comparable in size to the well-known Dolly 2.0 models, remaining small and easy to train while drawing on far more training data.

These single-GPU models can be fine-tuned on your own proprietary data, making them a great fit for your specific problem domain or customer use case and delivering GPT-4-like capabilities on your own infrastructure! If you’re as excited as we are about these groundbreaking language models, you won’t want to miss diving deeper into the mechanics behind them. Join us for an exciting deep dive into StableLM with Jon!

Podcast Transcript

(00:02):
This is Five-Minute Friday on StableLM.

(00:19):
Stability AI, the company best known for the super popular open-source text-to-image generator Stable Diffusion, just made headlines by releasing the first models from their open-source suite of StableLM language models. The name comes from their company name, Stability AI, plus LM for language model: StableLM, a stable language model. In previous episodes, I talked about single-GPU large language models, and these new StableLM models fall under that umbrella as well. They’re small enough to fit on one large GPU for training, and you could potentially even quantize them so that, at inference time, they’re small enough to run on a CPU. So these StableLM models are super, super efficient, just like the LLMs covered in previous episodes. For example, in episode number 672, I talked about the GPT4All-J and Dolly 2.0 LLMs, which are similar in size, and these new StableLM models are also similar to GPT4All-J and Dolly 2.0 in that all three of these model families are acceptable for commercial use.
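As a rough illustration of that single-GPU point, here is a minimal sketch, not from the episode, of loading a StableLM checkpoint in 8-bit so it fits comfortably in the memory of one GPU. It assumes the Hugging Face transformers, accelerate, and bitsandbytes packages and assumes the stabilityai/stablelm-tuned-alpha-7b repo name on the Hugging Face Hub.

```python
# Hedged sketch: load a StableLM checkpoint with 8-bit quantization
# so the 7B model fits on a single GPU. Repo name is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stabilityai/stablelm-tuned-alpha-7b"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place layers on the available device(s)
    load_in_8bit=True,   # 8-bit quantization via bitsandbytes to shrink the memory footprint
)
```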
(01:42):
So this means that, using the parameter-efficient fine-tuning that I talked about in episode number 674, you can quickly and easily fine-tune GPT4All-J, Dolly 2.0, or this new StableLM family on your own proprietary data, your own problem domain, or your own customer use case. So super useful. You can have GPT-4-like capabilities for your own particular use case running on your own infrastructure. So why am I back again today talking about another new model? Why is StableLM newsworthy beyond the GPT4All-J and Dolly 2.0 models that I talked about previously?
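To make that parameter-efficient fine-tuning idea concrete, here is a hedged sketch using LoRA adapters via the Hugging Face peft library, assuming a GPT-NeoX-style StableLM checkpoint already loaded as `model` (as in the snippet above); the adapter settings shown are illustrative defaults, not Stability AI's recipe.

```python
# Hedged sketch: wrap an already-loaded causal LM with LoRA adapters so
# only a small fraction of the weights are trained on your own data.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                 # low-rank adapter dimension (illustrative)
    lora_alpha=32,                       # adapter scaling factor
    target_modules=["query_key_value"],  # attention projection in GPT-NeoX-style blocks
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```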
(02:30):
Well, like some of those models that I talked about previously, these new StableLM models are very small, so they are much easier to work with on a single GPU, faster to train, and you’re going to have faster inference times. The two model sizes that have been released so far in the StableLM family are a 7 billion parameter model and, critically, a 3 billion parameter model. So that is like the Dolly 2.0 family that I talked about previously. The big difference is that, in addition to being small, portable, and easy to train, these new StableLM models are trained on an unprecedented amount of training data for these kinds of single-GPU LLMs. We know from previous episodes, like the Chinchilla Scaling Laws episode, number 676, as well as the LLaMA episode, number 670, that training for much longer, meaning a training dataset with at least 20 times more tokens than the model has parameters, can make a big difference to the quality of the model.
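As a quick back-of-the-envelope check on that 20-to-1 rule of thumb, here is a tiny, illustrative calculation for the two StableLM sizes mentioned here:

```python
# Rough Chinchilla-style check: about 20 training tokens per model parameter.
for params in (3e9, 7e9):
    min_tokens = 20 * params
    print(f"{params / 1e9:.0f}B params -> at least {min_tokens / 1e9:.0f}B training tokens")
# 3B params -> at least 60B training tokens
# 7B params -> at least 140B training tokens
```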
(03:49):
And with the StableLM family, Stability AI went even further. For the first step of training, they have a crazy amount of data. Remember that for any of these GPT-4-like models, there are two training steps. The first step is pre-training and the second step is fine-tuning. In the pre-training step, we give the model a broad range of natural language to work with so it can begin to understand the kinds of things humans would like to see. But it’s really in the second step, fine-tuning, particularly with the approach called reinforcement learning from human feedback, that we provide examples of exactly the kind of output we would ideally like the model to produce in a given circumstance. And it’s that fine-tuning that gets us from something more like GPT-3-caliber outputs to GPT-4-caliber outputs.
(04:53):
So let’s talk about the first step first, pre-training. Architectures like Dolly were pre-trained on a really big open-source dataset called The Pile. But for this new StableLM architecture, the folks at Stability AI created an even bigger dataset. It’s three times larger, with 1.5 trillion tokens in it. You can hear more about tokens in episode number 626, where I talk about Subword Tokenization with Byte-Pair Encoding. For the purpose of this episode, you can simply think of a token as a word, but in practice it’s really subwords, so components of words. So this new dataset that Stability AI came up with has 1.5 trillion of these tokens, which makes it three times bigger than The Pile that was previously used for the pre-training of most of these single-GPU LLMs.
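To see what "tokens are really subwords" looks like in practice, here is a small, hedged illustration using the Hugging Face transformers tokenizer API; the stabilityai/stablelm-base-alpha-7b repo name is an assumption, and the exact splits you get will depend on the model's tokenizer.

```python
# Hedged sketch: show how words break into subword tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-7b")  # assumed repo id
for word in ["language", "tokenization", "StableLM"]:
    print(word, "->", tokenizer.tokenize(word))
# Common words often stay whole; rarer words split into several subword pieces.
```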
(05:48):
For example, for the 3 billion parameter StableLM model, they used 800 billion tokens from their big new 1.5 trillion token dataset. That’s 267 times more tokens than there are parameters in the model. You might remember from the Chinchilla Scaling Laws episode, number 676, that that research showed that having 20 times more tokens than parameters gave great results, and that training beyond that, like they did with the LLaMA language models covered in episode number 670, can give you even better results. With StableLM, Stability AI have gone even further, to a ratio that is, as far as I know, unprecedented for a model of this size: 267 times as many training tokens as parameters for the 3 billion parameter StableLM model. This should lead to an amazing model. And that’s just for the pre-training step.
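Here is a quick sanity check of the tokens-per-parameter ratio quoted here, just illustrative arithmetic:

```python
# Verify the ratio quoted in the episode for the 3B StableLM model.
params = 3e9    # 3 billion parameters
tokens = 800e9  # 800 billion pre-training tokens
print(round(tokens / params))             # ~267 tokens per parameter
print(round(tokens / (20 * params), 1))   # ~13.3x the 20:1 Chinchilla guideline
```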
(06:56):
So the pre-training step will land us with a model that’s comparable in performance to GPT-3, and people were very impressed with GPT-3 when it came out a couple of years ago. But it’s the second training step, the fine-tuning, that, as I mentioned earlier in the episode, is the key to getting these really intuitive ChatGPT-like or GPT-4-like results. To do that fine-tuning effectively, the Stability AI folks use the Alpaca procedure. I talked about the Stanford Alpaca model and dataset back in episode number 672. So they follow that procedure, but, just like in their pre-training step, for this fine-tuning step Stability AI are again using way more data than any comparably sized LLM before.
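For a flavor of what instruction-tuning data looks like under the Alpaca procedure, here is a hedged sketch of the Alpaca-style prompt template; the exact template Stability AI used isn't specified in the episode, and the example pair below is made up for illustration.

```python
# Hedged sketch: format an instruction-response pair with an Alpaca-style template.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{response}"
)

example = {  # hypothetical training pair, for illustration only
    "instruction": "Summarize the benefits of small, single-GPU language models.",
    "response": "They are cheaper to train, faster at inference, and easier to fine-tune on proprietary data.",
}
print(ALPACA_TEMPLATE.format(**example))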
(07:54):
So for this new StableLM family, they’re using the Alpaca data, plus the 800,000 instruction-response pairs from GPT4All, the 52,000 exemplary instruction-response pairs from the ShareGPT tool, the 15,000 that were used in Dolly, and HH, the helpful-and-harmless dataset provided by Anthropic, a startup that is particularly focused on creating large language models that are ethical. So that ethics-focused dataset is in there too. In total, there are about a million instruction-response pairs for the StableLM models to fine-tune on, which is on the order of a hundred times more instruction-response pairs than Dolly 2.0, the benchmark that I’ve been weighing StableLM against throughout this episode.
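If you wanted to experiment with pooling instruction datasets like these yourself, a hedged sketch with the Hugging Face datasets library might look like the following; the Alpaca and Dolly repo ids are public datasets, while the other sources mentioned here are left as comments since their exact repo names vary.

```python
# Hedged sketch: load two of the public instruction datasets mentioned above.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")                 # ~52K pairs
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")   # ~15K pairs
# The GPT4All, ShareGPT, and Anthropic HH sets would be loaded similarly,
# then mapped onto a common instruction/response schema before mixing.
print(len(alpaca), len(dolly))
```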
(08:52):
So you can anticipate that this new StableLM model family is super powerful. It’s open source, and it’s available to you today. I’ve provided a link to the Hugging Face repository so you can go grab these models straight away and start playing with them yourself. You can even go right now to chat interactively with the fine-tuned 7 billion parameter StableLM model, without downloading anything, right in your browser. I’ve got a link for that in the show notes. So that’s something exciting that you can do today. Play with it. I have a feeling you’re going to be really impressed with the quality of the StableLM models, and then maybe consider fine-tuning one for your own proprietary purposes.
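If you would rather prompt the tuned model locally than in the browser, here is a minimal generation sketch, assuming `model` and `tokenizer` were loaded as in the earlier snippet; the <|USER|>/<|ASSISTANT|> chat markers follow the tuned model's documented prompt format, but check the model card for the exact system prompt before relying on this.

```python
# Hedged sketch: generate a reply from the fine-tuned StableLM model.
prompt = "<|USER|>Explain what makes single-GPU LLMs useful.<|ASSISTANT|>"  # assumed chat format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```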
(09:41):
In addition, I’ve got good news for you: there are more models coming down the pike. Stability AI has already released the 3 billion and 7 billion parameter architectures, and they have 15 billion, 30 billion, and 65 billion parameter models in progress. On top of that, they’re planning a 175 billion parameter model, which would be on par with the original GPT-3 architecture in size. Now, size isn’t everything; as we know, training dataset size and training time make a big difference as well, as do these fine-tuning steps. So I don’t think we need a model that’s 175 billion parameters to meet or exceed the performance of GPT-3. Indeed, it wouldn’t surprise me if the 7 billion parameter, maybe even the 3 billion parameter, StableLM model could outperform GPT-3 on many benchmarks and maybe even get comparable performance to GPT-4 in a lot of cases.
(10:44):
So it’s really exciting to have organizations putting so much effort into curating these enormous open-source datasets, making them available to us, and releasing the model weights so we can get going with them right away. Thanks to Ed Donner, one of my co-founders at Nebula, my machine learning company, for pointing me in the direction of StableLM and making sure I didn’t miss the chance to create an episode for you folks on it. It’s a big new innovation, and potentially very powerful for you to work with.
(11:24):
So that’s it for my update for today. Until next time, keep on rocking out there folks, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon. 