(00:05):
This is Five-Minute Friday on the groundbreaking new natural language model LLaMA.
(00:19):
I hope you enjoyed our recent tour of GPT-4. In episode number 666, I introduced GPT-4’s remarkable capabilities in 10 minutes. Then in episode number 667, we had Vin Vashishta, an expert in AI monetization, on to talk about the unprecedented commercial opportunities associated with the release of GPT-4 to the public. And then in episode number 668, we had Jeremie Harris, an AI safety expert, come on to talk about the risks associated with the public release of GPT-4.
(00:51):
So today’s episode is all about a new model released on February 24th by Meta AI, the AI research organization within the big tech company Meta. This model is called LLaMA, like the South American animal, the llama. It’s a play on LLM, large language model, with a couple of extra A’s added in to make it LLaMA. So kind of a cute name there. And the idea with LLaMA is that it’s based on the Chinchilla scaling laws. Chinchilla was a big model released last year by Hoffmann and colleagues. Perhaps we’ll do an FMF in the future on Chinchilla, even though it’s something from last year, because the scaling laws in there are particularly interesting and maybe deserve an episode all to themselves.
(01:58):
And so what these scaling laws state is that training a smaller model for longer results in a better performing model. This is specific to large language model research. In the same way that GPT-4 is a large language model, and GPT-3 before it was, and Chinchilla is another example of one, these large language models are characterized by having a certain number of transformer blocks in them. Each of these blocks tends to be pretty large on its own in terms of the number of model weights, and you might have many dozens of them in a large language model. And so this means that it’s routine now for large language models to have tens or hundreds of billions of parameters.
(02:42):
But the idea with Chinchilla a year ago, as well as with this new LLaMA model, is that we should be able to get an even better performing model if we train it for longer, and that is true even if the model is relatively small. So I’ll get into some specifics shortly with LLaMA. LLaMA, like all the other top-performing natural language processing models of today, is a transformer architecture. In the LLaMA paper they talk about four different model sizes, ranging from 7 billion parameters up to 65 billion parameters. As a rough benchmark for comparison, GPT-3 has 175 billion model parameters. So the smallest LLaMA model, LLaMA 7B, has 7 billion parameters, which is just a 25th of the size of GPT-3. The biggest LLaMA model, with 65 billion parameters, is a bit more than a third of the size of GPT-3.
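The size comparisons in this segment are simple ratios, and they can be checked with a few lines of Python. This is only a sketch using the parameter counts quoted in the episode; the helper name `size_ratio` is made up for illustration.

```python
# Parameter counts quoted in the episode, in billions.
GPT3_PARAMS_B = 175
LLAMA_PARAMS_B = {"LLaMA 7B": 7, "LLaMA 13B": 13, "LLaMA 65B": 65}


def size_ratio(big: float, small: float) -> float:
    """Return how many times larger `big` is than `small`."""
    return big / small


if __name__ == "__main__":
    for name, params_b in LLAMA_PARAMS_B.items():
        ratio = size_ratio(GPT3_PARAMS_B, params_b)
        print(f"GPT-3 has about {ratio:.1f}x the parameters of {name}")
```

Run as a script, this shows GPT-3 at 25 times the size of LLaMA 7B, roughly 13 times LLaMA 13B, and a little under 3 times LLaMA 65B.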
(03:55):
So what they wanted to do in this experiment is say, okay, let’s start with this Chinchilla principle of taking relatively small models. At 65 billion, that still is pretty big, but the idea is to take models that aren’t absolutely enormous and train them for even longer than was recommended in the Chinchilla paper from a year ago with its scaling laws. So they wanted to experiment and see what happens if we take these large language models and train them for way longer than people have before. And they used entirely open source data for this, which is different from the other top-performing models of last year. Chinchilla I already mentioned, and GPT-3 I already mentioned. By the way, if you want to hear a big, long SuperDataScience episode featuring one of the GPT-3 first authors, Melanie Subbiah, you can check that out in episode number 559. So that was one of the big models last year. The other one, alongside Chinchilla and GPT-3, was PaLM, and that was out of Google. I talked about that in detail in episode number 568. For all three of those models, Chinchilla, GPT-3 and PaLM, it’s never been published exactly what data were used to train them.
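One rough way to put a number on "training for longer" is the ratio of training tokens to model parameters. The sketch below computes that ratio. Note that the token counts are not stated in this episode: they are assumptions based on the figures reported in the Chinchilla and LLaMA papers, and the helper name `tokens_per_param` is made up for illustration.

```python
# (parameters in billions, training tokens in billions)
# ASSUMPTION: token counts are not from the episode; they are the figures
# reported in the Chinchilla and LLaMA papers and should be double-checked.
TRAINING_RUNS = {
    "Chinchilla 70B": (70, 1400),
    "LLaMA 7B": (7, 1000),
    "LLaMA 13B": (13, 1000),
    "LLaMA 65B": (65, 1400),
}


def tokens_per_param(params_b: float, tokens_b: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens_b / params_b


if __name__ == "__main__":
    for name, (params_b, tokens_b) in TRAINING_RUNS.items():
        ratio = tokens_per_param(params_b, tokens_b)
        print(f"{name}: about {ratio:.0f} tokens per parameter")
```

Under these assumed figures, the smaller LLaMA models see far more tokens per parameter than Chinchilla did, which is where the "train for longer" effect is most pronounced.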
(05:21):
However, a cool thing with the LLaMA paper is that they did tell us. Most of the data come from the English Common Crawl dataset. This is a big, publicly available web crawl, and it constituted 67% of the training data for the LLaMA models. They also used a dataset called C4, which is a cleaned, pre-processed variant of the Common Crawl, and that took up 15% of the data. So between the two of them, the regular raw English Common Crawl and the processed C4 dataset, 82% of the training data came from the Common Crawl. You might be wondering why they would use pre-processed Common Crawl data as well as raw data. They did that because previous studies have shown that combining differently processed variants of a natural language dataset, the raw data as well as a pre-processed version of it, improves model performance.
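The percentages in this segment can be tallied in a few lines. This is a minimal sketch that uses only the two shares quoted in the episode; the function names are made up for illustration, and the breakdown of the remaining share across the other sources is not given in the episode, so it is left as a single number.

```python
# Shares of the LLaMA training data quoted in the episode, in percent.
QUOTED_SHARES = {"English Common Crawl": 67, "C4": 15}


def common_crawl_share(shares: dict[str, int]) -> int:
    """Total share of data derived from the Common Crawl, in percent."""
    return sum(shares.values())


def remaining_share(shares: dict[str, int]) -> int:
    """Share left over for all other data sources, in percent."""
    return 100 - sum(shares.values())


if __name__ == "__main__":
    print(f"Common Crawl derived: {common_crawl_share(QUOTED_SHARES)}%")
    print(f"All other sources:    {remaining_share(QUOTED_SHARES)}%")
```

So the two Common Crawl variants account for 82% of the mix, leaving 18% for everything else.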
(06:31):
So they included both of those Common Crawl datasets. They also included GitHub to have some code in the training data. They included Wikipedia, and they included two big book corpora, Project Gutenberg as well as another one called Books3. They included arXiv to have lots of science information in there, and they included Stack Exchange to have lots of technical questions and answers available in the training data as well. So all those open source datasets were combined together to train these LLaMA models. It’s very cool that you could theoretically train these models yourself, or a similar kind of transformer model, using the open source guidelines that the LLaMA authors from Meta have provided. You probably wouldn’t want to do that yourself, though, unless you have an absolutely enormous budget like Meta does, because the whole point here is that they’re training these models for way longer than ever before, much longer than even the Chinchilla paper from last year, whose whole point, again, was to train for a really long time.
(07:35):
So did it work? Did training for super long create a better model? Yes, it did. For example, the 13 billion parameter variant of LLaMA, LLaMA 13B, outperforms GPT-3, which again has 175 billion model parameters. So GPT-3 has 13 times as many model parameters, and LLaMA 13B outperforms it on most of the benchmarks that these authors tested on. So that’s wild. This means that with LLaMA 13B, you can get GPT-3-like performance on a single GPU, whereas GPT-3 requires many GPUs, like a dozen GPUs. I mean, it depends on exactly what kind of GPU you’re using, but that’s the kind of scale that we’re talking about here. We’re talking about being able to compress a model down into a 13th of the space and get the same kind of results.
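A back-of-the-envelope calculation shows where the "single GPU versus a dozen GPUs" scale comes from. This is only a sketch under stated assumptions that are not from the episode: weights stored as 16-bit floats (2 bytes per parameter), a hypothetical GPU with 32 GB of memory, and only the weights counted, ignoring activations and other runtime overhead. The function names are made up for illustration.

```python
import math

# ASSUMPTION: weights are stored as 16-bit floats, i.e. 2 bytes per parameter.
BYTES_PER_PARAM_FP16 = 2
# ASSUMPTION: a hypothetical GPU with 32 GB of memory.
GPU_MEMORY_GB = 32


def weights_gb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate size of the weights alone in GB (1 GB = 1e9 bytes)."""
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    return params_billion * bytes_per_param


def gpus_needed(params_billion: float, gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Minimum number of GPUs needed just to hold the weights."""
    return math.ceil(weights_gb(params_billion) / gpu_memory_gb)


if __name__ == "__main__":
    for name, params_b in {"LLaMA 13B": 13, "GPT-3": 175}.items():
        print(f"{name}: ~{weights_gb(params_b):.0f} GB of weights, "
              f"at least {gpus_needed(params_b)} GPU(s)")
```

Under these assumptions, LLaMA 13B needs about 26 GB and fits on one such GPU, while GPT-3 needs about 350 GB, or at least 11 of them. A larger-memory GPU or a different numeric precision would change the counts, which is why the exact number depends on the hardware.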
(08:35):
And so this is, again, from training for longer. It’s showing that we’re getting these really, really, really powerful results. So thanks to the Chinchilla folks from last year for opening that idea up. As another example, remember that the biggest LLaMA model is LLaMA 65B, the 65 billion parameter model, and that one is competitive with the very best models. This was before GPT-4 was released: LLaMA was released on February 24th, and GPT-4 was released on March 14th. So at least by the end of February, the top-performing natural language models that we had were Chinchilla 70B, the 70 billion parameter model that was the biggest one from Chinchilla last year, and the PaLM model from Google, which I already mentioned. And that one’s gigantic. It has over half a trillion model parameters, 540 billion model parameters.
(09:32):
So this LLaMA 65 billion parameter model was competitive with this gigantic PaLM one. Now, I suspect that if they were to retest today, now that GPT-4 has been released, GPT-4 would probably outperform LLaMA 65B. But we don’t even know how big GPT-4 is, because OpenAI has not published how many model parameters it has. Who knows, maybe they’ve even caught on to this kind of literature and are training a smaller model than GPT-3 for a much longer time. It’s possible. I haven’t read that anywhere. It’s complete conjecture, and conjecture is all I can offer given that GPT-4 details have not been published. So these LLaMA models are a big deal. The main point is that if you train a large language model for much longer, you end up with much higher performance, even if that model is, in the scheme of contemporary LLMs, relatively small.
(10:35):
So in my view, the best takeaway here is that the LLaMA 13 billion parameter model, LLaMA 13B, gets GPT-3-level performance. Inference is relatively cheap, it has a smaller carbon footprint, and you can train it relatively inexpensively because, once again, you can fit it onto a single GPU instead of the dozen or so GPUs that you’d have to fit GPT-3 on. This should be really exciting to you because it means that you can fine-tune your own proprietary state-of-the-art LLM by using LLaMA model weights. And you might be wondering, well, did they publish the model weights? Can I just download those? Well, not exactly. The Meta AI folks said that you could reach out to them and, if you were a researcher, they would provide the LLaMA model weights to you. But somebody leaked these model weights, and so they can be torrented.
(11:39):
And so the gotcha here is that Meta AI does not permit commercial use of LLaMA or derivatives of it. So if you’re planning on using this model commercially, there is that limitation. But in terms of research, you can absolutely use those model weights yourself. And indeed, several research groups have taken those model weights and fine-tuned them in really exciting ways using really thoughtful datasets. Those models include things like Alpaca, Vicuna and GPT4All. These are models that build on LLaMA, on the LLaMA 13 billion parameter model in particular, and fine-tune it on some clever datasets, and they get even better performance than LLaMA 13B. So they do better than GPT-3.5, and in some cases GPT-4, on some tests without needing to become very large. So really exciting. That’s the topic I’m going to talk about in next week’s Five-Minute Friday: I’m going to dig into those Alpaca, Vicuna and GPT4All models that build on the already really exciting LLaMA results that I talked about today.
(13:08):
All right, I hope you enjoyed this Five-Minute Friday on LLaMA. Until next time, keep on rocking out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.