(00:00):
This is Five-Minute Friday on Llama 2.
(00:19):
Llama 2 was released by Meta, the big tech giant, on July 18th. The model is a big deal. I haven’t had an episode on an open-source LLM in a while, and there’s a reason I’m covering one today. So let’s go through the key characteristics of Llama 2. It’s open-source and, unlike the original LLaMA, it can also be used commercially, as long as you have fewer than 700 million monthly active users for whatever application you build with it. This clause basically prevents Meta’s biggest competitors, like Google, from using Llama 2, and that’s about it. Llama 2 is pre-trained on 2 trillion tokens of natural language data, which is 40% more than the original LLaMA.
(01:09):
Like the Alpaca and Vicuña models, which used LLaMA 1 as a pre-trained starting point — I talked about those in episode number 672 if you want to hear more — there’s a “Llama 2-chat” variant that is fine-tuned for chat applications. It takes the Llama 2 pre-trained base model and, unlike with LLaMA 1, Meta have now gone ahead and fine-tuned it for the chat use case themselves. So this is comparable to the Alpaca and Vicuña models that came before, except those weren’t made by Meta. This “Llama 2-chat” variant is really amazing in chat applications because it’s fine-tuned on a dataset of over a hundred thousand publicly available data points as well as over a million human annotations. So the Llama 2 model family has two different variants: the pre-trained models, which are not chat-fine-tuned, and the chat-fine-tuned models.
(02:20):
Within those two variants there are four sizes of Llama 2 model: 7 billion, 13 billion, 34 billion, and 70 billion parameters. The 7 and 13 billion parameter models, whether we’re talking about the pre-trained version or the chat-fine-tuned version, are really convenient because they fit on a single GPU, so they’re relatively inexpensive to train as well as to run for inference in production. The 34 billion parameter models were not released publicly by Meta, and I’ve got more coming up later on why I think that is. And finally, the 70 billion parameter models offer the best performance of any open-source LLM to date on a broad range of natural language generation benchmark tasks. So, got that? We’re talking about eight models here, because in the Llama 2 family we’ve got pre-trained variants as well as chat-fine-tuned variants.
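If you want a rough sanity check on that “fits on a single GPU” claim, here’s a back-of-the-envelope memory estimate in Python. The 2-bytes-per-parameter figure assumes 16-bit (fp16/bf16) weights, and the comparison GPUs are just my illustrative examples, not anything from Meta:

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the model weights."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# Llama 2 family sizes in billions of parameters.
for size in (7, 13, 34, 70):
    print(f"{size}B params at 16-bit: ~{weight_memory_gb(size):.0f} GB of weights")
```

So the 7B model needs roughly 14 GB just for weights (a single 24 GB consumer GPU works), the 13B roughly 26 GB (a single 40 GB data-center GPU works), while the 70B at roughly 140 GB needs multiple GPUs — which is why the two smaller sizes are the budget-friendly ones. Note this ignores activation and KV-cache memory, so real requirements are somewhat higher.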
(03:21):
And within both of those categories there are four model sizes: 7, 13, 34, and 70 billion parameters. The biggest Llama 2 variant, the 70 billion parameter one — specifically the chat-fine-tuned version — offers ChatGPT-level performance on a broad range of natural language benchmarks, which makes Llama 2 the first open-source model to do this convincingly. And you can experience this yourself via the free Hugging Face chat interface that runs the 70 billion parameter chat-fine-tuned Llama 2 model in the backend; I’ve got a link to that in the show notes. So this makes Llama 2, by a considerable margin, the leading open-source LLM, and the only one that can compete with ChatGPT. You can see the Llama 2 page, which I’ve also included in the show notes, for a table of results across 11 external benchmarks. According to Meta themselves — so perhaps take this with a grain of salt — the 13 billion parameter Llama 2 model, again one you can fit on a single GPU, is comparable across a range of benchmarks to the much larger 40 billion parameter Falcon model, which was previously the top-ranked open-source LLM.
(04:36):
And looking at that same table, the 70 billion parameter Llama 2 model sets the new state of the art, on some of these benchmarks by a considerable margin. One weak spot for Llama 2, however, is tasks involving code or math. Llama 2 is not going to compete with ChatGPT on those kinds of tasks, and it isn’t even the best open-source option out there for them. I’m sure Meta are thinking about that for a LLaMA 3 release. In terms of other cool properties, the Llama 2 models have time awareness — something I was reading about in the Llama 2 technical paper. If you ask the model “Is the earth flat or round?” and give it a context of the present day, 2023, it’ll answer that it’s round, of course; but if you ask it in the context of the year 800, it’ll give you a different answer focused on why the earth is flat. So there’s this really cool time awareness that you can play with in your questions to Llama 2.
(05:42):
Another major thing about Llama 2 is that it has double the context window of the original LLaMA. The original LLaMA, like most open-source LLMs released so far, has a context window of 2,048 tokens, but Llama 2 has a context window of 4,096 tokens. That’s a big jump, because it means you can feed in about sixteen pages of context now instead of just eight. So there are lots of new kinds of documents and use cases that work with Llama 2 that you couldn’t handle with the original LLaMA. As for how the chat capabilities got so powerful: the Llama 2 folks used a two-stage reinforcement learning from human feedback (RLHF) process, and this is probably the key to its outstanding generative capacity relative to all other open-source LLMs so far.
(06:36):
The first stage of this two-stage process uses something called rejection sampling, while the second stage combines that rejection sampling with something called Proximal Policy Optimization, or PPO. PPO is a widely used approach across both open-source and commercial LLMs; it’s the rejection sampling in this two-stage process that’s the new thing. I don’t have time to go into the technical details, but you can search to learn more about that on your own. A final really cool capability of Llama 2 is that it takes advantage of a new method called Ghost Attention, which allows Llama 2 to perform especially well in multi-turn conversations. These are conversations with lots of back and forth, where you want the model to remember context from earlier in the conversation — both things the model said and things you said. With Ghost Attention, for example, you can ask the model to respond only in emoji, and it’s likely to genuinely retain that instruction throughout the conversation and just provide you with emoji as outputs, which is pretty fun.
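Since I can’t go into the details on air, here’s a toy, pure-Python illustration of the rejection-sampling idea: generate several candidate responses to a prompt, score each with a reward model, and keep only the best one — the rest are “rejected.” The candidate generator and reward function below are made-up stand-ins, not Meta’s actual models:

```python
import random

def generate_candidates(prompt: str, k: int = 4) -> list[str]:
    # Stand-in for the LLM: returns k slightly different candidate completions.
    return [f"{prompt} -> candidate {i}" for i in range(k)]

def reward(response: str) -> float:
    # Stand-in for the learned reward model: here, just a random score in [0, 1).
    return random.random()

def rejection_sample(prompt: str, k: int = 4) -> str:
    """Keep only the highest-reward candidate; the others are rejected."""
    candidates = generate_candidates(prompt, k)
    return max(candidates, key=reward)

random.seed(42)  # for reproducibility of this toy example
best = rejection_sample("Explain Llama 2 in one sentence", k=4)
print(best)
```

In the real pipeline the surviving high-reward samples are then used for further fine-tuning, before PPO takes over in the second stage.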
(07:45):
For Llama 2 to be so successful, an estimated $25 million was invested just in training it. On top of that, there has been extensive safety and alignment testing — probably more extensive than for any other open-source LLM. Again, this is Meta’s self-reported data, but charts in the Llama 2 technical paper show that safety violation percentages, which you want to be low, are far lower for Llama 2 than for any other open-source LLM, and even better than ChatGPT’s. So, at least on these Meta-conducted safety tests, Llama 2 is performing at the state of the art, not only among open-source models but perhaps commercial ones as well.
(08:33):
The exception here: remember how I said there are four sizes of model in the Llama 2 family — the 7, the 13, the 34, and the 70 billion? Earlier in the episode I was saying the 7 and the 13 billion are really great because they fit on a single GPU, and the 70 billion is really great because it provides the state of the art. But what about the 34 billion parameter model? I mentioned earlier that it was not released publicly, and I think this is why: if you look at the safety charts in Meta’s own technical paper, the 34 billion parameter Llama 2 model, for whatever reason, has much higher safety violation percentages than the other Llama 2 models. Still better than other open-source options out there, but probably not safe enough for Meta to feel comfortable releasing it.
(09:27):
All right, so the conclusion is this: if you’re fine-tuning open-source LLMs for your own applications — such as with parameter-efficient low-rank adaptation (LoRA), which lets you fine-tune these LLMs very cheaply, for maybe tens or hundreds of dollars if your fine-tuning dataset is fairly small; I talk about that a lot in episode number 674 — then the time has probably come to upgrade to Llama 2. At my machine learning company Nebula.io, where we were previously using Dolly 2.0 as our starting point, we have now upgraded to Llama 2 as our base model for fine-tuning, and we’re delighted by the results.
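To see why low-rank adaptation is so cheap, here’s a quick pure-Python calculation of trainable-parameter counts for a single weight matrix. Instead of updating the full d_out × d_in matrix W, LoRA freezes W and trains two small matrices, B (d_out × r) and A (r × d_in), adding their product to W. The layer dimensions below are illustrative, not Llama 2’s actual shapes:

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA update."""
    full = d_out * d_in            # updating W directly
    lora = r * (d_out + d_in)      # updating B (d_out x r) plus A (r x d_in)
    return full, lora

# A hypothetical 4096 x 4096 attention projection with LoRA rank 8.
full, lora = lora_param_counts(4096, 4096, r=8)
print(f"full: {full:,} trainable params; LoRA r=8: {lora:,} "
      f"({100 * lora / full:.2f}% of full)")
```

Training well under one percent of the parameters per adapted layer is what brings fine-tuning costs down into the tens-to-hundreds-of-dollars range mentioned above.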
(10:13):
To get access to Llama 2, you need to fill in a form online, which I’ve linked to in the show notes; for us at least, we were approved and provided with access to Llama 2 the same day. One thing to note: when you start using Llama 2 to generate responses yourself, be sure to experiment with the temperature parameter. For creative tasks, for example, you’ll want a higher temperature than when you’re asking Llama 2 to answer factual questions.
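Here’s a minimal sketch of what the temperature parameter actually does to the model’s next-token distribution: the logits are divided by the temperature before the softmax, so a low temperature sharpens the distribution toward the most likely token (good for factual answers) and a high temperature flattens it (good for creative variety). The logits below are made-up numbers, not real model outputs:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # hypothetical next-token logits
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T=0.2 nearly all probability mass lands on the top token; at T=2.0 the three tokens are much closer to equally likely, which is where the creative variety comes from.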
(10:38):
All right, there you go. That’s your big Llama 2 update. Get building. Until next time, my friend, keep on rocking out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.