
17 minutes

Data Science, Artificial Intelligence

SDS 672: Open-source “ChatGPT”: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0

Podcast Guest: Jon Krohn

Friday Apr 21, 2023

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


Beyond ChatGPT, what other large language models can help improve a company’s digital interactions with its customers?
 

Host Jon Krohn goes through four models that compete with ChatGPT’s capabilities, including some that you can use in commercial applications. Alpaca, Vicuña, GPT4All-J and Dolly 2.0 can each be fine-tuned and run for as little as a few hundred dollars. Even better, some of the teams behind these models have released quantized versions of the model weights, meaning you could potentially run these models on a MacBook.

Listen in to hear more about the data on which these models were trained as well as their methodological differences (conversations shared by actual users of ChatGPT vs. GPT-generated natural language prompts).

(00:02): This is Five-Minute Friday on open source ChatGPT-like models you can train yourself on a single GPU.

(00:19): For last week's Five-Minute Friday, episode number 670, I introduced LLaMA, which is a set of powerful new large language models created by Meta AI. In particular, I highlighted how the 13 billion parameter LLaMA model achieves natural language generation performance comparable to GPT-3 at only a 13th of the size. So at that size, it can be trained on a machine with a single GPU. Well, this week I'm going to fill you in on four models that are even more powerful than LLaMA. So I'm going to tell you about Alpaca, Vicuña, GPT4All-J and Dolly 2.0. 

(01:01): Let's start with Alpaca. Alpaca was created by Stanford researchers and released in early March, and it builds directly on LLaMA by taking the model weights from, say, the 7 billion parameter LLaMA model and then fine-tuning them on 52,000 examples of instruction-following natural language, that is, 52,000 pairs of an instruction and the desired response. The reason this kind of fine-tuning is so powerful is that it plays a role similar to reinforcement learning from human feedback, or RLHF, which I've talked about a lot in recent episodes. It's this fine-tuning on instruction-response pairs, on instruction-following natural language, that allows models to go beyond GPT-3 and behave more like GPT-3.5 or GPT-4. Specifically, the Stanford researchers report that Alpaca is about as good as GPT-3.5, and what made GPT-3.5 so much better than GPT-3 in the first place was exactly this instruction-following behavior. And this dataset wasn't very expensive to create: they actually used GPT-3.5 itself to generate the 52,000 examples. That data generation via the GPT-3.5 API, plus fine-tuning the 7 billion parameter LLaMA architecture to handle these instructions competently, together cost under $600.
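
To make the instruction-response format concrete, here is a minimal Python sketch of what one Alpaca-style training example looks like and how it can be flattened into a single training string. The field names follow the released Alpaca dataset, but the exact prompt wording below is illustrative rather than the verbatim Stanford template.

```python
# One Alpaca-style record: an instruction, an optional input, and the desired output.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models can be fine-tuned on instruction data...",
    "output": "Instruction fine-tuning teaches a language model to follow requests.",
}

def to_training_text(ex: dict) -> str:
    """Flatten instruction, optional input, and target response into one string."""
    prompt = f"### Instruction:\n{ex['instruction']}\n\n"
    if ex.get("input"):
        prompt += f"### Input:\n{ex['input']}\n\n"
    prompt += f"### Response:\n{ex['output']}"
    return prompt

print(to_training_text(example))
```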

(02:45): So not very expensive, and the result is remarkable. According to the Stanford researchers' report, this 7 billion parameter model achieved GPT-3.5-like performance. This is amazing because at the time the Alpaca report came out in early March, before GPT-4 arrived in mid-March, GPT-3.5 was the state of the art. So Alpaca, at just 7 billion parameters, is about half the size of the 13 billion parameter LLaMA I got you excited about last week, and it performs in a GPT-3.5-like way, which is a big, big improvement over GPT-3. There was actually a free interactive demo of Alpaca that you could use online for a while, but they took it down two weeks later due to issues like hallucinations that were producing unhelpful responses, the risk of abuse, and hosting costs that skyrocketed because so many people wanted to try the demo.

(03:55): So while the interactive demo was taken down, the Stanford researchers still left up the 52,000 training data points, along with the code for generating those data points if you want to adapt it to your own needs, and code for fine-tuning the model. You can get the LLaMA model weights from a torrent. I'm not going to provide the link, but they are available: as I mentioned in last week's episode, the LLaMA weights were originally supposed to be for research purposes only, but a research group that had access made them publicly available. And the Stanford researchers provide not only the instructions for fine-tuning with their 52,000 data points, they also provide the parameter differences relative to LLaMA 7B. So you can download LLaMA 7B, apply those parameter differences, and end up with Alpaca.
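
As a rough illustration of what applying those parameter differences means, here is a minimal PyTorch sketch that assumes you have the base LLaMA 7B weights and the published diff as two state dicts with matching keys. The file names are hypothetical placeholders, and the real recovery scripts include bookkeeping this sketch omits.

```python
# Recover fine-tuned weights from a released weight diff: fine_tuned = base + diff.
import torch

base = torch.load("llama-7b-base.pth", map_location="cpu")   # original LLaMA 7B weights
diff = torch.load("alpaca-7b-diff.pth", map_location="cpu")  # published parameter differences

# Add the diff tensor to each base tensor, key by key.
recovered = {name: base[name] + diff[name] for name in base}

torch.save(recovered, "alpaca-7b-recovered.pth")
```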

(04:57): However, there's a big limitation here. Meta specifically stated that LLaMA cannot be used commercially. And remember that the Stanford researchers created their Alpaca fine-tuning data, the instruction-following natural language, using GPT-3.5, and OpenAI's API rules state that you can't use its outputs to create a model that is competitive with an OpenAI model. So for two reasons, you can't use Alpaca commercially. It's for research purposes only.

(05:31): The second model I'm going to tell you about is called Vicuña. This one also can't be used commercially because it's based on LLaMA. To explain where these names come from: llama, alpaca and vicuña are all real-life biological organisms, three different species of South American camelid, large herbivores with slender necks and long legs that all look kind of similar. So after Alpaca and LLaMA, Vicuña is this third model. Vicuña was created by researchers at UC Berkeley, Carnegie Mellon, UC San Diego, and also actually Stanford. They started again with the LLaMA model weights, both the 7 billion parameter model and the 13 billion parameter model, and separately fine-tuned each of them on 70,000 user-shared ChatGPT conversations. These are similar to the Alpaca dataset: instruction-following natural language pairs where we have an instruction and a desired response, and, similar to Alpaca, this allows Vicuña to be fine-tuned to be GPT-3.5-like.
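
To give a sense of what a user-shared conversation might look like once it is turned into training data, here is a minimal, purely illustrative Python sketch that flattens a multi-turn exchange into one fine-tuning string. The role labels and separators are assumptions rather than the exact format the Vicuña team used.

```python
# A toy user-shared conversation: alternating human and assistant turns.
conversation = [
    {"role": "human", "text": "Explain what a large language model is."},
    {"role": "assistant", "text": "A large language model is a neural network trained on lots of text..."},
    {"role": "human", "text": "Now explain it to a five-year-old."},
    {"role": "assistant", "text": "It's a computer that has read lots and lots of stories..."},
]

def flatten_conversation(turns: list[dict]) -> str:
    """Join alternating human/assistant turns into one fine-tuning example."""
    pieces = []
    for turn in turns:
        speaker = "USER" if turn["role"] == "human" else "ASSISTANT"
        pieces.append(f"{speaker}: {turn['text']}")
    return "\n".join(pieces)

print(flatten_conversation(conversation))
```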

(06:50): And these 70,000 user-shared conversations come from a repository called ShareGPT, which is publicly accessible. So the difference here is that Vicuña uses these user-generated conversations, whereas Alpaca used GPT-3.5-generated natural language prompts. What's really interesting about the Vicuña model is the way that the researchers tested it. With Alpaca, the authors simply said, we looked at a lot of different results, those results were similar to GPT-3.5, therefore it's comparable. With Vicuña, they went a step further. Vicuña was released in late March, shortly after GPT-4 came out in mid-March, and GPT-4, as I've talked about in recent episodes, is just so powerful and capable of so many remarkable things.

(07:52): And so what they did was compare the results of their own model, Vicuña, with results from other state-of-the-art ChatGPT-like models: ChatGPT itself running GPT-3.5, Google's Bard, the 13 billion parameter LLaMA, and the 13 billion parameter Alpaca. I've actually provided a link in the show notes so you can look at the comparisons yourself. As a simple example, if the prompt they provided to the large language models was "Compose an engaging travel blog post about a recent trip to Hawaii", they would ask two models to output results, their own model Vicuña as well as one of the comparison models, whether that's ChatGPT, Bard, LLaMA or Alpaca, and then they ask GPT-4 to evaluate those two outputs and provide a score out of 10 for how each model did at the task.

(09:03): And so there are lots of cool examples. For the default example on the Vicuña page we've linked in the show notes, the 13 billion parameter Alpaca outputs a short blog post, really more of a summary of a blog post, while Vicuña generates a full-length blog post. The GPT-4 evaluation gives Alpaca a seven out of 10, gives Vicuña a 10 out of 10, and explains why it assigned those scores. It's really cool. And while using GPT-4 this way isn't super scientific, I think it's a really clever way of evaluating the model, and it shows you another great use case for GPT-4. The results show unequivocally, although again not super scientifically, that Vicuña is clearly superior to LLaMA and Alpaca, and it's even competitive with Bard and with ChatGPT running GPT-3.5.
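
To see how this kind of LLM-as-judge evaluation can be wired up, here is a minimal sketch using the openai Python client; the judging prompt below is illustrative wording rather than the Vicuña team's actual evaluation rubric, and running it requires an OpenAI API key.

```python
# Ask GPT-4 to score two candidate answers to the same prompt, out of 10 each.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Return GPT-4's scores and explanation for two candidate responses."""
    judging_instructions = (
        "You are evaluating two AI assistants.\n"
        f"User prompt: {prompt}\n\n"
        f"Assistant A's answer:\n{answer_a}\n\n"
        f"Assistant B's answer:\n{answer_b}\n\n"
        "Score each answer out of 10 for helpfulness, relevance and detail, "
        "then briefly explain the scores."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judging_instructions}],
    )
    return response.choices[0].message.content

print(judge(
    "Compose an engaging travel blog post about a recent trip to Hawaii.",
    "Hawaii was nice. We saw beaches and ate food.",
    "Aloha! From the black-sand beaches of Punalu'u to sunrise over Haleakala...",
))
```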

(10:00): So Vicuña is a very powerful model, and again, it did not cost very much to train. The 7 billion parameter Vicuña model cost just $140 to train, since the data were already freely available from ShareGPT, and the 13 billion parameter model cost just $300 to train. The problem with Vicuña is that, like Alpaca, it is based on the LLaMA model weights. So while the fine-tuning data from ShareGPT are publicly available and you can use those data for any purpose, including commercial purposes, because Vicuña, like Alpaca, is built on LLaMA weights, you can't use the model commercially.

(10:41): However, just a few days ago, and I don't mean a few days ago relative to when I recorded this Five-Minute Friday episode, I mean a few days ago relative to when you could first be hearing this episode on its release date, a company called Nomic AI released a model called GPT4All-J. Nomic AI describes itself as an information cartography company, and it's super small: I could only find five Nomic AI employees on LinkedIn. In this case, they did not start from the LLaMA model weights. They started from the open-source, Apache-licensed GPT-J model, a 6 billion parameter model created by EleutherAI, a nonprofit AI research group, to be freely and permissively available, including for commercial use. That starting point would have given approximately GPT-3-like outputs, comparable to the LLaMA architecture I talked about last week. Then, in order to fine-tune it to be more GPT-3.5-like or GPT-4-like in its capacity to anticipate what humans are looking for when they provide instructions, they used an 800,000-example instruction-and-response dataset that they curated themselves.

(12:14): For comparison, Alpaca used just 52,000 examples and Vicuña used 70,000 examples, so this is way more data, an order of magnitude more instruction-response data for fine-tuning. They detail all of the various sources they used in their technical report, which I've provided in the show notes, but the key point is that it's 800,000 of these instruction-response natural language pairs, and all of the data sources they used were publicly available. This means that the entirety of GPT4All-J, the original model weights as well as the data used for fine-tuning, is publicly available, open source.

(13:04): And so Nomic AI also made GPT4All-J available under a commercial-use Apache license. In addition to that license and the model weights, they provide their data curation procedure, their training code, and a quantized 4-bit version of the model. Quantized means that the weights are stored in a smaller data type so that the model takes up much less space, and this allows GPT4All-J to fit on a good laptop, for example an M1 MacBook. So while you probably can't train it at any reasonable speed on a laptop, you can at least use this quantized 4-bit version of GPT4All-J for inference, for generating results and conversing with it in a ChatGPT-like way on a standard modern laptop. That's really cool.
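
For intuition about what 4-bit quantization actually does to the weights, here is a toy numpy sketch that maps float weights onto 16 signed integer levels plus a single scale factor. Real schemes used by laptop inference runtimes quantize in blocks and are more sophisticated, so treat this only as an illustration of the principle.

```python
# Quantize float32 weights to 4-bit signed codes (-8..7) with one scale, then dequantize.
import numpy as np

weights = np.random.randn(8).astype(np.float32)        # stand-in for model weights

scale = np.abs(weights).max() / 7                       # map the largest weight to code 7
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale      # approximate weights used at inference

print("original:   ", np.round(weights, 3))
print("4-bit codes:", quantized)
print("recovered:  ", np.round(dequantized, 3))
print("max error:  ", np.abs(weights - dequantized).max())
```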

(14:06): They say it cost about $6,000 to train this GPT4All-J model. And because it's brand new, just released a few days ago, it hasn't been evaluated as thoroughly as Vicuña, but it appears to perform comparably to the LLaMA and Alpaca architectures I've already talked about. I couldn't find any comparison against Vicuña, so it's possible that Vicuña outperforms it, just as Vicuña outperforms LLaMA and Alpaca, but GPT4All-J is available for commercial use.

(14:39): Another architecture, released just a couple of days before GPT4All-J, that deserves honorable mention is Dolly 2.0. It was released by the data analytics giant Databricks, and like GPT4All-J it is based on a model from EleutherAI, though a bit bigger: GPT4All-J started from the 6 billion parameter GPT-J, while Dolly 2.0 started from a 12 billion parameter EleutherAI model, again completely open source. Dolly 2.0 is fine-tuned on 15,000 human-generated instruction-response pairs created by Databricks employees, and those 15,000 examples are also publicly available, open source from Databricks. So with Dolly 2.0, like GPT4All-J, commercial use is okay. However, again, because it's brand new, there was little evaluation data I could find at the time of recording this episode, but it provides you with another exciting option to explore.
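
For anyone who wants to try Dolly 2.0 directly, here is a minimal inference sketch that roughly follows the usage shown on the model's Hugging Face card, assuming the transformers and accelerate libraries are installed. The full 12 billion parameter checkpoint needs a sizeable GPU, and the smaller databricks/dolly-v2-3b variant is a friendlier starting point for experimentation.

```python
# Load Dolly 2.0 from the Hugging Face Hub and generate a response to an instruction.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-12b",   # swap for "databricks/dolly-v2-3b" on modest hardware
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,            # Dolly ships a custom instruction-following pipeline
    device_map="auto",
)

print(generate_text("Explain, in two sentences, why instruction fine-tuning matters."))
```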

(15:51): All right, so here's why this should be so exciting. In this Five-Minute Friday episode, which ended up being much longer than five minutes because I had a lot to tell you, I've covered open-source ChatGPT-like large language models that you can fine-tune yourself on a single GPU, say using your own proprietary data. For just a few hundred or a few thousand dollars, you can train and then run in production your own ChatGPT-like natural language generation model that handles proprietary use cases for your users. This is huge. What a terrifically exciting time to be a data scientist.

(16:36): All right, that's it for today. Until next time, keep on rocking it out there folks, and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon. 
