SDS 713: Llama 2, Toolformer and BLOOM: Open-Source LLMs with Meta’s Dr. Thomas Scialom

Podcast Guest: Thomas Scialom

September 12, 2023

In this episode, Meta’s AI Research Scientist Thomas Scialom gives us behind-the-scenes insights into developing Llama 2 and what’s in the works for Llama 3. With host Jon Krohn, he discusses the future of Artificial General Intelligence, why the Galactica science-focused LLM was taken down, and what he learned from it.

Thanks to our Sponsors:
Interested in sponsoring a Super Data Science Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Thomas Scialom
Dr. Thomas Scialom is an A.I. Research Scientist at Meta. He is behind some of the world’s best-known generative A.I. projects, including Llama 2, Toolformer, Code Llama, Galactica, BLOOM, Unnatural Instructions and more. Thomas has lectured at many of the top A.I. labs (e.g., Google, Stanford, MILA) and is contributing to the development of Artificial General Intelligence (AGI). He holds a PhD from Sorbonne University, where he specialized in natural language generation with reinforcement learning.
Overview
As a user of Llama 2, Jon has a great deal to ask Thomas in this week’s episode. Thomas attributes the degree of innovation in Llama 2 to open sourcing, which also helped bring about shorter cycles between developments. Jon spoke to Thomas about the team’s focus on ethics and responsible use: they simulated malicious actors during the training process to help add a safety metric to the model. Currently, the team is scaling the models by training them on more tokens, and teaching them to access the web and to handle tools.
Thomas already addressed the latter with Toolformer, a model developed to decide which API to call in a query and how to incorporate the results into predicting successive tokens. Toolformer manages this because Thomas and his team gave the model a set of tools and taught it to use them based on context. This approach dramatically extends a model’s capabilities, and companies have been able to use the Toolformer approach to integrate into their platforms a natural-language chat that can query APIs.
Thomas then addresses the costs of the human labor needed to curate a dataset for LLMs. Because users plugged their interesting conversations into ShareGPT, teams like the one at Berkeley could use those datasets to fine-tune the Vicuna LLM, which performs remarkably well compared to other open-source models. Thomas hopes that open-source models can continue using crowd-sourced data to reduce up-front training costs.
Thomas also takes a measured view of entrepreneurship and AI. The difficulty lies not in starting a company, as there are so many possibilities, but in ensuring it stands the test of time: many startups go bust within a year because they are unable to keep up with industry changes.
Listen to this episode to hear more about why Thomas believes it is still a great time to start creating and innovating, what the future holds for Artificial General Intelligence, and what happened to Galactica, the language model for science, and what can be learned from its demise.

In this episode you will learn:
  • Llama 2: Behind the Scenes of Today’s Top Open-Source LLM [05:04]
  • Responsible use of Llama 2 [15:26]
  • Toolformer: LLM That Learns How to Use External Tools [24:57]
  • Galactica: The Science-Specific LLM and Why It Was Brought Down [36:57]
  • Is AGI Around the Corner? [57:03]
  • Advice for AI entrepreneurs [1:05:46]
  • How Thomas develops and manages large-scale AI projects [1:14:42]
Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 713 with Dr. Thomas Scialom, A.I. Research Scientist at Meta. Today’s episode is brought to you by AWS Cloud Computing Services, by Grafbase, the unified data layer, and by Modelbit for deploying models in seconds. 
00:00:21
Welcome to the Super Data Science podcast, the most listened-to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.
00:00:52
Welcome back to the Super Data Science Podcast. Today we’ve got the trailblazing AI researcher Dr. Thomas Scialom on the show. Thomas is an AI research scientist at Meta. He’s behind some of the world’s best-known generative AI projects, including Llama 2, BLOOM, Toolformer, and Galactica. He’s contributing to the development of Artificial General Intelligence, AGI. He’s lectured at many of the top AI labs such as Google, Stanford, and MILA in Montreal. He holds a PhD from Sorbonne University in France where he specialized in natural language generation with reinforcement learning.
00:01:25
Today’s episode should be equally appealing to hands-on machine learning practitioners as well as folks who may not be hands-on but are nevertheless keen to understand the state-of-the-art in AI from someone who is right on the cutting edge of it all. In this episode, Thomas details Llama 2, today’s top open-source LLM, including what it was like behind the scenes developing it and what we can expect from the eventual Llama 3 and related open-source projects. He talks about the Toolformer LLM that learns how to use external tools, and about the Galactica science-specific LLM, why it was brought down after just a few days, and how it might eventually reemerge in a new form. He talks about RLHF, reinforcement learning from human feedback, which shifts the distribution of generative AI outputs from approximating the average of human responses to approximating excellent, often superhuman, quality. He talks about when and how he thinks AGI, artificial general intelligence, will be realized, and how to make the most of the generative AI boom as an entrepreneur. All right, you ready for this tremendous episode? Let’s go.
00:02:35
Thomas, welcome to the Super Data Science Podcast. It blows my mind that you’re here on the show, that we get to have you here. I’m so excited for this interview. Where in the world are you calling in from today? 
Thomas Scialom: 00:02:47
From Paris. 
Jon Krohn: 00:02:49
Nice. It’s been a while since I’ve been to Paris, but I’ve never had a bad time there. 
Thomas Scialom: 00:02:55
Yeah, neither. 
Jon Krohn: 00:02:59
Nice. So, we know each other, I’d say, almost serendipitously. I did an episode a couple of weeks ago on Llama 2, so episode 702. It’s like a 15-minute, maybe 20-minute episode with just me describing, from my understanding, all the new capabilities with Llama 2 and how the model came about a little bit. And as I was opening up the technical paper, there’s, I don’t know how many, there’s probably like 50 authors, and they’re in this big long list, listed vertically on the side of the technical paper page. But somehow my brain noticed that I recognized one of them. I was like, “Anthony Hartshorn. I know Anthony Hartshorn. There can’t be two people named Anthony Hartshorn.” And so I sent him a message and I said, “Do you want to be on my podcast? We’re the most listened-to podcast in the data science industry.” And he suggested you as the guest instead, Thomas, which is amazing because you’re the final author on the paper. To a normal listener it might sound like being the final author of 50 people means being the person who made the smallest possible contribution, but in fact, on academic papers, that isn’t how it works.
00:04:26
So, very often the first author is the person who actually wrote everything and put it all together, but traditionally in academic work, the last author will be the head of the lab who brought in the funding and was kind of overseeing the project. So, truly it’s an honor to have you here, Thomas.
Thomas Scialom: 00:04:48
Thanks for having me.
Jon Krohn: 00:04:49
So, at the time of recording this episode, it’s only been a few weeks since Meta released the open-source large language model Llama 2. You were a science and engineering leader for this groundbreaking development. Can you explain the significance of Llama 2 in the context of other recent advancements in AI and generative models? Maybe fill us in on how the Llama projects came about in general; Meta was like, you know what, we’re going to invest, and obviously you’re not going to divulge on air, but there are rumors that eight-figure sums have been invested in creating Llama 2. So, it’s interesting, even from the very beginning, what was it like to get this kind of buy-in from the organization to be doing this open sourcing?
Thomas Scialom: 00:05:39
Yeah, so no doubt large language models are a big deal. They have made some breakthroughs in the research. We also had a ChatGPT moment at the end of last year, and most people realized the potential of this technology. And so I think we did mainly two things with Llama 2. One, we aligned the model, with techniques called RLHF, for instance. I can dig more in depth later if you want, but basically the idea is you have what we call a pre-trained model, which kind of reads the internet doing next-token prediction. So, it tries to predict the next token, and this is what we call self-supervision. It’s supervision because we have a target, but it’s “self” because texts on the web are vastly accessible like that. And so just with that you have a pre-trained language model, which we had with LLaMA 1, and we did it again with Llama 2 and extended it a bit incrementally.
00:06:46
And that’s where all the knowledge is learned, where all the capabilities kind of emerge, but then it’s hard to access. And the magic behind ChatGPT is its kind of interface as a chat, which is very natural, and the way it follows your instructions when you say, “Oh, but talk like this person, or do these kinds of things. Or make it more like markdown or bullet points, or change that, or make it shorter.” And it understands your instructions and does it precisely. And this happens at fine-tuning; it’s kind of refining, educating, a pre-trained large language model, which we did also with Llama 2. And that was one of the main innovations, because no one had done that at this scale, open-sourced the model, and explained all the research behind it in a research paper as we did. So, before Llama 2, basically the only large language models available online, like OpenAI’s, Anthropic’s, Google’s with Bard, were closed behind an API. So, I would say that’s the main innovation in terms of science and in terms of impact for the communities, the research communities, the businesses. I think you mentioned, and you’re not the only one, your company now uses Llama 2. This is also possible because we also changed the license to something commercial, user-friendly for commercial applications.
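To make the self-supervision idea concrete, here is a minimal sketch of the next-token prediction objective Thomas describes. It assumes a Hugging Face-style causal language model whose forward pass returns `.logits`; the function name is illustrative:

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Self-supervised objective: the targets are simply the same
    token sequence shifted by one position.
    token_ids: LongTensor of shape [batch, seq_len]."""
    logits = model(token_ids).logits          # [batch, seq_len, vocab]
    preds = logits[:, :-1].transpose(1, 2)    # scores for positions 1..n
    targets = token_ids[:, 1:]                # the "next" tokens
    return F.cross_entropy(preds, targets)
```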
Jon Krohn: 00:08:23
Yeah, exactly. I don’t have 700 million users at my machine learning company yet. So, this commercial license says that, as long as you don’t have more than 700 million active users, it’s okay to use Llama 2. For us, it’s brilliant. Previously, we had been using a different base model; we have a number of different kinds of generative AI capabilities in our platform for our users. And so something like LLaMA 1, which was pre-trained but not fine-tuned, would actually have been fine for us as a starting point, except for the commercial use limitations. So, we never could use the original LLaMA in production because obviously there was this commercial use restriction; it was for academic purposes only. And that also meant that we couldn’t use some of the initial fine-tuned architectures that came off the back of LLaMA, like Alpaca out of Stanford and like Vicuna, which Joey Gonzalez, who was in episode number 707 of this show, developed at Berkeley.
00:09:41
And so all of those, that whole family of models, we were like, “Oh man, we’re going to be left out.” But then luckily some groups did come along. So, Databricks released Dolly 2.0, for example, and there were some others, and I’ve done episodes on these open-source alternatives that are commercially licensable. So, in episode 672, I talk about different open-source options that are available where you not only have that pre-training with the self-supervision that you were describing, but also the fine-tuning based on human feedback. That means that the responses are going to be deliberately helpful and more conversational, like a chat.
00:10:26
So, we had been using Dolly 2.0 from Databricks as our starting point for the last couple of months. When Llama 2 came out, there was something… The scale, you described this already, the unprecedented scale in terms of the number of tokens, two trillion tokens for pre-training and over a million data points for the fine-tuning; this kind of scale is orders of magnitude more. Dolly 2.0, for comparison, had 10,000 instructions that it was fine-tuned on. You’re talking a hundred times more. And with these large language models, the scaling laws that we’ve seen come out, like the Chinchilla scaling laws, have shown that you kind of have three levers to getting a great model: the number of parameters, the training dataset size and the training time. And it seems like with Llama 2, you and your team have tried to max out all of those things, especially with the 70-billion-parameter Llama 2 model.
00:11:33
So, that’s, I guess, something that’s also worth noting: if people haven’t listened to my Llama 2 episode already, you may not be aware that it isn’t just one model that was released here. We’re talking about a model family. So, there’s a 7-billion, a 13-billion and a 70-billion-parameter model. And those two smaller ones will be able to fit on a single GPU, and so this means that you can run them relatively inexpensively. And so with applications like my company’s, where we have a relatively discrete number of generative tasks that we need the model to perform, we can take that 7 billion or that 13 billion and we can fine-tune it to our tasks. And so for listeners who aren’t aware, you can do this yourself at home using a parameter-efficient fine-tuning technique like LoRA, low-rank adaptation, which I talk about in episode number 674.
00:12:25
So, you can take a model like Llama 2, the 7 billion or 13 billion, and typically very inexpensively, for tens of dollars or hundreds of dollars, you can fine-tune it to your own specific tasks. And for us, that’s perfect. It means we now have this amazing large language model that is as good as GPT-4 or better in our own tests when we start with Llama 2 and fine-tune with our own data on the narrow range of tasks that we have. And then if you’re a listener out there and you’re like, “Well, I want the absolute state-of-the-art,” then you can use Llama 2. At least in terms of open source, this is going to be the state-of-the-art. So, I’ve just talked a lot. But the point is that, Thomas, what you’ve done and what this means for us as a community to have access to something like Llama 2, it’s a game changer. It was obvious that it was a game changer within minutes of starting to read the Llama 2 materials online, and the data science team at my company immediately started retraining our models with Llama 2.
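For listeners who want to try the kind of LoRA fine-tuning Jon describes, here is a minimal sketch assuming the Hugging Face transformers and peft libraries; the model repo is gated behind Meta’s license, and the hyperparameters shown are illustrative rather than recommended values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated repo: accept the license first
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Attach low-rank adapters to the attention projections; only the
# small adapter matrices are trained, which keeps fine-tuning cheap.
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of weights
```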
Thomas Scialom: 00:13:34
It’s always good to hear. Thanks. Maybe it’s worth mentioning what we released as well: the context length was extended from 2,000 to 4,000 tokens, et cetera. It’s text-only for now. But I think that’s also the magic of open sourcing. We don’t have to push on every axis ourselves, because as a community we will deal with that easily. And we know that extending the context length with fine-tuning is possible. We know that connecting multi-modal inputs is straightforward. And what was magic is, after the release, within a week people had done that efficiently. And so that’s also one of the strengths, in my opinion, of open-sourcing these kinds of models. And we see much more innovation, with shorter cycles of innovation, thanks to that. So, that was one of the philosophies. So, we went, as you said, all in on the scale of the things that we can do at Meta to make it as good as we can, so that everyone could use it in the end and adapt it for their use cases.
Jon Krohn: 00:14:45
Amazing.
00:15:21
Are you stuck between optimizing latency and lowering your inference costs as you build your generative AI applications? Find out why more ML developers are moving toward AWS Trainium and Inferentia to build and serve their Large Language Models. You can save up to 50% on training costs with AWS Trainium chips and up to 40% on inference costs with AWS Inferentia chips. Trainium and Inferentia will help you achieve higher performance, lower costs, and be more sustainable. Check out the links in the show notes to learn more. All right, now back to our show. 
00:15:25
And another thing that you did with Llama 2 is that there was extensive thought around ethics, responsible use, acceptable use. So, for example, there were red-teaming exercises where you simulate internally that you have these malicious actors. Can you dive into why this was so important? I think this was unprecedented also. So, not only was the amount of data for both the pre-training and the fine-tuning steps unprecedented, but for an open-source model, I think the level of care that went into the ethics and responsible use is also unprecedented.
Thomas Scialom: 00:16:14
So, yes, maybe let’s give a bit of context. The strongest LLMs so far were, as we said, accessible only through an API. I think that was problematic in several aspects. It limited research; it prevented academia from exploring and industry from building commercial use cases. And to be honest, we would be nowhere without open sourcing. Think about BERT, Transformers and even GPT-1. That being said, the present and future risks with respect to LLMs have been extensively discussed by some of the researchers. I think OpenAI and Anthropic did an extremely great, important, invaluable job at raising the bar for safety. And I’m glad they did. The thing is, when you have an API like them, it’s easy to control: you can put classifiers on top of that, you restrict the access somehow.
00:17:15
There’s clearly a very hard challenge when it comes to open source, because you release the weights and you enable everyone to fine-tune, to do whatever, to control the models. So, while I feel it is very important to do it, and I think we’re not yet at a stage where LLMs are so dangerous that we should not do it, it was important to do it in a responsible way, to raise the bar even higher than what has been done for competitor models behind an API, because the risks are bigger when you open-source. And so we took a lot of inspiration from the work that was done at those companies, at OpenAI, Anthropic, and we applied all the methods we could, and some new methods we discussed in the paper, to make the model as safe as we could. It’s not perfect. There are still some jailbreaks, but maybe we can discuss that later. I feel we had two main complaints that followed the release. And one of them was: it’s too safe. And there’s an example where, for instance, I don’t remember exactly, it was “how can you kill a process” or something like that, and the model says, “No, it’s not good to kill.”
Jon Krohn: 00:18:29
Right, right, right. 
Thomas Scialom: 00:18:30
So, I mean, well, there was a system prompt on top of it. If you remove it, the model is actually better. But to me this was a success, in that this was the first time we released an open-source model at this scale, and so we had the responsibility to raise the bar for safety and responsibility. So, because it was unprecedented, I prefer to be on the side where it’s too safe and progressively decrease the level of safety if needed for future releases, than the opposite.
Jon Krohn: 00:19:08
And actually your discussion of that reminds me that when I was doing my research for my solo episode about Llama 2, episode number 702, and digging into your technical paper, it actually talks about four models. So, three models that were released were the 7-billion, 13-billion and 70-billion-parameter models. And then, off the top of my head, I think it was 34 billion, was another model that you trained. But I noticed that, for whatever reason, there was a chart with some metric of safety.
Thomas Scialom: 00:19:47
Absolutely. 
Jon Krohn: 00:19:48
And that model, for some reason, the 34-billion one, seemed to be more like the existing open-source LLMs in terms of safety. So, it was kind of more like Falcon or more like Dolly 2.0. And so it seems like you’ve held back a model, I’m guessing, and you don’t need to confirm on air, because it didn’t meet the safety standards of the other three, which is an interesting thing to have happen, because presumably the same process was followed for all of them.
Thomas Scialom: 00:20:23
Yeah, what you said is absolutely correct. And that’s one of the main reasons we didn’t release it. One thing also: it’s probably something we don’t know; we didn’t have the time to investigate. What people have to understand is that the whole process, starting from the pre-trained model, to fine-tuning it, to applying RLHF, reinforcement learning, to then evaluating it automatically, then evaluating it in detail with humans and with red teamers, who are experts at finding the failures, trying to make the model say something bad, pushing the model in the hardest possible ways to make it say some stuff; all this process takes a lot of time. And so we just decided, based on this bad data point, which we don’t yet understand because we didn’t have the time to investigate, maybe it’s an error in the evaluation, maybe it’s a model that was not well fine-tuned, I don’t know exactly yet. But we just said, “Okay, why waste one, two, three more weeks just for that? We can already release the smaller models and the biggest, most capable model. Let’s not wait to let everyone use them.”
Jon Krohn: 00:21:34
That makes perfect sense. And so it’s kind of nice to have that confirmed, because that’s actually what I speculated here earlier. So, great. So, you mentioned that there were two main complaints. One of them was that it was too safe, so people were complaining that Llama 2 is too safe. So, things like somebody saying, “I want to kill this process,” lead to it saying, “I can’t kill, killing is bad.” What was the other big complaint that people have had since the release?
Thomas Scialom: 00:22:00
Tell me if you heard the same, but from my perspective it was safety, too safe, and code. Bad code abilities. 
Jon Krohn: 00:22:07
Oh yeah. So, I do say that in my episode 702 as well. When I say that Llama 2 performs at the state-of-the-art relative to any other open-source model, that’s on natural language tasks, where it’s natural language in and out. My understanding, and I haven’t tested this extensively myself, is that where there’s code being generated, or where you’re asking it to do kind of mathematical problems, it doesn’t perform as well as some other options out there.
Thomas Scialom: 00:22:44
Yeah. So, to get to that, we actually rushed so fast from LLaMA 1 to Llama 2 to get visibility. We focused mainly on natural language and not code. I agree the model is not that good at code or mathematics for now, but we are working on that, and, well, by the time the podcast is released, I hope that some Code Llama will also be released.
Jon Krohn: 00:23:13
Oh, very cool. All right, that’s awesome. That’s exciting to hear. So, I mean, that kind of gives us a sense; it’s a really tantalizing glimpse. It’s possible that by the time this episode is out, that will be old news. But yeah, a Code Llama, that sounds very cool. Is there anything else that you can tell us about where this research might be going? I understand; I don’t want to be extracting information under duress.
Thomas Scialom: 00:23:47
I mean, we are the open guys. In general there’s no clear secret. We’ll try to improve the models in general, which means scaling them, keep training them on more tokens, increasing the abilities, maybe tackling more multimodality, code, which we just discussed, reasoning. We will also try to improve the RLHF stage and the capabilities. One of the directions is obviously tools: teaching these models to use, in a zero-shot fashion, some tools, maybe to access the web more easily. All those directions seem quite reasonable and expected, so there’s no big secret. Now the question is more like how we will do that. Will we make some breakthrough discovery along the way that will enable us to largely improve? Hopefully, yes.
Jon Krohn: 00:24:50
Nice. Yeah. And you mentioned there being able to handle tools, which is something that you have a lot of experience with, because you’ve also been involved with the Toolformer LLM. So, this is an LLM that came out earlier, and Toolformer is specialized to decide which API to call in a given circumstance, when to call the API and what arguments to pass, and how best to incorporate the results into the next-token prediction of the generative model. So, maybe this is a good time to switch over and talk about this Toolformer project, since it sounds like future Llama iterations might incorporate some of that kind of capability.
Thomas Scialom: 00:25:38
Yeah. Toolformer, connecting large language models with tools, was an idea I had last summer, a year ago. It felt like a natural extension of all these models, Retro, Atlas, RAG, where you augment a language model with a retriever, and the intuition is very easy. So, the idea there was to train a dense retriever and a language model together so that you augment the context. And so when you ask a question, you search over all the training data for some relevant passages. And so if the model didn’t remember, didn’t memorize well, this boosts the capabilities, which was very effective, as shown in all those papers. This is what we call a non-parametric framework, because you rely not only on the parameters, the weights of the model, but also on an external source of knowledge that could possibly grow over time to, for instance, incorporate fresh information without necessarily retraining the model.
00:26:43
That being said, my idea was to extend this to a general non-parametric framework where, and there was some work at the time doing that, you could see how to use a calculator, or a Python executor, or different search engines, maybe Google for some searches and Google Scholar for specific searches on papers. And so the idea was to just give a set of tools to the model and, in a much more human-like way, teach it to use them given the context, not at each [inaudible]. So the model now has to know when to use the tool and how to use it to benefit from this performance. And so Toolformer, Timo Schick led this work, and we published it in February, and I think it was also very pleasant timing. It was two months after ChatGPT, and everyone was kind of, “Well, the game is over, ChatGPT, AGI is there, what’s next?”
00:27:48
But ChatGPT at the time was just limited to a window; you’re chatting with an agent that has no access to the world. And that changed a lot of the perception, because once you can give the LLM access to the world, to some knowledge, it makes the experience for the user completely different. It extends the capabilities dramatically. And so that’s what we have done with Toolformer, with some self-supervised techniques. So, the model basically learned it itself: a tool call is kept when using it decreases the perplexity of what follows. So, that was the main idea.
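Here is a minimal sketch of that self-supervised filtering idea, assuming a Hugging Face-style causal LM; the function names, the bracketed call format and the threshold value are illustrative, not the exact recipe from the Toolformer paper:

```python
import torch
import torch.nn.functional as F

TAU = 0.2  # improvement threshold for keeping a call (hypothetical value)

def continuation_loss(model, tokenizer, context, continuation):
    """Mean cross-entropy of `continuation` given `context`."""
    ctx = tokenizer(context, return_tensors="pt").input_ids
    cont = tokenizer(continuation, add_special_tokens=False,
                     return_tensors="pt").input_ids
    ids = torch.cat([ctx, cont], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Only score the continuation: position i predicts token i + 1.
    cont_logits = logits[0, ctx.shape[1] - 1 : -1]
    return F.cross_entropy(cont_logits, cont[0]).item()

def keep_api_call(model, tokenizer, prefix, call, result, continuation):
    """Keep an inserted API call only if conditioning on its result
    makes the following text measurably easier to predict."""
    without = continuation_loss(model, tokenizer, prefix, continuation)
    with_tool = continuation_loss(
        model, tokenizer, prefix + f" [{call} -> {result}]", continuation)
    return without - with_tool > TAU
```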
Jon Krohn: 00:28:26
And so this may be familiar in an analogous way, and you can tell me where maybe the analogy breaks down, because I haven’t used Toolformer myself yet, but it seems to me to be similar to what later happened with ChatGPT with the plugins. So now, with ChatGPT, you can turn on third-party plugins. So, if you turn on the WolframAlpha plugin, then when you ask ChatGPT to do a calculus problem, it’s going to bring in WolframAlpha, using that API as opposed to trying to use next-token prediction to do math. That works surprisingly well in a lot of circumstances; it’s mind-boggling that next-token prediction can often do math correctly.
00:29:21
But you’re basically guaranteed a correct answer, a correct differentiation, for example, if you use WolframAlpha to do it. So, ChatGPT will automatically detect, okay, this is a circumstance where I should be using WolframAlpha, let’s do some math with that. Or it can access the web, like you said, it can do a web search, or it can plug into websites like Kayak to book you a trip and find you the car rental and book the hotel. So, is that the analogous use case for Toolformer? Toolformer is obviously open-source.
Thomas Scialom: 00:30:01
Yeah, I mean, I think the idea was there. I saw a lot on Twitter when, one month after Toolformer, OpenAI released plugins. They actually cite Toolformer on the plugin page, and some people said, “OpenAI reimplemented Toolformer in one month.” Honestly and humbly, I think the idea was in the air and we had good timing [inaudible]. I think also the method used by OpenAI was quite different from Toolformer’s. So, that’s quite interesting. In Toolformer, the idea was, well, we had access to bad… I mean, at the time, language models that were bad compared to GPT-3, at least. It was before LLaMA. And so what we did is this self-supervised method, which works kind of well. But my conclusion at the end of the work was that we need a more capable base model, a fine-tuned, aligned model, such that it learns to use tools with some instruction-following scheme, which is also why I stepped back from Toolformer at the time and did not extend the project, to work on Llama 2 instead and make it work with instruction tuning, following the instructions of the users.
00:31:16
And actually, there is one paragraph at the end, in the discussion analysis of the paper, showing a kind of emergence of tool use, where just with a prompt you tell the model, basically in natural language: you can use a calculator, use this format; for the API, use a search engine, use this format. I don’t remember which example it was in the paper, but something like, what’s the difference in height between the Eiffel Tower and the Empire State Building? And then it naturally says: step one, search height of the Empire State Building; search height of the Eiffel Tower; and then calculate the difference between the two. So, you can see how from Toolformer, where there’s the idea of using the tools, and the method is pretty efficient but, I would say, already obsolete with a better aligned model, we moved to Llama 2, to now maybe come back to Toolformer.
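An illustrative version of the kind of zero-shot tool-use prompt Thomas describes might look like the following; the tool syntax and the numbers in the comments are hypothetical, not the exact format or example from the paper:

```python
# A hypothetical prompt that describes the tools in natural language
# and lets an aligned model decide when to call them.
PROMPT = """You can call tools by writing:
[SEARCH(query)] to query a web search engine
[CALC(expression)] to use a calculator

Question: What is the difference in height between the Eiffel Tower
and the Empire State Building?
Answer:"""

# A capable instruction-tuned model might respond step by step:
#   Step 1: [SEARCH(height of the Empire State Building)]
#   Step 2: [SEARCH(height of the Eiffel Tower)]
#   Step 3: [CALC(443 - 330)] -> a difference of roughly 113 metres
```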
Jon Krohn: 00:32:08
Right, right, right, right. Makes perfect sense. 
00:32:52
This episode is brought to you by Grafbase. Grafbase is the easiest way to unify, extend and cache all your data sources via a single GraphQL API deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that but the Grafbase command-line interface lets you build locally, and when deployed, each Git branch automatically creates a preview deployment API for easy testing and collaboration. That sure sounds great to me. Check Grafbase out yourself by signing up for a free account at grafbase.com, that’s g-r-a-f-b-a-s-e dot com.
00:32:56
It’s exciting how these different research threads converge, and it sounds like you had that vision all along: “Okay, cool. Toolformer works really well, but it could be better if the base model that was calling it was better. So, let’s focus on this Llama 2 project for a while and then come back and worry about this API calling from Llama 2 later on.” Very cool. Looking forward to that. And similarly, again with my own machine learning company, that kind of ability, having these really powerful models like Llama 2 with open-source API-calling abilities built in, is huge for us as well, because it means that there are all kinds of cool things that we can do internally.
00:33:49
Like a lot of companies, we use APIs, these kinds of microservices, to make it easy to have different compartmentalized services within the platform. And so with something like Toolformer, our users could provide natural language instructions, just have a natural language chat with our platform, and behind the scenes the large language model can say, “Okay, I think they’re asking for this particular kind of data, or this particular kind of task to be done, and we have an API for that, so let’s go use it.” And then the results are returned in exactly the kind of format, like a JSON format, that our platform was expecting. It can make the API call successfully, it can return the information from that call and present it to the user. It’s a very cool thing to be able to do.
00:34:54
I mean, given the level of concern that went into making sure that Llama 2 is used ethically, something like Toolformer maybe ties into even AGI concerns, because people say, “Oh, AGI won’t be that dangerous because it’s not going to be connected to the world.” But that’s obviously not true, because with projects like Toolformer, we see that, no, they will be connected to the world. In my company we’re using something like Toolformer to be able to query software APIs and get information back, but there’s no reason why those couldn’t be connected to hardware, why these couldn’t impact the real world. So, I just wonder if you have any thoughts on that, and maybe we can have a bigger AGI discussion later in the episode.
Thomas Scialom: 00:35:51
Sure. But no, maybe [inaudible]. I think those are very good points, and actually we take safety for the tool direction very seriously. That makes the thing quite different from a kind of closed LLM in a window, just a chat demo. There are real risks of another order of magnitude. So, for sure there are new concerns, new research questions and problems on the way, and that makes it very serious.
Jon Krohn: 00:36:25
Nice. Okay, well, yeah, that’s a clear answer. 
Thomas Scialom: 00:36:29
There’s a survey on augmented large language models we published also in February, just after Toolformer. We have a section at the end of that about augmented language models, augmentation with tools, where a model can now take an action in the world. This is a different story than before.
Jon Krohn: 00:36:49
Nice. Yeah, yeah, no doubt. So, in addition to Toolformer, another LLM project that you were working on before Llama 2 was Galactica. Galactica was a large language model specifically designed for handling academic research, scientific papers and these kinds of scientific questions. The Galactica model was only live for a few days, I guess. It seemed like a really big deal, and then it was taken offline. So, maybe tell us a bit about the project, the thinking behind bringing it down, and whether it will be back in the future.
Thomas Scialom: 00:37:34
Yeah. So, there’s this website which is one of the best known for researchers, called Papers With Code, a company that was acquired by Meta. And the project of the team, which was kind of visionary, was a large language model. They wanted a large language model for science that would help us access information for science, help us with creative writing for science, maybe connect different ideas in science, find some papers that you would never find on Google Scholar just based on the idea. That’s what large language models are capable of, and that’s what Galactica was about. And actually it was one of the first open large language models that worked pretty well. In some aspects it was far ahead of its time, and in some aspects we probably made some mistakes along the way. It was only a pre-trained model, not an instructed model. And so maybe we presented it way too much as something that can answer questions and do things; it would have worked so well after an instruction-tuning phase.
00:38:55
The second thing we probably did not do well was to overclaim a bit on the webpage, saying it can write a paper, and I can understand how, for a scientific person working in science, this would feel like overclaiming. That was not our purpose, but anyway, because of all the noise, and there was quite some noise at the time on Twitter and such, we decided to remove it. It was also a weird time, because at that time there were a lot of people still criticizing large language models who were quite noisy on Twitter. And on top of that, some people from the scientific community said large language models are dangerous for science, et cetera. And it was just two weeks before the ChatGPT release, so that was interesting timing. I think, for instance, people don’t realize how good it was at citations.
00:40:03
I fine-tuned it myself, to give you an example of following instructions, and when you say, “Cite a paper about bias,” it will find the papers. For instance, to give you an example that maybe speaks more: Chinchilla, the scaling laws paper. I think either “Chinchilla” or “scaling laws” doesn’t appear in the title of the paper, one or the other, I don’t remember. And so just asking, “What’s the citation for Chinchilla?”, which is not in the title, it would find it, write it, and you could just click and add it to a [inaudible] when you are writing a scientific paper. So, it was connecting things like that. And from the tests we did, it was outperforming some of the Scholar or Elasticsearch search engines. And I think LLMs as search engines have not yet been well explored, but that’s something big.
Jon Krohn: 00:41:02
For sure. 
00:41:05
Deploying machine learning models into production doesn’t need to require hours of engineering effort or complex homegrown solutions. In fact, data scientists may now not need engineering help at all. With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy() in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com. That’s M-O-D-E-L-B-I-T.com.
00:41:43
And it’s interesting how well Galactica was doing at accurately generating citations, when something like ChatGPT, especially with the GPT-3.5 API running in the backend, was famously creating citations that sound plausible but aren’t real, even creating URLs that are made up. Which is probably what you and I would expect, given the way the models are trained, but ordinary users, laypeople, think, “What is this? Why would this happen?” And then you even end up with lawyers presenting cases that never existed to a judge as a result of this kind of thing.
00:42:30
So, it’s cool that Galactica was able to do those citations even before the ChatGPT release last year. Nice. And speaking of issues with large language models, another big one has historically been the expense associated with all the human labor needed to create a curated dataset. So, you mentioned right at the beginning of the episode how there’s this pre-training step which is self-supervised, where you can just use natural language; it doesn’t require any labeling, and it gets us to model weights that have this rich understanding of the world, but the model isn’t calibrated to optimally answer questions from people and perform tasks based on instructions.
00:43:27 
And so there’s this second step after the pre-training where we do this fine-tuning. For that fine-tuning step, historically we’ve wanted high-quality datasets. So, the Vicuna people, for example, Joey Gonzalez’s team at Berkeley, took the original LLaMA, which was just pre-trained, and then used hundreds of thousands of conversations that people had shared. I’m forgetting the name of it off the top of my head, but there was a browser plugin that lots of people were using to save and share interesting conversations that they’d had with ChatGPT.
00:44:05
And so this was in the public domain, and so the Vicuna people at Berkeley took that dataset and used it to fine-tune LLaMA and create this Vicuna LLM, which still today has remarkably good performance for a relatively small open-source LLM, compared to other open-source LLMs and even compared to many of the proprietary options out there. So, this kind of trick can get you so far, but ultimately you might want lots more of these instruction pairs, of these labeled data, to be able to create a powerful fine-tuned LLM. And so my understanding is that the Unnatural Instructions project that you were a part of at Meta was designed to help alleviate this issue.
Thomas Scialom: 00:44:54
Yeah, that’s interesting, because at the time of Unnatural Instructions, there was not even ChatGPT. And so at that time you had, on one side, OpenAI with GPT Davinci 3, Davinci 1; it was a good instructed model, very capable. And on the other hand, you had pre-trained models that were not that good, not that bad, and instruction datasets that were very academically oriented, I would say, from standard tasks like summarization, question answering. You clearly didn’t have the diversity of instructions that people would have asked and that Davinci-instruct was good at answering. Collecting this diversity of instructions is actually extremely challenging, even for humans. Think of 10 different tasks and instructions right now; it would be hard for you to come up with this level of diversity. And so at a scale of 1,000, or one million, that’s pretty hard.
00:45:56
Somehow OpenAI managed to do that with Davinci; maybe they collected some data from their API, they had some annotators, which has been well known for years, they had some experience. Now, what we found out: the question was how to get some diversity. With ChatGPT, people type some instructions and you have the output of the model. But when you don’t even have that, how can you generate not only the answer from the model but the instruction? And what we found out is that somehow you can ask Davinci 3, GPT-3.5, I think, or the version before, to generate those instructions. So, you can say, “Generate me an instruction and output for code, for this topic, for summarization,” or just without specifying any topic.
00:46:59
It will generate a lot of samples and examples, not only the answer but also the instruction, so that you can create an unnatural dataset that we actually found to be more natural than some of the natural datasets at the time. The reason was that the natural datasets published by researchers, using actual humans to create the data, were kind of lacking in diversity and academically oriented, while somehow the model from OpenAI managed to generate a large diversity, much closer to actual use cases. I think we could see it as kind of a distillation process from a more capable model that was fine-tuned on this data. And that was kind of a temporary solution for people that had no access to instructed models, which is also one of the reasons we moved to Llama 2 and did the process from scratch to create our own data. Indeed, we paid quite a lot for that. We collected more than a million annotations. We did the whole RLHF stage, and so now we have such a capable model. At the time, no one had models at this scale.
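A minimal sketch of that bootstrapping idea might look like the following; `generate` is an assumed helper wrapping whatever capable LLM endpoint you have, and the prompt wording is illustrative rather than the one from the Unnatural Instructions paper:

```python
import json

SEED_PROMPT = """Invent one example for an instruction-tuning dataset.
Reply as JSON with the keys "instruction", "input" and "output".
Make the task realistic and different from typical academic tasks."""

def make_examples(generate, n=1000):
    """Ask a capable model to produce both the instruction and the
    answer, then keep only well-formed generations."""
    examples = []
    for _ in range(n):
        raw = generate(SEED_PROMPT)  # assumed: prompt in, text out
        try:
            examples.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # discard malformed generations
    return examples
```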
Jon Krohn: 00:48:15 
Nice. Yeah, that’s a great overview of the project. Let’s dive into that. You mentioned near the beginning of the episode this RLHF, reinforcement learning from human feedback. This is a key part of the fine-tuning process, and with Llama 2 you introduced a unique two-stage RLHF process, which evidently has led to even better results. So, not only did you have this large annotated dataset of more than a million training data points, but you used this new methodology, this two-stage RLHF. So, do you want to explain RLHF, and particularly this two-stage process, to us?
Thomas Scialom: 00:48:59
Yeah, so RLHF stands for reinforcement learning from human feedback. And the idea is to fine-tune the model: you type a prompt, a question, to the model and you sample different outputs. And instead of asking a human to write the perfect solution and fine-tuning the model on what the human would have written, you try to train the model to go in the direction of what the human prefers among its samples. And at the beginning of the project, we knew that was kind of the backbone of some of the instructed models from Anthropic, OpenAI.
00:49:39
But if you had asked me at the beginning of the project, and most of the researchers around me, the answer is supervised data. When I ask annotators to write answers, that’s kind of the gold label. That is what is considered best by the community in general. You need to take good annotators, high-quality annotators, sure, but this is very expensive. Compare that with having the model generate, write the answers itself, but two answers, and asking a human which one they prefer; this takes way less time, and so you can scale it way more. And so if you had asked me, I would say, okay, if I have an infinite budget, maybe I prefer supervised learning and asking the humans to do that, but it’s not scalable, so, sure, we will do RLHF.
00:50:27
The thing I’ve realized for some time is that there’s some magic, which I feel is not yet well understood by the community, where we already have some superhuman performance on some creative writing tasks. An example I always give is: write a haiku poem about large language models and the [inaudible]. And then it comes up with something. I don’t know about you, but if you ask me, I will take an hour and then come up with nothing, and models are good at that. And the reason is the models are super capable and have seen the whole distribution of humans on the internet. Think about an example with coding: it knows the distribution of average coders, it knows the distribution of good coders, excellent coders and bad coders. And so if you ask annotators to write code, you would probably imitate this distribution.
00:51:30 
And by imitation you will have the distribution: 5% of the time it’s great, 50% of the time it’s in the middle, and sometimes there are some mistakes. And every human makes some mistakes. Now, if you apply RLHF, this is kind of different, and that’s where the magic is: you will shift the distribution toward excellence, toward even better than the best annotator you have. Because even if you are the best annotator, you will write at your best capabilities and you will make some mistakes, and the model will imitate you. But now, if the model imitates you and you sample 10 times, among those 10 times it will sample some examples that are really good, your best examples, and sometimes the worst examples. And so you can tell it, “No, this is the best example, this is what I wanted.” And sometimes it will also explore a bit beyond and do something that even you wouldn’t have done. And because it’s easier for humans to compare (I can tell you which poem I prefer; I can’t write one), you can have some emergence of superhuman capabilities on some tasks thanks to RLHF.
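At the core of that preference training is a reward model trained with a pairwise ranking loss. Here is a minimal sketch of the basic form, assuming scalar scores from a reward model; note that Llama 2’s actual recipe adds refinements, such as a margin based on the preference rating:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen, r_rejected):
    """Pairwise ranking loss: push the reward model to score the
    human-preferred response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage: scores would come from a reward model (an LLM
# with a scalar head) applied to (prompt, response) pairs.
loss = preference_loss(torch.tensor([1.3]), torch.tensor([0.2]))
print(loss)  # small when the chosen answer already outscores the rejected
```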
Jon Krohn: 00:52:39
Yeah, yeah. And you’re touching on something that blows my mind all the time about what we already have today. And this is why the release of GPT-4 in March was such a big deal for me and shifted my own perspective on the realization of artificial general intelligence, AGI, an algorithm that has all the learning capabilities of a human, in our lifetime. Because already with GPT-4, because of this magic of RLHF that you’re describing, there’s this shifting of the distribution: intuitively, I imagine a normal distribution in my head where the outputs are going to be exactly as you described, middling most of the time, sometimes excellent, sometimes poor. But with RLHF, we shift everything so that it’s excellent all the time. And the haiku example that you gave is great, because a lot of people have the experience of using GPT-4, and probably the experience of using Llama 2 is similar.
00:53:37
And by the way, any of you listening right now, you can go to Hugging Face Chat; at least at the time of recording, and probably still at the time of this episode’s release, the default model for Hugging Face Chat is the 70-billion-parameter Llama 2 chat fine-tuned model. So, you can experience it yourself. And the queries that I’ve done in Hugging Face Chat have been comparable to what I’d expect with GPT-4. But either way, with one of these state-of-the-art open-source LLMs, it’s capable of doing so many more things than I could as an individual. Obviously, you expect to be able to come to this interface and ask a question about anything in the world, and it knows the answer and it can articulate it well, and it can dive deeper, and it can explain why it did it a certain way.
00:54:32
And when you argue with it, when you disagree and you say, “No, no, I thought it was this other way,” it often knows, “Oh yes, that’s a common misconception.” And so it’s interesting that we ask, “Oh, how far away is artificial general intelligence, this thing that’s capable of learning everything that we can learn?” And already today, what we have, while maybe it isn’t as good as humans on some tasks, is so much better than an individual human at so many things that in some ways we’ve already attained this really crazy superpower here on this planet.
00:55:17
So, I don’t know, I kind of just went off on a tangent. There wasn’t really a question. But yeah, our researcher, Serg Masís, often digs up the most incredible things on our guests. And one of the things that he dug up on you was that five years ago, in 2018, and he might’ve translated this because you were saying it to French children, you said that there’s evidence that we are not at all close to achieving general intelligence and that it’s a fantasy. But my perception has shifted. An example that I’ve given, I think on air before, is that a year ago I was giving a TEDx talk in Philadelphia, and my whole point of the TEDx talk was that because of AI, technological progress is moving so rapidly that we can’t predict accurately, even a few years out, what kinds of capabilities we’ll have.
00:56:20
And if somebody had asked me at the time of the talk, a year ago, whether we would have an algorithm that could do the things that GPT-4 or Llama 2 can do, I would’ve said, “I don’t know if we’ll have that in our lifetime.” And now, a year later, we have it. And people like you are making it so that anybody can access it open-source. It’s wild. That shift is unreal. And I went from being a skeptic about what can happen with AI in our lifetimes to believing that, yeah, some really crazy things are probably going to happen in our lifetime. So, I don’t know if you have any more thoughts on that. I know that you’ve been interested in AGI for a long time; what are your thoughts on when we might realize AGI, or artificial super intelligence beyond it?
Thomas Scialom: 00:57:14
I mean, let me share my thoughts at the moment, but as a preliminary, let me say that it probably depends on the mood I am in. I often change my mind. One day I would say yes, another I would say no; I was always balanced. But also, I’m bad at predictions there. I think the only thing I’m sure of is that the unexpected is unexpected. Actually, five years ago, when I kind of started my PhD, it was 2017, the Transformer was there, GPT-1 was there. I was working on summarization with [inaudible]. And I remember some slides where three meaningless words were kind of the summary I could obtain. So, again, if you had asked me the same question, will we be where we are now? I would’ve said clearly no. Actually, I was even late to the party, and on all the scaling things, I realized very late how big it could be.
00:58:27
And now, related to AGI, I think there’s one question, which is: do we already have all we need to get AGI? Is it just a question of compute FLOPS and scale? And will we get there in the decade with more investment, which we’ll have, or not? And I don’t have a strong conviction there, but I can tell you that, well, first, I was bad at predicting the impact of scaling. Then I just watched a talk from Carl Shulman on YouTube where he clearly explains how, for him, scaling is a very important foundation, even for the brain and human cognition, and that could be it. And then there’s a very important question I always ask when doing deep learning: is it just statistical correlation or is it more? And I’m always balanced on that. Sometimes it seems so good at reasoning, and sometimes its mistakes are so silly.
00:59:37
And so actually there’s a paper that makes me tend toward the side that we could get AGI [inaudible] with scaling only. There’s a paper from Harvard called Emergent World Representations, exploring a sequence model trained on a synthetic task, published this year. This paper was notably using Othello-GPT, where, the idea, you can think about AlphaGo and all these things, but the idea here is not to get the state-of-the-art result. It’s to train a model to predict the next token, which is the next move from human players. And that’s it, just as a language model. And then the question is, at the end of that, did it learn the distribution of the moves as a stochastic parrot, or did it learn more, an understanding of the world? The world here is the game. And they clearly found that the model, the transformer, trained on that dataset, kind of learned the world, the rules, the game, what it is, how it interacts, beyond just a sequence of actions. And that is a clear signal that there’s more profound understanding, and that maybe from scale this intelligence could emerge.
Jon Krohn: 01:00:53
Yeah. Yeah. That is fascinating. So, I guess that’s the answer I’d expect. It was quite a balanced answer. Maybe we will, maybe we won’t. 
Thomas Scialom: 01:01:05
But we’re working on that. 
Jon Krohn: 01:01:06
Yeah. Yeah. And I could guess that you are probably on the side that we should be trying to open-source these. If we can have AGI, I expect, based on what you’re doing with open-sourcing Llama 2, Toolformer and Galactica, that you would like AGI to be open-source as well.
Thomas Scialom: 01:01:35
Yeah, I mean, I’m pro open-source. I’m pro not having a very capable model controlled somewhere by a few people. But at the same time, it doesn’t mean we should rush into open-sourcing such a big technology, and the efforts at the other labs to put the bar very high, to think ahead about this, what it means and what we could prevent, are very important, and we should learn from them. And eventually we’ll have some regulations and governance, and we will have an open AGI. It’s better than a closed AGI; historically speaking, it always has been and always will be. But it doesn’t mean we shouldn’t do it with responsibility.
Jon Krohn: 01:02:24
Nice. And that kind of responsible development of huge large language models is something that goes back a while for you. We’ve talked in this episode about the stuff you’ve been working on recently at Meta, like Llama 2, Toolformer, Galactica and Unnatural Instructions. But you worked on BLOOM several years ago, which, at the time, in the GPT-2, GPT-3 era, was, I think, the leading open-source analog to those kinds of models. And your whole PhD was based on this kind of thing. Well, I mean, the title of your thesis was Natural Language Generation with Reinforcement Learning, and you developed a method called QuestEval. Is there any relationship between that QuestEval and the RLHF that you were talking about earlier, or is the reinforcement learning that you focused on in your PhD different from RLHF?
Thomas Scialom: 01:03:29
So, somehow it has the same foundation, in the sense that you want to maximize a reward. At the time, reinforcement learning for natural language generation was based on automatic metrics like BLEU or ROUGE, and people who know those metrics know how bad they are. So, basically, you would improve the score but [inaudible] the quality of the output. The question became: how can you develop new metrics that actually capture more of what we want, so that we can apply reinforcement learning on top of them? That was working pretty well. I developed reinforcement learning techniques on one side, and metrics like QuestEval on the other. There’s a paper from IBM, one or two years ago, that did reinforcement learning with QuestEval as the reward, and they reduced hallucination by 30%. So, it was working. Now, the algorithms may be very close to RLHF in terms of architecture, implementation and math, but the philosophy of RLHF that I discussed before, improving beyond the maximum of the human annotator, is something quite different. 
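As a rough illustration of the recipe Thomas describes, optimizing a generator against an automatic metric with policy-gradient reinforcement learning, here is a minimal, self-contained sketch. The “policy” and the “metric” are deliberately toy stand-ins (in practice the policy is a sequence-to-sequence model and the reward is something like ROUGE or QuestEval), but the REINFORCE update is the real mechanism:

```python
# Minimal sketch (a toy example, not code from the episode) of using an
# automatic metric as a reward for text generation via REINFORCE.
import torch

vocab_size, seq_len = 20, 8
GOOD_TOKEN = 7  # pretend the metric rewards this token

# Trivial "policy": independent logits per position. A real system would
# use a seq2seq model conditioned on the source document.
logits = torch.zeros(seq_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def metric(sample):  # stand-in for ROUGE / BLEU / QuestEval
    return (sample == GOOD_TOKEN).float().mean()

baseline = 0.0
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample()        # one generated "summary"
    reward = metric(sample)
    # REINFORCE: raise the log-prob of samples the metric scored above
    # the running baseline, lower it otherwise.
    loss = -(reward - baseline) * dist.log_prob(sample).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    baseline = 0.9 * baseline + 0.1 * reward.item()

final = torch.distributions.Categorical(logits=logits).sample()
print("final metric score:", metric(final).item())  # should approach 1.0
```

The caveat Thomas raises is exactly Goodhart’s law: with a weak metric like BLEU or ROUGE, this loop happily improves the score while degrading real quality, which is the motivation for building better reward metrics like QuestEval.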
Jon Krohn: 01:04:45
Very cool. And prior to your PhD, you were involved in quantitative trading at SocGen, Societe Generale. That’s something you and I have in common; I wasn’t at SocGen, but between my PhD and becoming a data scientist I worked for a few years as a quantitative trader on algorithmic trading. I don’t know how interesting it would be to go into financial or algorithmic-trading applications of AI and LLMs, and you’re welcome to talk about that if you want to, but I think something that might be more interesting for our listeners is that you advise on, and invest in, early-stage companies focused on generative AI and LLMs. We probably have a lot of listeners out there who would like to start up or scale up a company. So, what advice do you have for people looking to start or scale a generative AI company? What kinds of problems should they be solving, or what should they do? 
Thomas Scialom: 01:05:58
That’s a tough question. I mean, I’m good at advising them on the research side: what the trends are, where we’ll be in one or two years, whether a technology is still far off or nearly mature. That helps them move from research labs to applications quickly, so I feel I have some ability to help in that regard. Now, it’s especially difficult to predict where to invest right now in generative AI. There’s a kind of paradox with this technology, because of the scale and velocity we discussed; you said that yourself a few minutes ago. Think about when I started in data science and deep learning: data was the moat. You had companies like Grammarly that annotated a lot of data and trained, with deep learning, some very strong models on that proprietary data to correct grammatical errors.
01:07:10
And that was a very strong technological barrier, because to beat them, to outperform them with deep learning, you would need to annotate the same volume of data at the same quality. So, they were the leaders. And now, with the same underlying technology, deep learning, what, one to three years later, you have a plug-and-play model, ChatGPT, and in one minute you can build a website or a Google Chrome plug-in that corrects text even better than Grammarly, and is far more general. All of that technological barrier vanished in a second. So any product built on this technology, everything we are saying right now, could vanish within a year. As I said before, expect that the unexpected will happen. And so I guess the main question for entrepreneurs is: what can you build that will be robust under these conditions?
Jon Krohn: 01:08:10
Yeah. What can you build that will expect the unexpected? 
Thomas Scialom: 01:08:15
Something that would even be reinforced when the unexpected happens. 
Jon Krohn: 01:08:19
Nice. So, I guess that’s the kind of thing people need to be thinking about with their moats. What is it? Is there some kind of data, or some kind of market access, that is unique? Something that means that even if much better generative AI models are open-sourced, models that could eat your lunch, you still have an opportunity. If you can get an edge somewhere, then when these unexpected new AI capabilities come out, you can integrate them into your tech rather than being eaten by them. 
Thomas Scialom: 01:09:01
Yeah. And again, I don’t want to make entrepreneurs worried. This is a risky and challenging environment, but at the same time it’s one of the greatest moments for entrepreneurs to build products. That’s where the paradox comes from: it’s one of the best times to create, but also a very risky one. 
Jon Krohn: 01:09:19
Nice. Very well said. All right, awesome. So, that is the end of my questions for you, Thomas, and the end of Serg’s questions for you. So, let’s turn to audience questions. I made a post a week before recording on social media on both LinkedIn and Twitter and the LinkedIn post in particular got a crazy amount of reactions, 250 reactions, over 70,000 impressions just at the time of recording here, which is definitely at the top end of the distribution for posts that I make. And we had a really cool one from Alice Desthuilliers who used to work with me at Nebula. She was an amazing product manager responsible for our data products and AI products. But Alice, I think your questions on unnatural instructions have already been answered earlier in the episode by Thomas. So, hopefully that answer was to your satisfaction. So, let’s move on to a question from Adithyan.
01:10:22
So, Adithyan is interested in rough rules of thumb for choosing which open-source LLM to start with and how to fine-tune it, in his case for building a startup around a niche use case. Some of his questions, like how to decide what model size to go with, I think I actually answered earlier in the episode. With Llama 2, for example, the released model sizes are seven billion, 13 billion and 70 billion parameters. As I mentioned earlier, the seven- and 13-billion-parameter models can often fit on a single GPU, and a small model like that can be good enough for a niche task. You’d only need the 70-billion model if you wanted it to handle a very broad range of possible tasks.
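For anyone who wants to sanity-check that single-GPU claim, here is a back-of-the-envelope estimate (my own arithmetic, not figures from the episode): inference memory is roughly bytes-per-parameter times parameter count, plus some overhead for activations and the KV cache.

```python
# Rough VRAM estimates for the Llama 2 sizes discussed here. Ballpark
# figures only; real usage depends on context length, batch size, and
# the serving framework.

BYTES = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}  # common inference precisions

def approx_vram_gb(n_params_billion: float, precision: str,
                   overhead: float = 1.2) -> float:
    """Weights footprint plus ~20% for activations and the KV cache."""
    return n_params_billion * BYTES[precision] * overhead

for size in (7, 13, 70):
    for prec in ("fp16", "int4"):
        print(f"Llama-2-{size}B @ {prec}: ~{approx_vram_gb(size, prec):.0f} GB")

# 7B @ fp16 ≈ 17 GB (fits a 24 GB consumer GPU); 13B @ fp16 ≈ 31 GB
# (needs a 40 GB card, or int4 quantization); 70B @ fp16 ≈ 168 GB,
# which is why the largest model needs multiple GPUs.
```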
01:11:23
So yeah, in your case, Adithyan, with a niche use case, seven billion parameters is probably going to be fine. You can start there, and if it doesn’t do the trick, try 13 billion. But the question for you, Thomas, is how many data points do you think he needs to collect, or somehow synthesize, to make fine-tuning worthwhile? The implication here is that there’s some niche use case he would like the model to handle. How many data points does he need in order to make use of a parameter-efficient fine-tuning approach on top of Llama 2 and excel at that task? 
Thomas Scialom: 01:12:02
Right, it’s an interesting question. I’ll start by saying that maybe you can begin even without fine-tuning, just off the shelf with zero-shot, or with few-shot: one, two, three, five examples you created yourself. Keep in mind it’s not a few-shot pre-trained model like before; it’s a chat model. So you may need to do a bit of prompt engineering, in the sense of creating a dialogue: for example one, you supply your input and make the model’s turn contain your gold output, so that when you ask your actual question, the model is biased toward the format, the template, you want the answer in. That would be the first thing I’d try. If it’s not enough, then, and it’s very hard to answer systematically because it depends on the use case, the task, the difficulty, et cetera, what I have generally seen is that with very few examples, sometimes a hundred, a thousand at most, you can get dramatic improvements on some tasks. 
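To show what that dialogue trick might look like in practice, here is a small sketch. The [INST] and <<SYS>> tags follow Meta’s published Llama-2-chat template, but the ticket-classification task, the helper name, and the example strings are all invented for illustration:

```python
# A minimal sketch of the prompting trick Thomas describes: present your
# own worked examples as earlier turns of the dialogue, with your gold
# outputs as the model's "previous answers", so the real question
# inherits that format.

def llama2_few_shot_prompt(system: str, examples: list[tuple[str, str]],
                           question: str) -> str:
    """Build a Llama-2-chat prompt with worked examples as prior turns."""
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    for i, (user_msg, gold_answer) in enumerate(examples):
        prefix = "" if i == 0 else "<s>[INST] "
        prompt += f"{prefix}{user_msg} [/INST] {gold_answer} </s>"
    # The real question goes in the final, unanswered turn.
    prompt += f"<s>[INST] {question} [/INST]"
    return prompt

examples = [  # hypothetical niche task: classify support tickets
    ("Ticket: 'App crashes on login.'", "category: BUG, priority: HIGH"),
    ("Ticket: 'Please add dark mode.'", "category: FEATURE, priority: LOW"),
]
print(llama2_few_shot_prompt(
    "Answer with exactly: category: <X>, priority: <Y>.",
    examples,
    "Ticket: 'Export to CSV produces an empty file.'"))
```

Each worked example is inserted as if the model had already answered it with your gold output, so the final, real question strongly inherits that answer format.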
Jon Krohn: 01:13:11
Very nice. Yeah, that’s a really great answer. Very practical, Thomas, thank you very much. All right, our next question is from Svetlana Hanson. She’s a senior software engineer, and I believe she works on a lot of outer-space projects with folks at places like NASA. Svetlana has been following the Super Data Science podcast, I think, for as long as I’ve been hosting it, so several years now. She’s had some great guest suggestions in the past, and she had a series of great questions for Thomas. One that I really liked was about the lessons you’ve learned, Thomas, from developing and managing these large-scale AI projects. Being involved with BLOOM years ago, Galactica, Toolformer, Llama 2: these have huge team sizes and huge models, very long compute times, and you’ve already given us a bit of insight into this.
01:14:11
There’s this pressure, this race, especially in open source, to get out there before other people. So, for example, when the 34-billion-parameter model wasn’t meeting the same safety standards as the seven-, 13- and 70-billion-parameter Llama 2 models, you made the decision to go ahead and publish what you had, because you had the state of the art with the 70 billion, along with the smaller models that can fit on a single GPU. So we’ve had some insight into your thinking on these kinds of large-scale projects. But yeah, Svetlana’s first question here: what other key lessons have you learned about developing and managing large-scale AI projects?
Thomas Scialom: 01:14:47
Yeah, it’s a very interesting question. Let’s try to sound smart on this one. Maybe the main difference between these big projects and when I was in academia, writing small papers with very few people, is the size. A lot of people are impacted, there’s a lot of budget around, and you have the potential to reach many more people; the project is at another scale of impact. And because of all those ingredients, and this was the case for BLOOM, and even more for Galactica, where I was more involved in the training and the teams were much smaller, you have a lot of GPUs running and you have to make decisions. And the main difference is this: in a perfect world, a researcher like me wants to understand everything, all the phenomena.
01:15:50
And so you want to do all the ablations, all the experiments, to see the impact of this factor and that one, and what would happen if we had done things differently. The thing is, there are so many possibilities, and every experiment costs so much and takes so many resources, that you cannot do that anymore. So one of the main challenges is that you are responsible for making decisions, as I was on Llama 2: okay, we need to choose between this and that. It’s even harder because no one is publishing anymore, [inaudible] maybe just we did. You’re like, okay, what’s my intuition? How can we verify quickly and change course if needed? And you’re playing with a lot of resources: millions of dollars, as I mentioned, for the annotation; thousands of GPUs; many [inaudible] involved in the project. Time is also a constraint on resources, and you cannot spend a year exploring. How to deal with this challenging environment was, I thought, the main challenge. And at night, before sleeping, you’ve taken a decision, and is it the correct one or not? You don’t know. That uncertainty is something hard for a researcher to deal with. 
Jon Krohn: 01:17:15
Nice. So, I guess your key lesson is that there are trade-offs, and you don’t know whether you’re making the right call with these decisions about how quickly to ship something versus spending more time on it. Well, with Llama 2 you certainly got it right; it made an enormous splash and a huge impact. So you seem to be getting it right. We’ve got a comment here from Laurens van der Maaten, who was recently on the show, episode 709, and is a colleague of yours at Meta. He doesn’t have a question, but I just wanted to highlight that he said, “Thomas is awesome. I’m looking forward to hearing your conversation with him.” 
Thomas Scialom: 01:17:59
Thanks Laurens. 
Jon Krohn: 01:18:00
Yeah, so Laurens, I hope that you enjoyed this conversation as much as you were hoping to. And then our last question here is from SM. SM has asked questions on the show before, but has a, I assume, very deliberately sparse LinkedIn profile, which is unusual. Most people on LinkedIn use real names and that kind of thing, but SM’s account seemingly exists solely to ask questions on the Super Data Science podcast, because there are no other connections. So, I appreciate that compliment. SM’s question is a long one, but I think it’s basically getting at this idea that LLMs can be wrong, they can make mistakes, they can give unhelpful answers, but nevertheless they are often very useful and are becoming more useful all the time. I think we touched on this a little earlier in the episode, when you were talking about the research at Harvard and the ability of transformers to seemingly understand… Understand is such a bad word, but- 
Thomas Scialom: 01:19:18
I understand. I would agree.
Jon Krohn: 01:19:19
Yeah. So, I think you have a good sense of the question of where this is going, so you can answer it.
Thomas Scialom: 01:19:25
Yeah, I think the question, if I understood it correctly, is about whether we can one day rely on these models, and why that isn’t yet the case. Humans obtain a 100% score on some very simple tasks, while models will sometimes do such impressive things and then, when you least expect it, fail at silly things. And that’s very weird. My understanding, and I’m not saying I have it right, just my intuition at the moment, is that, as we discussed before, with scale we might see the emergence of much more general reasoning and understanding. And my understanding is that these algorithms learn a compression of the data.
01:20:20
Maybe let me give you an example or two to make this clear. I can generate an infinite number of tokens to train the model on numbers and arithmetic: one plus two equals three, and so on. Now, there are two ways to predict the next token after the equals sign. You can memorize everything, but with an effectively infinite vocabulary you would need a huge number of weights to memorize it all. Or you can compress the information, internalize the algorithm behind it, and then you can predict the next token accurately whatever the numbers are. It requires far fewer weights to learn arithmetic than to memorize an infinite table of sums. Now, at the current time, it seems that large language models are very good at calculations with one, two or three digits, and beyond that they fail more and more.
01:21:22
My understanding is that they have internalized the generalization for one-digit arithmetic, but somehow, in that large vector space, they see two-digit arithmetic as a different object from four- or five-digit arithmetic, maybe because it appeared less often in training. So they don’t yet have the generalization of, “Oh, one, two, three, five, six, seven, eight digits, that’s arithmetic, so nine and ten digits, which I didn’t see in training, are arithmetic as well.” There’s one dimension along which they haven’t generalized, but there are others along which they already have. And I feel that true AGI, whether we get there with scaling or some other way, will come from this generalization of compression at another scale, when the generalization becomes complete somehow.
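A toy way to see the memorize-versus-compress contrast Thomas draws (this is an illustration of the argument, not anything from the episode): a lookup table of sums grows without bound and cannot extrapolate, while the algorithm itself stays tiny and generalizes to numbers it has never seen.

```python
# Two ways to "predict the token after the equals sign": memorize every
# sum ever seen, or internalize the addition algorithm.
import itertools
import sys

MAX_N = 200  # how far the "training data" goes

# Strategy 1: memorization. One stored entry per (a, b) pair ever seen.
lookup = {(a, b): a + b for a, b in itertools.product(range(MAX_N), repeat=2)}

# Strategy 2: compression. The algorithm itself; its size never grows.
def add(a: int, b: int) -> int:
    return a + b

print("memorized entries:", len(lookup))              # 40,000 and growing
print("table overhead:", sys.getsizeof(lookup), "B")  # megabytes at scale
print((MAX_N + 1, 3) in lookup)  # False: the table cannot extrapolate
print(add(MAX_N + 1, 3))         # 204: the algorithm generalizes
```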
Jon Krohn: 01:22:14
That was an amazing answer. That made it so crystal clear and really built nicely upon what you said earlier in the episode around representing these complex concepts. Very, very cool. All right, Thomas, it has been an amazing episode. I’ve learned so much. Truly, it’s been an honor to have you on the show. Before I let you go, do you have a book recommendation for us?
Thomas Scialom: 01:22:39
Maybe The Black Swan by Nassim Nicholas Taleb. 
Jon Krohn: 01:22:43
Nice one, yeah. Great choice. And how should people follow you? After this episode if people want to keep up with the latest on your work or your thoughts, how should they do that? 
Thomas Scialom: 01:22:57
Sure, they can follow me on LinkedIn at Thomas Scialom or on Twitter as well. I’m really easy to find there. 
Jon Krohn: 01:23:03
Nice. All right. We’ll be sure to include those links in the show notes. Thomas, thanks again and best of luck. We can’t wait to see what you release next; some of it, it sounds like, probably even before this episode goes live. And truly, on behalf of my listeners and the tons of other early-stage startups like mine, we are so grateful to have people like you and Meta willing to open-source these incredible technologies. It’s making a huge impact commercially, and a big social impact too, so thank you very much. 
Thomas Scialom: 01:23:40
Thank you, Jon, for having me and for all the kind words. It was my pleasure. 
Jon Krohn: 01:23:49
Thomas is already a legend, but it seems he’s only just hitting his stride, and his biggest, most mind-blowing, potentially AGI-summoning projects are yet to come. In today’s episode, Thomas filled us in on how pre-training and fine-tuning an LLM at a scale as yet unprecedented for an open-source LLM led to the big Llama 2 splash. He talked about how handling code, tools, web search and even better performance are up next for the Llama project; how Toolformer calls an appropriate API and incorporates the output into its next-token predictions; how RLHF shifts the distribution of a pre-trained LLM’s outputs from a normal distribution of human-generated quality to outstanding, often superhuman quality; and how, with AI developments, the unexpected is expected. And so AGI may be just around the corner.
01:24:38
As always, you can get all the show notes, including the transcript for this episode, the video recording, materials mentioned on the show, and the URLs for Thomas’s social media profiles, as well as my own, at www.superdatascience.com/713. Thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another tremendous episode for us today. You can support this show in so many ways: check out our sponsors’ links, share it with a friend or colleague, review an episode, subscribe, but most of all, just keep on tuning in. I’m so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rocking it out there, and I’m looking forward to enjoying another round of the Super Data Science podcast with you very soon. 