SDS 695: NLP with Transformers, feat. Hugging Face’s Lewis Tunstall

Podcast Guest: Lewis Tunstall

July 11, 2023

What are transformers in AI, and how do they help developers to run LLMs efficiently and accurately? This is a key question in this week’s episode, where Hugging Face’s ML Engineer Lewis Tunstall sits down with SDS host Jon Krohn to discuss encoders and decoders, and the importance of continuing to foster democratic environments like GitHub for creating open-source models.
Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Lewis Tunstall

Lewis Tunstall is a Machine Learning Engineer in the research team at Hugging Face, and is the co-author of the recent bestseller “NLP with Transformers” book. He has previously built machine learning-powered applications for start-ups and enterprises in the domains of natural language processing, topological data analysis, and time series. His current work focuses on democratizing reinforcement learning from human feedback by providing tools and artifacts for the open-source community. He holds a PhD in Theoretical Physics from the University of Adelaide, was a 2010 Fulbright Scholar and has held research positions in Australia, the USA, and Switzerland.
Overview
This episode is all about transformers, and to introduce the topic, Lewis Tunstall treated us to a little history lesson. In 2017, Google wanted to optimize its machine translation capabilities. Until that point, translation tasks were carried out using long short-term memory (LSTM) networks. These networks worked well for certain applications, but the main problem was that no one could figure out how to scale them up, which left them slow to train and limited in capability. Part of the answer came in attention mechanisms, which encode context through an additional layer in a neural network. Attention mechanisms help the model capture relationships between the words in a sequence and therefore predict the next word more accurately. Early transformers that facilitated this had an “encoder-decoder architecture”, in which the encoder takes an input sequence (a prompt) and converts the individual tokens of the prompt (roughly, words and pieces of words) into a sequence of embeddings. The decoder then uses this information to predict the next token.
This wasn’t the end of the story. As Lewis recounts, researchers at OpenAI noticed that for pure text generation the most crucial component of the transformer architecture is the decoder, so they removed the encoder entirely and trained the model simply to predict the next word in a sequence. The result was the decoder-only GPT family. The story even gets a surprise third act with Google’s BERT, an encoder-only transformer that did precisely the opposite of the decoder-only models, throwing away the decoder to instead focus on getting rich representations of NLP sequences, which improved tasks like text classification enormously.
Lewis is keen to note that knowing the intricacies of transformers isn’t just a ‘nice-to-know’. Whether you’re fine-tuning or pre-training transformers, obstacles will be part and parcel of the process. If you know the underlying computations of a network, then it becomes much easier to debug the problem. What’s more, Lewis says that understanding the fundamentals of transformers will help engineers improve and extend them to complete increasingly sophisticated tasks.
Jon and Lewis also discuss the practicalities of running large language models (LLMs) in real-world applications. Lewis takes an example from his latest book, explaining that a concept called “knowledge distillation” lets engineers take the information in a large but slow model and distill it into a smaller, more efficient one. The result is faster, cheaper inference with a slight hit to accuracy, a trade-off that can nevertheless be worth it, depending on the problem. Lewis outlines two additional methods for running LLMs efficiently, quantization and pruning, in the podcast, and all three can also be found in his book.
Listen to the episode to hear about parameter-efficient fine-tuning, knowledge distillation, and a practical way to get started with Hugging Face libraries.

In this episode you will learn: 

  • What a transformer is, and why it is so important for NLP [04:34]
  • Different types of transformers and how they vary [11:39]
  • Why it’s necessary to know how a transformer works [31:52]
  • Hugging Face’s role in the application of transformers [57:10]
  • Lewis Tunstall’s experience of working at Hugging Face [1:02:08]
  • How and where to start with Hugging Face libraries [1:18:27]
  • The necessity to democratize ML models in the future [1:25:25]
 
Items mentioned in this podcast:

Follow Lewis:

Follow Jon:

Episode Transcript: 

Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 695 with Dr. Lewis Tunstall, Machine Learning Engineer at Hugging Face. Today’s episode is brought to you by the AWS Insiders podcast, by WithFeeling.ai, the company bringing humanity into AI, and by Modelbit for deploying models in seconds. 
00:00:22
Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:00:53
Welcome back to the SuperDataScience podcast. Today I have the great honor of being joined by the brilliant Lewis Tunstall. Dr. Tunstall is an ML engineer at Hugging Face, one of the most important companies in data science today because they provide much of the most critical infrastructure for AI through open-source projects such as their ubiquitous Transformers library, which has a staggering 100,000 stars on GitHub. Lewis is a member of Hugging Face’s prestigious research team, where he is currently focused on bringing us closer to having an open-source equivalent of ChatGPT by building tools that support RLHF, which is Reinforcement Learning from Human Feedback. And he’s also big into large-scale model evaluation. On top of all that, Lewis was the first author of the book, Natural Language Processing with Transformers, an exceptional bestselling book that was published by O’Reilly last year and covers how to train and deploy large language models using open-source libraries.
00:01:47
Prior to Hugging Face, he was an academic at the University of Bern in Switzerland and held data science roles at several Swiss firms. He holds a PhD in theoretical and mathematical physics from Adelaide in Australia. Today’s episode is definitely on the technical side, so will appeal most to folks like data scientists and ML engineers. But as usual, I made an effort to break down the technical concepts Lewis covered so that anyone who’s keen to be aware of the cutting edge in natural language processing can follow along. In this episode, Lewis details what transformers are; why transformers have become the default model architecture in NLP in just a few years; how to train NLP models when you have little to no labeled data available; how to optimize LLMs for speed when deploying them into production; how you can optimally leverage the open-source Hugging Face ecosystem, including their Transformers library and their hub for ML models and data; how RLHF aligns LLMs with the output users would like; and how open-source efforts could soon meet or surpass the capabilities of commercial LLMs like ChatGPT. Exciting. All right, you ready for this freaking fantastic episode? Let’s go. 
00:02:54
Lewis, welcome to this SuperDataScience podcast. Delightful to have you here. Where are you calling in from?
Lewis Tunstall: 00:03:01
Thanks for having me, Jon. I’m calling from Switzerland. 
Jon Krohn: 00:03:05
Nice. Yes, and by coincidence, the way that I managed to wrangle you into coming on this podcast was on a recent trip I had, while I was flying to Switzerland. I was on the plane reading a book called Natural Language Processing with Transformers, and the first author on that book is you, Lewis Tunstall. And so I was reading it on the plane, and shortly after I landed, I was filming a podcast episode with a guest, Richmond Alake, who was on in episode number 685, and he has a podcast himself. At the end of the episode, I said, Richmond, do you have any great podcast guests that you would recommend? And his first recommendation was you. And I was like, that’s crazy because I’m currently reading his book. I absolutely love it. It’s obviously super topical. Everyone wants to hear about NLP with transformers these days, so I’d love to have him on air. Richmond made an introduction and now you’re here. Thank you so much. 
Lewis Tunstall: 00:04:06
Yeah, thanks a lot. It’s a small world really. I also met Richmond very randomly, I think one day he just messaged me saying, “Hey, I have a podcast, you want to come on?” And, you know, it is just these things in life, you know, connections that happen kind of very you know, organically. 
Jon Krohn: 00:04:21
Yeah. Well, thank you for taking the time for me and for him and for the listeners of all the podcasts out there that you educate. We’ve got a super educational outline planned for today, and we’re gonna start right with transformers. So, Lewis, what is a transformer and why is it such a big deal for natural language processing in particular?
Lewis Tunstall: 00:04:41
Great question. So, maybe we can break it down into a couple of steps. So, at the high level, the transformer is just a neural network, and in particular it’s a deep neural network. So you’ve probably heard of deep learning kind of taking over software in the world in the last few years. And it was developed by researchers at Google around 2017 who were trying to find a more efficient way to do machine translation. And up until that moment, the sort of standard way of doing any sort of machine translation task was basically using a type of network called an LSTM. And these LSTMs, they have this kind of recurrent structure, which means that, basically, when you want to convert one sentence into another, you feed in the words from, say, the English sentence, and then this network would kind of iteratively process those words to then generate the translation. 
00:05:36
And these neural networks, these LSTMs, they worked quite well. But they had a few issues. And the major issue was that no one at the time could figure out how to kind of scale them, which means how could you increase the size of the neural network in terms of parameters and also how could you train on, you know, massive corpora. And so there were a few ideas floating in the literature at the time. Probably the most prominent one was something called attention mechanisms. And these attention mechanisms were also designed for machine translation, where the idea was when we’re trying to process language, how can we encode some of the context that is surrounding the words in some phrase. So, an example might be like, if I say my name is Lewis and I come from Australia, then we sort of can imagine there’s some kind of correlation or some relationship between the word Lewis and the word Australia. There’s some sort of connection between those two words in that sentence.
00:06:36
And what this attention mechanism did, essentially it’s a layer in a neural network. It provided a way to essentially teach neural networks how to model those relationships in a fairly efficient way. And so what the researchers at Google said was like, okay, maybe we can just take this attention idea and just train a network based entirely on this, with a few other, you know, tricks and things that were common in the literature. And the result they found was, first of all, a machine translation system that was state-of-the-art at the time. But more importantly it was something that could basically be parallelized on GPUs. So, instead of having to use a kind of recurrent structure where you have to process sequences kind of word by word, you could basically feed in the full sequence and then this attention mechanism would compute all these correlations, which would then allow these models to be scaled up. 
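To make the attention mechanism Lewis is describing concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The shapes and names below are illustrative assumptions, not code from any library mentioned in the episode.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over a batch of token embeddings.

    x: (batch, seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices for queries, keys, values
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # (batch, seq_len, d_head) each
    d_head = q.size(-1)
    # Every token attends to every other token: the score matrix is seq_len x seq_len,
    # and because it is computed in one shot it parallelizes nicely on a GPU.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v                            # context-aware representations

# Illustrative usage: 6 tokens with 32-dimensional embeddings
x = torch.randn(1, 6, 32)
w_q, w_k, w_v = (torch.randn(32, 16) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)            # shape: (1, 6, 16)
```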
00:07:31
And at the time this was, this was already like a big deal for machine translation. But then researchers at OpenAI they took this idea one step further and they said, well, maybe we can actually just do this for sort of just general text generation. So, instead of just having a single task like machine translation, what if we just train a transformer that’s just very, very good at modeling the next word in a sequence. And this was the start of what was called GPT or the Generative Pre-trained Transformer. And in many ways that marks the kind of start of this like revolution in transformers where people eventually, you know, saw, okay, this is both very good at generating text. And then as you scale up to the size of the internet and also to, you know, hundreds of billions of parameters, now you get these kind of models today like, you know, GPT-4 and ChatGPT. 
Jon Krohn: 00:08:21
Yeah. Really cool explanation there. I liked how you transitioned from LSTMs handling the sequential data to these transformers that have this attention mechanism that are able to take in the entire sequence at once. It’s interesting how today I’m aware of a few different research strands that are now trying to blend these two kinds of approaches. Because one of the big downsides of the transformer approach is that the larger the context window, so if you want to handle twice as many tokens, and roughly you can think of them as words, if you want to handle twice as many in your input, because the transformer needs to attend to that entire sequence, it vastly increases the amount of compute. So, where LSTMs, because they work sequentially, as your sequence gets longer, the compute scales linearly. But with transformers, it scales polynomially. So if you expand your context window by x, the amount of compute required goes up by x squared. So very, very quickly, way, way more compute is required if you have these bigger context windows. And so I’m aware of these research threads where people are trying to find some way of kind of blending the way that LSTMs worked with transformers, so that we can get the attention on the full context window without necessarily that polynomial explosion in compute. 
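A quick back-of-the-envelope illustration of the scaling Jon describes; the sequence lengths are arbitrary examples.

```python
# Self-attention builds a seq_len x seq_len score matrix per head, so doubling the
# context window roughly quadruples the attention compute and memory.
for seq_len in (1_000, 2_000, 4_000):
    print(f"{seq_len:>5} tokens -> {seq_len ** 2:>12,} attention scores per head")
# 1,000 tokens -> 1,000,000 ; 2,000 -> 4,000,000 ; 4,000 -> 16,000,000
```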
Lewis Tunstall: 00:09:53
Yeah, that’s totally right. And I think a lot of people kind of declared LSTMs were dead, you know, with transformers and I’m trying to remember the name of this, this latest model. It’s got like a funny acronym, like I think like AKVMW or something like this. But it basically it does exactly this, it tries to blend the kind of best of both worlds. So, how do you have essentially infinite context length, but also have the parallelizability? And from at least the demos that this set of researchers have provided, you can see that they are quite competitive with, you know, standard transformers today. So, things that are like, you know, LLaMA and stuff, you see some of these LSTM hybrids at the smaller scales are quite competitive. And yeah, who knows, maybe we’ll see in a few years that, you know, it’s not just the transformer that is the kind of key ingredient. But I think today most people kind of default to transformers just because the ecosystem has kind of become fairly commoditized. So, it’s now relatively easy to fine-tune transformers and relatively easy, getting easier now, to pre-train transformers. And I think there’s a whole bunch of like, tools around that, which, you know, for these more research-based projects, they often take a bit of time to kind of, you know, coalesce into the kind of general practitioner’s toolbox. 
Jon Krohn: 00:11:18
Yeah. And we will dig into a lot of these tools, many of which you and the Hugging Face company are involved in, the leaders really in making transformers accessible and easy to train. So, we’ll get into that in a moment. But before we get there, let’s dig some more into transformers. So, you have a whole chapter dedicated to transformer anatomy. So, maybe you can give us a high-level overview of these key transformer anatomy concepts like encoders, decoders, and then some of them are encoder-decoders. So, how do these different kinds of transformers vary and why would you use one in a particular scenario or another? 
Lewis Tunstall: 00:11:57
That’s a great question. So the original transformer, as I mentioned before, was trying to model basically machine translation. And in this task you’ve got some input sequence of text that you’re trying to translate into an output sequence. And so the actual original transformer has this so-called encoder-decoder architecture, where essentially you have an encoder, which is taking your input sequence. And the role of this encoder is to essentially convert all of these kind of raw tokens, so basically bits of words and so on, into a sequence of embeddings. And these embeddings are essentially like the sort of numerical representation associated with each token in your sequence. And then the decoder part of the transformer then takes that sequence of embeddings and then does, as the name suggests, decoding, which basically says, okay, given that input sequence, now, how can I, for example, predict the next token in that sequence? 
00:12:57
So, if you imagine that my input sentence is, you know, “my name is Lewis” and I want to translate it to German, then the input to the decoder will be basically these embeddings of “my name is Lewis”, and then the role is to now predict, given that input, that the first word now should be “mein”, so in German, “mein Name ist Lewis”. And so that would be the kind of main distinction of these two components. And it actually goes back to before transformers. So for people who were using like RNNs, there was a very influential kind of sequence-to-sequence paper by Ilya Sutskever and others at Google. And that’s where they kind of pioneered this approach. And it’s very good at modeling these kind of like, you know, input sequence, output sequence kind of tasks. 
00:13:43
And then basically there are two main branches off that encoder-decoder. The first big one was the GPT model from OpenAI. And so what they did was they said, okay, we’re really interested in generative tasks. And so for these, the more important part is the decoder, and maybe we can basically save some compute by just throwing away the encoder part of the original transformer. And then we just get the model to predict the next word in the sequence. And we don’t have to worry so much about, you know, this kind of sequence-to-sequence mapping. And that obviously turned out to be a very impactful branch, or type, of transformer. And these transformers are called decoder-only transformers. And then the other side of this was when Google a few months later basically released BERT. 
00:14:34
And BERT was the sort of first encoder-type transformer, where they did the opposite thing. So, they threw away the decoder, and then they said, let’s just focus on getting very good and rich representations of NLP, you know, sequences. And what that model can do very well is it can basically handle tasks where you’re trying to extract information. So, for example, let’s say you’re doing text classification: the representations that come from BERT or these encoder models are very good for that. You can do question answering. You can do named entity recognition. These kind of core NLP tasks, typically encoders do well on them. Whereas the decoder ones are typically, you know, where you want to do things like summarization or chat or, you know, things like that today. 
00:15:27
Now the boundaries are blurring because that was the story when we wrote the book. But then there were other models that came out. So, for example, T5 is a model from Google researchers, where they showed that you can actually frame most NLP tasks as a sequence-to-sequence task. So, if you ask, for example, how do I classify a movie review? Then the kind of conventional approach would be, okay, take your transformer encoder, feed in the review, and you now get some sort of embeddings from that. You can then sort of look at those embeddings and say, okay, can I measure the sentiment associated with that input? Whereas the T5 architecture is this encoder-decoder. And instead what they say is, well, you can formulate every task, like you can say, “classify the following review as positive or negative”, and then you put the review and then the decoder now has to output, you know, positive or negative as a word. And this model is far more versatile because you can now do many tasks with just the same architecture. But kind of traditionally I would say the field has typically, you know, split into these encoder and decoder branches. And so these T5 models are widely used, but at least in what I’ve seen in practice, people tend to kind of fixate on one or the other. 
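As a rough sketch of the three branches Lewis outlines, the Hugging Face pipeline API can load an example of each; the specific checkpoints here are common defaults chosen for illustration rather than models named in the conversation.

```python
from transformers import pipeline

# Encoder-only (BERT-style): rich representations for understanding tasks
classifier = pipeline("sentiment-analysis")   # defaults to a small encoder fine-tuned on sentiment
print(classifier("I loved this movie!"))

# Decoder-only (GPT-style): predict the next tokens in a sequence
generator = pipeline("text-generation", model="gpt2")
print(generator("My name is Lewis and I come from", max_new_tokens=10))

# Encoder-decoder (T5-style): frame every task as text in, text out
t2t = pipeline("text2text-generation", model="t5-small")
print(t2t("translate English to German: My name is Lewis."))
```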
Jon Krohn: 00:16:48
Nice. Yeah, that all makes perfect sense to me. Yeah. 
00:17:39
This episode is supported by the AWS Insiders podcast: a fast-paced, entertaining and insightful look behind the scenes of cloud computing, particularly Amazon Web Services. I checked out the AWS Insiders show myself, and enjoyed the animated interactions between seasoned AWS expert Rahul (he’s managed over 45,000 AWS instances in his career) and his counterpart Hilary, a charismatic journalist-turned-entrepreneur. Their episodes highlight the stories of challenges, breakthroughs, and cloud computing’s vast potential that are shared by their remarkable guests, resulting in both a captivating and informative experience. To check them out yourself, search for AWS Insiders in your podcast player. We’ll also include a link in the show notes. My thanks to AWS Insiders for their support. 
00:17:42
So, the encoder-decoder structure was the original concept, and I imagine that’s the Attention Is All You Need paper. 
Lewis Tunstall: 00:17:51
That’s right. 
Jon Krohn: 00:17:51
And so with that original transformer paper, I think quite naturally, it makes a lot of sense to think, okay, and it follows along with this concept that we’ve had in deep learning for a longer period of time, this autoencoder structure where you are taking some kind of information. So, in this case, it’s strings of tokens, strings of kind of words. But this idea of encoding information into an abstract space is something we’ve been doing with all different kinds of data types with autoencoders for years. So, you know, it could be an image or a video, it could be a sound wave, and you can encode it from the raw input information. So, pixels in the case of an image, and convert that into this abstract representation where, provided enough training data, that abstract representation is consistent regardless of the specific pixels. 
00:18:57
So, you could have, you know, the encoded representation could be like, you know, this is a brown dog by a red fire hydrant or whatever. And it’s abstractly represented, it’s not like written in language like that. It’s based on a location in this high-dimensional space. But we can go from, you know, you could have another image of a brown dog around a red fire hydrant, and the pixels could be completely different. Like there’s no relationship between these two images of that same scene, but the encoded representations could be very similar. And then the decoder structure takes that abstract representation and can return it back into a pixel-level version. And so yeah, that kind of idea of going from encoder to decoder, I can see how that’s where transformers started because it makes a lot of sense conceptually. 
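A toy sketch of the autoencoder idea Jon is describing, compressing an input into a small abstract code and reconstructing it; the layer sizes and data are arbitrary assumptions for illustration.

```python
import torch
from torch import nn

class TinyAutoencoder(nn.Module):
    """Compress a flattened image into a small latent code, then reconstruct it."""
    def __init__(self, n_pixels=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_pixels, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_pixels))

    def forward(self, x):
        z = self.encoder(x)        # abstract representation of the input
        return self.decoder(z)     # reconstruction back in pixel space

x = torch.rand(8, 784)             # a batch of flattened 28x28 "images"
model = TinyAutoencoder()
loss = nn.functional.mse_loss(model(x), x)   # train by minimizing reconstruction error
```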
00:19:52
It’s surprising to me, and I still have a hard time really wrapping my head around how GPT architectures in particular work with the decoder only. Like, because for me, it’s so sensible to think about that intermediate step where we have that encoded representation. And so it’s, yeah, there’s still this, there’s a bit that I still need to wrap my head around with these decoder-only structures that specialize in natural language generation, like the GPT family most recently GPT-4, and the other architectures that we have behind the ChatGPT models. And yeah, because they have this decoder only, they end up specializing in being able to predict the next word in a sequence very well. Whereas as you highlighted there, the encoder-only structures like BERT specialize in tasks that don’t require that kind of generation. So, more of a natural language understanding as opposed to natural language generation. 
00:20:52
And yeah, so as you said, there’s that natural language understanding. We create that abstract representation from the raw, natural language, and then that abstract representation can be used downstream for all manner of tasks. Yeah, so you gave lots of examples there, question answering, named entity recognition, text classification. And yeah, so to me, conceptually also the BERT thing, the encoding, the encoder-only also makes a lot of sense. I’m like, cool, yeah. We go from a string of characters to this abstract representation, and then we can do things with those abstract representations. We can compare the similarity and yeah, it allows for, you know, fast semantic retrieval of information, that kind of thing. Anyway, I think I’ve, I’ve now been speaking for a very long time, and I can kind of tell that you’re, that you’re ready to go. 
Lewis Tunstall: 00:21:42
Yeah, sure. I mean, I would say the thing that you mentioned about the decoder is predicting the next word or the next token in a sequence. It it’s a surprisingly hard task, right? So, if, if you, if you just give any human a random piece of text from the internet and say, you know, what is the next word in that sequence, I think you’ll have a hard time getting, you know, very good performance on that. And so my understanding of these decoder models is that typically because this task is, is rather hard when it’s done at scale the models kind of pick up enough like, let’s say surface level capabilities around, you know, grammar and all the like linguistic things that we do as humans online, that then when you want to do a task like, okay sentiment analysis or I don’t know, write me a recipe for scrambled eggs they’ve kind of seen enough examples where that kind of generation itself is relatively straightforward. Of course the hard part is that if you try to do things that are very out of domain I think you typically find that’s where, you know, not just these decoder models, but most of these models, they, they tend to struggle. So, they’re very impactful. But they still, you know, today have some, some fairly, you know, serious limitations. 
Jon Krohn: 00:23:04
Yep. Nice summary there. And that does actually really help conceptually. I think you might have just cracked it for me. That was a really elegant explanation as to how these generators, yeah, just because they are specialized in this next-word generation, that’s what the model weights are structured to be able to do. We don’t need that intermediate abstract representation. We can just skip right to predicting what the next word should be, as tricky as that can be. And so- 
Lewis Tunstall: 00:23:31
Maybe just one comment to make is that the models that come from this are these so-called pre-trained models. And these, these models, they’re kind of like very sophisticated auto-complete. But if you play with ChatGPT, it’s clearly, you know, much more than auto-complete. And so there is another kind of whole secret sauce of ingredients around reinforcement learning and how do you model like human preferences and things. And that’s kind of machinery that’s like sort of tacked on top of this, like, you know, predicting the next word. So, even though kind of mechanically we do say the model is predicting the next word in a sequence for these very impressive models that there’s, there’s a fair bit more, you know, happening. But we can talk about that later. 
Jon Krohn: 00:24:14
Nice. Yeah, we will talk about that later for sure. RLHF, really exciting topic. So yeah, before we get there, with respect to these kinds of tasks that transformers can perform, in your book you specifically highlight feature extraction as something that transformers are really great at. So, what is feature extraction and how do transformers differ from traditional ways that we might have extracted features in natural language processing? 
Lewis Tunstall: 00:24:51
Yeah, sure. That particular part of the book was, I think, born from the experience Leandro and I had working as data scientists at Swiss companies at the time. And you know, a lot of the time as a data scientist, you want to try the next fancy thing and you go, oh, I want shiny new toys. But then almost immediately, you know, your manager will be like, well, we’ve got no labeled data or we’ve got very little labeled data or something like this. And so then doing this whole, like, fine-tuning process tends to be a bit of a struggle. And so we showed in the book essentially how you can extract these embeddings from transformer models. And the idea here was to say that essentially, you know, the conventional way that people did this kind of pre-transformer time was to take a model like word2vec or, you know, some extension of this where you essentially had kinda like universal representations for like every word in the vocabulary. 
00:25:50
So, for example, the word like dog, you know, it was just one vector or one representation that you could use to build, you know, your kind of features that you would then build, say, a classifier on top of. And obviously these transformers they have this contextual representations, which means that the representation of dog will actually depend on the surrounding words in the sequence. And so when you do feature extraction using transformers, you get this kind of nice sort of representation of these embeddings, which pick up that contextual information, and then you can use those embeddings for downstream tasks. For example, we do text classification in the book, but as you mentioned before, a very common one is doing things like, you know, semantic search. So, if I want to embed all of the documents in my company, I can feed them through a transformer, I get vectors, and then I can compare, you know, which documents are semantically similar to each other. 
00:26:49
Now, what we did do in the book was sort of like the vanilla thing, which was like, take BERT and just feed, you know, in this case it was emotion tweets, through it to see, you know, what the kind of representations of these tweets are according to their emotion. But there are better models, for example, sentence transformers. They have a special kind of training process where it’s essentially a Siamese network of two transformers, kind of like learning how to model the semantic similarity of documents. And so if you ever want to actually do feature extraction for things like search and stuff, you’re much better off using these like special sentence transformers than, you know, just an off-the-shelf BERT. 
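For the feature-extraction and semantic-search use case Lewis describes, a minimal sketch with the sentence-transformers library might look like this; the checkpoint is a popular general-purpose embedding model, given here as an assumption rather than one named in the episode.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedding model

docs = [
    "Our refund policy lasts 30 days.",
    "The dog chased the ball across the park.",
]
query = "How long do I have to return a purchase?"

# Contextual embeddings for the documents and the query
doc_embeddings = model.encode(docs, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each document
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)   # the refund-policy document should score highest
```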
Jon Krohn: 00:27:33
Very cool, sentence transformers. I’ll do my best to remember to include a link to those in the show notes. That sounds super useful for people doing these kinds of applications. So, I guess this builds on the conversation we were having earlier with architectures like BERT being encoder-only and converting things into that [inaudible 00:27:57] space. We now have specialized approaches like sentence transformers that are even better for getting those abstract representations well aligned. Very cool. And this idea of the token dog being represented in an abstract, high-dimensional space as opposed to as like a one-hot encoded word, that’s kind of, yeah, in the traditional way of doing natural language processing, for that word dog, you might have needed like a taxonomy to say, okay, like, you know, dog is related to cat in this way, they’re all like in the animal family. And now with transformers, we can have this totally data-driven approach where we don’t need to be maintaining all these manual taxonomies. And it has way more flexibility because, as you say, when we come across the word dog in a sentence like, I ate a hot dog with relish, it doesn’t consider that dog to be in any way related to a cat.
00:29:00
And so, yeah, I wrote about this a lot in my book Deep Learning Illustrated, which came out many years before yours, so I didn’t have anything about transformers. But even in that era, working with LSTMs and approaches like word2vec or doc2vec (document to vector), that’s a big point that I make in my book: you’re gonna get way better results using deep learning and this data-driven approach as opposed to trying to manually hard-code meaning in natural language processing. And yeah, there’s so many benefits, obviously, in terms of human time on tasks and just in terms of quality. Like, it just ends up working way better, as probably most of our listeners have now seen with tools like ChatGPT. 
Lewis Tunstall: 00:29:53
Yeah, I totally agree. And it’s funny to think that I think we’re old enough to have been the generation who lived, you know, pre and post transformers. So, you know, I remember doing NLP in the, in the ancient days where you had to think about stemming and you know, how, how you actually pre-processed your data. Like if you extract, you know, punctuation and stuff and you have all these nightmares and you, you’re not quite sure if it’s gonna work. And then when you suddenly have a transformer and you just say, well, I just tokenize it. And more or less, you know, for most tasks the fine-tuning will work. That was for me, quite a big, you know, update to my way of working. 
Jon Krohn: 00:31:09
The future of AI shouldn’t be just about productivity. An AI agent with a capacity to grow alongside you long-term could become a companion that supports your emotional well-being. Paradot, an AI companion app developed by WithFeeling AI, reimagines the way humans interact with AI today. Using their proprietary Large Language Models, Paradot A.I. agents store your likes and dislikes in a long-term memory system, enabling them to recall important details about you and incorporate those details into dialog with you without LLMs’ typical context-window limitations. Explore what the future of human-A.I. interactions could be like this very day by downloading the Paradot app via the Apple App Store or Google Play, or by visiting paradot.ai on the web. 
00:31:12
Yeah, exactly. The meatiest chapter of my book was a chapter on all these NLP pre-processing techniques that you needed to go through. And it was actually funny, when the book was being copy edited, when the copy editor finished that one, she was like, oh my goodness, it was such a crazy journey. It was so complicated. It’s such a long chapter, and now, yeah, it’s probably just some one-liner that I can do with the Hugging Face Transformers library, I don’t need to worry about it at all. Cool. So these kinds of conversations that we’re having about what’s going on inside a transformer model, or even more broadly within a deep learning architecture, why should a practitioner care, or should a practitioner care? Like, why does somebody need to understand how a transformer works, Lewis, if they’re working on NLP problems? 
Lewis Tunstall: 00:32:14
Yes. I think it’s a bit of a philosophical question because in some sense it depends a little bit on, you know how, how deep and curious you want to go into, into a topic. So I would say at a very technical level if you’re training transformers, so whether you’re fine tuning them or, or especially if you’re pre-training them at some point you’re gonna hit some errors and, and those errors are gonna be, maybe the data is not set up right, maybe, you know, you have things on the wrong CUDA device, all this annoying stuff. And when you start looking at the stack trace, you’re gonna see some like you know, lines of code. It’s gonna say, “Hey, in modeling_BERT on this line, in this attention layer, there’s a problem.” And at least for me personally, having an understanding of how the computations are running in the network, it helps you kind of iterate faster through and debug things much quicker. 
00:33:11
And that’s more just like the practical side of things, but then at the sort of like, let’s say more fundamental level, it’s like any other piece of knowledge, right? Like if you are trying to build something, it’s really, really useful if you know how like, the things you’re building with, work, because not only does it help you, as I said before, debug stuff, but it also helps you think about how you can extend them. Because if you sort of never go lower than just the sort of high-level API transformers, you may encounter some tasks in your work where you need to do more sophisticated things, like, for example, you know, blend different types of heads on the transformer for like multitask training, and then that’s gonna get to a point where it’s gonna be very, very useful to have, have a good understanding. 
00:33:57
So, I would say that those are roughly the sort of two main things I would suggest. And to be honest, in general, it’s just fun, right? It, it’s good to, at least for me, intellectually, it’s fun to know, you know, how, how these things, you know, work. And I do recommend everyone just once in their life implement a very simple transformer like we do in the book just in the same way that, you know, everyone has to implement backprop once in their life. This is, I think, you know, the next, the next step. 
Jon Krohn: 00:34:24
Nice. Yeah, I agree with you on both your points, for debugging as well as for building creatively. It’s the second one that I in particular think is valuable. The more that a data scientist can dig into the underlying fundamentals, like the linear algebra, the partial-derivative calculus, and the probability theory that underlie machine learning, including deep learning, which is a kind of machine learning, and transformers, which are a specialized deep learning architecture, the more they see that, fundamentally, under the hood you have these relatively simple mathematical operations going on. And if you understand those things that are going on under the hood, it allows you to have way more flexibility and creativity in the ways that you can be solving problems. So, like you gave that example there about blending attention heads for a multitask problem, for training a multitask architecture, and there’s an unlimited number of ways.
00:35:21
Like, who was giving this analogy recently? I think it might have been Harpreet Sahota, who was in episode number 693. So, just a week ago that episode was released. And in that episode he talks about, like, Lego blocks, I think he has young kids. And so you know, when you understand what all of these different blocks are, it allows you to have an effectively infinite amount of flexibility in how those blocks are combined and the things you can do with them. And I see this with my team at my company Nebula, on our data science team. When we’re trying to, in particular, productionize what we’re doing in order to have that work efficiently for the specific use case that we have at the platform, almost every time there isn’t some turnkey solution. Like, there’s lots of tools out there that allow you to, with one line of code, productionize your model from a Jupyter Notebook or whatever. And those tools are great, and they’re really amazing, but they also only work in a relatively narrow set of circumstances. There’s all kinds of situations that we encounter regularly in production, and I imagine lots of companies do, where there is no turnkey approach, you’re going to need something unique that nobody has ever done before in order to have a performant, real-time experience for your users that blends together all of the kind of backend things that are going on.
00:37:00
And, yeah, you know, I’m a huge evangelist, I guess, for understanding the building blocks, so much so that regular listeners will know I have this Machine Learning Foundations series that is mostly available on YouTube already. And there’s a GitHub repo where all the code is available that covers linear algebra, calculus, probability theory, algorithms and data structures, and statistics. Because yeah, I think it’s so important, it’s so fundamental to know these fundamentals. 
Lewis Tunstall: 00:37:30
Totally agree. 
Jon Krohn: 00:37:31
Nice. All right. So, thank you for letting me talk so much. 
Lewis Tunstall: 00:37:33
No worries. It’s interesting. Good stuff. 
Jon Krohn: 00:37:37
So, all right, let’s move on to another topic from your book. So, one of the challenges that we encounter as data scientists, particularly when we’re working with large amounts of data like we want in transformer architectures, and you already mentioned this kind of thing earlier when you were talking about how you and Leandro were trying to come up with feature extractors, or trying to come up with some model, and your manager would say, oh, but we don’t have any labeled data. This is super common in natural language processing, that we have access to large amounts of data, like you just did a scrape of all of the internet, but we don’t have any labels for those data. So, we just have the sequences. We don’t have, you know, labels for whatever task, you know, it could be some classifier task where a common example is sentiment. So, you know, is this tweet or is this movie review a positive review, or is it a negative review? You know, we might have access to billions of tokens, but maybe only a few hundred of these labels or maybe none of these labels. So, I know this is something that you talk a lot about in your book. Would you mind sharing some of your favorite strategies for tackling this common NLP problem? 
Lewis Tunstall: 00:39:07
Sure. And this was, I think, born out of pain, basically. I mean, there’s a pain that you have when you’re a data scientist, you know, trying to solve a business problem with little labeled data. So, I’ll tell you the way we kind of did it in the book, and then I’ll tell you a little bit about what’s changed since, I think, the advent of, you know, ChatGPT and GPT-4, which for me personally have kind of made me rethink a little bit how I would tackle this. So the first one is if you’ve got like no labeled data, and this can be relatively common, especially for extractive tasks like named entity recognition or question answering, because the price of labeling the data is quite high. There aren’t a huge number of tools available yet for tackling those kinds of tasks. 
00:39:59
And there, you might be better off just going for a generative model like GPT-4 or ChatGPT and saying, “Hey, here are a few examples of what I’m trying to do.” So, this is so-called few-shot prompting, “can you please, you know, now complete the final task?” And these generative models are quite good at following those types of instructions. But if you’re doing something that is more like, say, text classification, then there’s far more tooling available for this. So, for example, in the Transformers library, we have zero-shot pipelines, or zero-shot classification pipelines. And what these pipelines do is they basically formulate the classification task as what’s called an NLI task, or Natural Language Inference task, where you take the context, which is the thing you’re trying to classify. You have a sentence that is like a template to say, you know, is this positive or negative? And then you get the model to fill in the third part of that sequence. And this personally has always been like a good baseline. You just run this, it’s like two lines of code. You run it over your data set, it gives you a rough idea of where you are. 
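The “two lines of code” baseline Lewis mentions looks roughly like this with the Transformers zero-shot classification pipeline; the BART-MNLI checkpoint is the usual default and is given here as an assumption rather than a quote from the episode.

```python
from transformers import pipeline

# Zero-shot classification reframes the task as natural language inference (NLI)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The battery died after two days and support never replied.",
    candidate_labels=["positive", "negative"],
)
print(result["labels"][0], round(result["scores"][0], 3))   # expected: "negative"
```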
00:41:12
But then if you want to go beyond that, there are probably two approaches I would recommend. One is called SetFit. It’s a technique I developed with researchers at Intel, and it’s short for Sentence Transformer Fine-tuning, essentially few-shot learning for sentence transformers. And essentially we showed that you can classify documents across different domains with usually around 8 to, you know, 16 examples per class. And you get results that are fairly comparable to training on the full dataset. And this, as I mentioned, works well for text classification. 
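A minimal SetFit sketch, assuming the pre-1.0 setfit API with SetFitModel and SetFitTrainer; the checkpoint and the tiny hand-made dataset below are purely illustrative.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# A handful of labeled examples per class is the whole point of SetFit
train_ds = Dataset.from_dict({
    "text": ["A gorgeous, moving film.", "Loved every minute.",
             "A tedious mess.", "I want those two hours back."],
    "label": [1, 1, 0, 0],
})

# Start from a sentence-transformers checkpoint, fine-tune it contrastively,
# then fit a lightweight classification head on the resulting embeddings.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

print(model(["One of the best films of the year."]))   # expect label 1
```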
00:41:53
But if you want to go beyond that, then there are these other techniques called parameter-efficient fine-tuning techniques. And here the idea is to use a transformer like T5, which, as I discussed briefly, is kind of a general-purpose transformer that can solve many tasks. And then you try to basically prompt it in a certain way so that, you know, if you’ve only got a couple of labeled examples, you can still solve, say, the question-answering task in a fairly efficient way. And so I would say those are the sort of two main things I would recommend. But today, seriously, using these large language models is also an option. You know, if you don’t have security concerns with your data because of your company, just testing the API is often a good start, I would say. 
Jon Krohn: 00:42:45
Deploying machine learning models into production doesn’t need to require hours of engineering effort or complex home-grown solutions. In fact, data scientists may now not need engineering help at all! With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy() in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com, that’s M-O-D-E-L-B-I-T.com 
00:43:24
Yeah, I couldn’t agree with you more on all of the solutions that you suggested. The one that I didn’t know, of the ones that you’ve mentioned, is SetFit. I just quickly looked it up and it looks super cool, so I’ll be sure to include a link to the SetFit GitHub repo. Yeah, looks like a really cool way to be using few-shot learning without needing to prompt yourself in order to classify sentences. And then parameter-efficient fine-tuning, PEFT, is something I’ve talked about on the show a fair bit. I have an episode dedicated to the low-rank adaptation (LoRA) method for doing that, episode number 674. 
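A minimal sketch of the LoRA flavor of parameter-efficient fine-tuning using the peft library; the base model and hyperparameters are illustrative assumptions, not values from the episode.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative small base model

# Freeze the base weights and learn small low-rank update matrices instead
lora_config = LoraConfig(
    r=8,               # rank of the low-rank adapters
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the base model's parameters
```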
Lewis Tunstall: 00:44:02
Very cool. 
Jon Krohn: 00:44:05
But yeah, it is crazy how you can be using a tool like GPT-4, an API like GPT-4’s, particularly for complicated tasks. Like, when we were using GPT-3.5 prior to March of this year, there were all manner of tasks where I was like, wow, it would be amazing if I could just ask a model to do this. And, you know, GPT-3.5 might be able to do it a portion of the time, but not with an accuracy that was high enough that I could be confident about using those data for training a model. But then with GPT-4, overnight with its release in March, I was like, “Oh, let’s try some of those use cases again.” And it nails it every single time. But yeah, as you say, there are potentially reasons why you might not want to use GPT-4, so your company might not be comfortable with sending those data off. 
00:45:00
And then also the OpenAI terms of service do not allow you to be using GPT-4 to create a competitor to GPT-4. So, it depends on exactly, you know, if you’re not gonna be creating a chatbot with the data that you label, then it’s probably fine. I’m not a lawyer. This is not legal advice. But yeah, and we’re gonna talk about this more in the episode later on, but we’re getting really powerful open-source alternatives to GPT-4 emerging every week. As you and I were talking about before we started recording this episode, every week, it seems, there’s some major release of an open-source approach that gets closer and closer to being as good as GPT-4. And some of those don’t have commercial-use constraints. And so it might be the case that, by the time this episode is live, you can be using a completely open-source, commercial-use model that’s as good as GPT-4, and then you can be running it on your own infrastructure. You don’t need to be worrying about sending proprietary data off to a third party, and you might get comparable results. 
00:46:21
So, really, really, it, yeah. Something that you and I were touching on also before starting recording is just that with how quickly things are moving and these capabilities that are emerging from so many people like yourself, getting so deep into the open-source opportunities here and releasing these models, these capabilities for all of us, it’s an unprecedentedly exciting time for me in my career as a data scientist. 
Lewis Tunstall: 00:46:51
Oh, that’s cool to hear. I just had one more thing on where you might want to use an open-source model. So I was playing with ChatGPT the other day and I wanted to see if I could use it as a writing assistant. And so I started just taking some passages of text from George R. R. Martin’s Game of Thrones. And I just asked it, you know, can you, like, complete this part of the text, or rephrase it. And because Game of Thrones is so gory and, you know, violent and all that stuff, it just refused. It said no, like, as a language model, I will not engage in blah blah blah. And I think what this shows is that for these sort of next frontier models, which have a lot of this so-called alignment built into them, it’s great to have that for general-purpose chatbots, but if you want a very domain-specific thing, you probably want to have something that is more adapted to your data or the things that you’re interested in. 
00:47:56
And so I can imagine a future where you have like these kind of very powerful capable systems like from OpenAI and others. But then companies use a lot of open-source models then to just do the more domain-specific stuff where, you know, for the reasons you mentioned about data leaving, but also the use case itself may not be supported, you know, just through the API. 
Jon Krohn: 00:48:18
Right. When you need more violent language. 
Lewis Tunstall: 00:48:20
Yeah, sorry, that’s maybe not the best example. But you know, the thing is, I want George R. R. Martin to finish his book, right? I’ve been waiting for Game of Thrones for years now and I just want him to finish the last one. So, you know, if he’s listening he should just use ChatGPT or he [crosstalk 00:48:34] 
Jon Krohn: 00:48:35
No, it’s a good example. Like, the folks at OpenAI spent six months from when the GPT-4 architecture was trained to put barriers around it in terms of security. And you know, I think they’ve done an exemplary job. In retrospect, I think it’s amazing that they spent those six months. Because I imagine that there would’ve been, and this is completely speculative, no one has ever said anything to me to suggest this, but I just speculate that, you know, in a big organization like that, there were probably some people that were like, this is safe enough, we’ve got to get this out. This is crazy. But then, you know, some other factions were able to be like, no, there’s still these really dangerous use cases that we need to handle. We need to do more testing before this goes out. 
00:49:24
But it does mean that, you know, there are all kinds of perfectly legitimate classic books, like you’re saying, like the Game of Thrones series, you know, people love that series. But it has a level of violence that is completely fine to buy in a commercial product. Yet, you know, the folks at OpenAI have decided, you know, to have these safeguards that mean that level of violence that’s okay in a commercial product that you can buy as a book is not acceptable in their particular tool. And so, you know, there are probably violent use cases that we don’t want a chatbot to be able to do under any circumstances, but being able to generate, you know, fantasy-novel prose maybe shouldn’t be one of them. 
Lewis Tunstall: 00:50:10
Yeah, totally. 
Jon Krohn: 00:50:12
All right, so Lewis, once we’ve created our George R. R. Martin bot that can generate violent prose, when we want to make a model efficient in production, that’s something that’s a huge challenge with transformers, and I know that it’s something that you tackle in your book. So, could you outline for our listeners the kinds of practical things that we can do to take these very large language models and have them be useful in real-time in production for, say, a user of a platform that has an LLM running in the background? 
Lewis Tunstall: 00:50:54
Yeah, sure. I would say also the kinds of techniques I mention often depend very much on the use case and also the size of the model. So there are things that kind of work okay for smallish models, but then they just don’t really apply at larger scales. And in the book we kind of cover, I would say, three main topics or areas. One is called knowledge distillation, and this is an old idea that goes back to Geoff Hinton and others before him, where the idea is that you’ve got a capable model. So, this might be something that is like, let’s call it the teacher, and this model is too big to deploy efficiently. Maybe you want to deploy something on the edge or, you know, in a cheaper way. And so this knowledge distillation technique allows you to basically take essentially information from this teacher model and kind of imbue it in a much smaller, more efficient model. And when it works, you typically get comparable performance. You take a small hit in, say, your accuracy. But often the trade-off is worth it because, you know, in real-life situations, accuracy isn’t the only metric; you’re worried about latency, you’re worried about cost and things like that. So this works quite well for models in the sort of hundred-million-parameter range, so things like BERT. It works well there; for small GPT models, it works okay. But no one has kind of figured out how to crack this effectively at the very large scales of, like, you know, tens of billions of parameters. 
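To make the teacher/student idea concrete, here is a generic sketch of a distillation loss in PyTorch, blending soft teacher targets with the usual hard-label loss; it is a textbook formulation rather than the exact recipe from the book.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on the hard labels with a KL term that pulls the
    student's softened distribution toward the teacher's."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)   # rescale so gradients stay comparable across temperatures
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Illustrative shapes: a batch of 4 examples over 3 classes
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
distillation_loss(student_logits, teacher_logits, labels).backward()
```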
00:52:27
And that’s why, for example, we haven’t yet seen, as far as I know, something analogous to, like, you know, a distilled LLaMA 65B or something like that. So, that will often get you roughly maybe a 2x reduction in latency, and you can usually compress your model by about half. And then there are other techniques which we discussed. So, the most common one that’s used for many use cases is called quantization. And the basic idea here is to take the precision of the weights that the model was trained in and just cast them to a lower precision. So, typically things like 8-bit or 4-bit is now the sort of new standard. And then, because you’ve now got lower-precision weights, you can basically, you know, do your matrix multiplications faster and use less memory. And there are a bunch of different quantization strategies we can talk about. 
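A sketch of loading a model with 8-bit weight quantization as it is commonly done in the Hugging Face stack via bitsandbytes; treat the exact flags and checkpoint as assumptions about current library versions rather than quotes from the book.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the pretrained weights in 8-bit precision instead of 16/32-bit floats,
# roughly halving memory (or better) at a small cost in accuracy.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                            # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",                 # assumes a GPU plus the accelerate/bitsandbytes packages
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(model.get_memory_footprint())    # compare against the full-precision model
```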
00:53:26
But the sort of other element we mentioned is this idea of pruning. So, here the basic idea is how can you kind of delete weights in the network altogether but still preserve the sort of overall performance of the model. And when we wrote the book, the sort of current state-of-the-art was a technique developed at Hugging Face called movement pruning, which is basically a pruning technique designed specifically for fine-tuning transformers. But the sort of consumer hardware that existed basically didn’t really help you, because even though you delete all these weights, you need to store them as sparse matrices, and these sparse matrices don’t really get any speed boost on standard, like, you know, Intel hardware. So we kind of concluded that pruning at the time wasn’t quite mature enough to be used in production. But my colleagues at Intel have said that, you know, they’ve now got some quite impressive approaches where you do get genuine sparsity and, you know, low latency. So, those are the three main techniques and we can kind of dive deeper if you want. 
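Movement pruning itself needs a dedicated training loop, but a simpler magnitude-pruning sketch using PyTorch's built-in utilities illustrates the basic idea of zeroing out weights; this is a generic example, not the Hugging Face movement-pruning implementation.

```python
import torch
from torch import nn
from torch.nn.utils import prune

layer = nn.Linear(768, 768)          # stand-in for one dense layer of a transformer

# Zero out the 30% of weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")        # bake the pruning mask into the weights permanently

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")   # ~30% zeros; real speed-ups need sparse-aware kernels/hardware
```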
Jon Krohn: 00:54:39
Yeah, yeah. Model distillation, quantization, pruning. Yeah, these are the three that come to mind for me as well in terms of production deployments. Great examples that you gave there, and cool to have you break down for us so clearly the kinds of circumstances where one of these approaches might work well versus another. I think I will actually leave that topic there and not dig too much deeper, because there are still so many more things that I want to get into while we’re recording, 
Lewis Tunstall: 00:55:08
But maybe I could mention one thing specifically about generative models. So these techniques I mentioned are very generic, and you can usually apply them whether it’s an encoder or a decoder; it doesn’t really matter. But one of the big bottlenecks when you’re doing chatbots is having a fast response, right? So, if I ask a question like, you know, what’s the weather like today, I don’t want to wait a minute to get my text back. And there’s been a lot of cool innovation around streaming tokens, so this idea of sending the user the answer bit by bit. And that’s what you see in ChatGPT, right? You don’t have to wait and then get the full answer; you can see the answer being generated on the fly. And one of my colleagues at Hugging Face called Olivier built this very, very cool server called Text Generation Inference, which not only does this token streaming, but it does really impressive optimizations of the transformer architecture. So, you can do things like fuse operations in the transformer to basically run faster with certain CUDA kernels, and you can do clever things with how you shard the model across different GPUs. So if you’re doing any sort of generative text task, as far as I know today this is the current state-of-the-art for deployment. 
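Text Generation Inference is a standalone server, but the token-streaming idea can be sketched locally with the transformers library; the checkpoint below is just a small placeholder model.

# Local illustration of token streaming using transformers' TextIteratorStreamer.
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("What's the weather like today?", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# Run generation in a background thread so we can consume tokens as they arrive.
thread = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=50))
thread.start()

for chunk in streamer:              # yields decoded text piece by piece
    print(chunk, end="", flush=True)
thread.join()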
Jon Krohn: 00:56:32
Nice. Yeah. What was the name of that library or approach? One more time. 
Lewis Tunstall: 00:56:35
Text Generation Inference. 
Jon Krohn: 00:56:37
Wow. Super cool. Lewis, I had not heard of the Hugging Face LLM Text Generation Inference library before, but I will definitely be checking that out because it sounds like exactly what I need for a lot of use cases at Nebula with our production deployments. Thank you so much for sharing that. And it ties in perfectly with the next topic that I wanted to cover, which is the Hugging Face ecosystem in general and all of these open-source tools that Hugging Face releases for the public. So, what is the Hugging Face ecosystem and what role does it play in the practical application of transformers in NLP? You’ve obviously given us a bit of a taste here already with the Text Generation Inference library, but that only scratches the surface. I mean, the Hugging Face Transformers library is fundamental to this entire movement. 
Lewis Tunstall: 00:57:33
That’s right. Yeah. And just I think last week it crossed a hundred thousand GitHub stars, so that was a pretty nice milestone. I think it’s one of the first machine learning libraries to hit that. And as you said, right, about the origins of Hugging Face, for those who don’t know, Hugging Face started out as a chatbot company building a chatbot for teenagers. And then Thomas Wolf and Victor Sanh saw this transformer release from Google, BERT, and they were like, okay, we need to put this in PyTorch, because, you know, TensorFlow is not what we want to program in. And so they did a fast port of that to PyTorch, and then it just exploded. So, you know, I think it coincided at almost the perfect time, when the community was very quickly getting excited about PyTorch, and they had seen, you know, the performance of BERT and now they could just run this themselves.
00:58:32
And of course, the first thing that you face when you’re trying to build such a library is, okay, where do you get models from? And in those days, a lot of models were basically shared on Google Drive or on GitHub. And the challenge that you have is that, as the field moves very fast, how do you kind of synthesize all of these different pre-trained models? And this gave birth to the idea of the Hugging Face Hub, which originally started off as a model hub where basically you had the pre-trained weights of BERT and, you know, the other transformers that followed it. 
00:59:09
And then you had a very nice integration between the Transformers library and the Hugging Face Hub. So, you could basically pull models from the Hub, run them locally on your machine, and you could also then, you know, push your trained models back to the Hub so that you didn’t have to, again, share them with your colleagues with a Google Drive link. You could just say, “Hey, check out my model, you can now test it yourself.” And in machine learning, right, models are often the focus of attention, but in reality there’s a much wider range of things that you have to worry about. So, of course there’s data: where do you get your training data from, and how do you curate that data? And so eventually the Hub expanded in scope to also host datasets. So, we now have several tens of thousands of datasets, and these datasets are contributed primarily by the community. 
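That pull-and-push workflow looks roughly like the following sketch; the repository names are placeholders, and pushing assumes you have authenticated with the Hub first.

# Pull a pre-trained model from the Hub, fine-tune it, and push it back.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ... fine-tune on your own data here ...

# Share the resulting weights so colleagues can reuse them directly.
model.push_to_hub("my-username/my-finetuned-bert")      # placeholder repo id
tokenizer.push_to_hub("my-username/my-finetuned-bert")  # placeholder repo id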
01:00:00
So, we have some very cool ones, like, you know, the classic ones from NLP, but nowadays it’s not just NLP but many modalities. So, we have vision datasets, we have time series datasets, and NLP of course. And what’s cool about this is you get this kind of ecosystem building, where people go, oh, now I can take a dataset from the Hub, I can take a pre-trained model, I can train a new model on that combination, I can push that model back to the Hub, and then other people can build on top of that or they can feed it into their demos. And so the ecosystem today is, broadly speaking, a collection of open-source libraries built around the Hub as a kind of shared layer. So, the Hub is basically tightly integrated with all of these open-source libraries. And the kind of mission we have at Hugging Face is to provide these tools to the community so they can then go and build, you know, cool companies, cool products. And, you know, we have our own paid services on top of this, but at the core of the company, it’s fundamentally open-source. 
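On the datasets side, the same loop is a couple of lines; the dataset below is a classic NLP benchmark chosen purely for illustration.

# Pull a dataset straight from the Hub and inspect an example.
from datasets import load_dataset

dataset = load_dataset("imdb")
print(dataset["train"][0]["text"][:80])
# A community member could likewise share a curated dataset with
# dataset.push_to_hub("my-username/my-dataset")  # placeholder repo id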
Jon Krohn: 01:01:05
Very cool. That was a great breakdown. And there were some details in there that I wasn’t aware of, particularly with respect to the initial history of where the Hugging Face Hub emerged from. However, I did know about the hundred thousand GitHub stars, and yeah, there are only a few machine learning libraries, like PyTorch and TensorFlow, that have that many.
Lewis Tunstall: 01:01:25
Yeah, exactly. 
Jon Krohn: 01:01:25
So very, very cool. Yeah, we’re really grateful for your work, stretching back even pre-pandemic. I don’t know the exact year, but I know it was pre-pandemic because I was still working in an office, which I haven’t since. And a colleague of mine, Grant Beyleveld, so I guess it was around 2018, 2019, was just marveling at all the cool things that Hugging Face was doing. He was like, this is the coolest machine learning company in the world. So, awesome that you work there; it must be an amazing atmosphere. So specifically, you are a Machine Learning Engineer at Hugging Face. What does that mean? How does that intersect with the kind of stuff that you write about in your book, and what are some of the exciting projects that you’re working on? 
Lewis Tunstall: 01:02:20
Yeah, so at Hugging Face, the roles that we have are quite broad in scope. So, even though formally I’m an engineer, I’ve also done a lot of work on education. Previously I worked on a course for transformers that we offer to the community. And nowadays what I focus on is typically around the research side of things: how can we build tools and artifacts around this domain of RLHF, which we mentioned earlier? So, I would say depending on which branch of the company you’re working in, a day in the life can look a little bit different, but more or less we all collaborate over our open-source repos. And this can range from, you know, building features for libraries like Transformers to, you know, just patching bug fixes and so on. 
01:03:12
But the core goal, in general, is to always try and pick the most impactful projects to work on. And as a result, this means that you have to be very reactive to what’s happening externally to the company. So, for example, when Stable Diffusion landed, my colleagues at Hugging Face very, very quickly had this Diffusers library, you know, with the integration of this model from Stability AI. And when I say quickly, I’m talking on the scale of days to weeks. So, you know, you have to move very, very fast to keep up with what the community’s doing. And we see that also today with large language models. You know, as we said before, we have all these new models landing, so within the Transformers side of things, you need to be able to quickly decide whether to integrate one into the core library or not. 
01:04:09
So that’s, in broad terms, what I do. Specifically today, as I mentioned before, it’s more around trying to figure out if this reinforcement learning stuff actually works. So, we have, I would say, a few existence proofs from OpenAI and Anthropic that it does. And, you know, talking to ChatGPT gives you a sense that it does work. But we haven’t yet seen in the community a very clear end-to-end example showing that not only does reinforcement learning work in a technical sense, but it actually makes for a better model that is, you know, more aligned with human preferences. And there have been a few attempts to do this, but the conclusions have always been a bit murky, because the evaluation of these systems is very complex. And so primarily what I’m looking at now is this aspect of training and evaluating these, you know, more complex beasts. 
Jon Krohn: 01:05:12
Yeah. Could you break down for us a bit more this RLHF concept? So, you know, what is it, what’s involved, what data are needed? You know, why did people try this at all in the first place? 
Lewis Tunstall: 01:05:24
Sure. So, again, OpenAI were the pioneers here, and they actually built towards ChatGPT in several kind of impressive papers. So, their first foray in this direction was learning to summarize. And what they were interested in was: we know that language models, especially generative models, are good at, you know, generating summaries, but people often complain that these summaries aren’t very good. So, when you try to measure how good a summary is, you have some automatic metrics like the ROUGE score, which tries to measure the overlap of your summary with a kind of reference summary. But generally speaking, people had always kind of recognized that summarization models weren’t great. So, what they did instead was they said, well, why don’t we get the model to generate some summaries and show those summaries to humans, and then we’ll get the humans to rate which of the summaries is best. 
01:06:28
And so the idea was that instead of trying to use some metric like ROUGE, which always has some, you know, limitations, the thing we really care about is people reading summaries. So, let’s just teach the model how to learn that. And the recipe is relatively simple on paper. Basically, you take your summarization model, you generate some summaries, you show them to humans, they label them, and then you train a second model, which is basically a classifier. This is called a reward model, and this classifier is basically learning how to distinguish good and bad summaries. And then what you do is you take those two pieces and you do a third step, which is where the reinforcement learning comes in. Essentially what you’re trying to do is optimize the model to produce better summaries. And reinforcement learning essentially has a loop where you generate some summaries from the model, your reward model will basically rank them and say, okay, that’s a good summary, that’s a bad summary, and that gives you a signal to update the weights of the model in a direction that is more aligned with whatever the reward model is telling you.
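As a minimal sketch of the reward-model step in that recipe, the model is typically trained on pairs of outputs with a ranking loss so that the human-preferred completion gets the higher score; the shapes and names below are illustrative assumptions, not something taken from the episode.

# Pairwise ranking loss for a reward model: score the preferred and the
# non-preferred completion with the same model, and push the preferred
# score to be higher.
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    # chosen_rewards / rejected_rewards: one scalar score per example.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# The third step (the RL loop) then generates completions, scores them with
# this reward model, and nudges the policy toward higher-reward outputs,
# typically with PPO as in the original OpenAI papers.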
01:07:39
And if you do these three steps, what they showed in their paper is that the resulting models were preferred much more by humans for summaries than, you know, the baseline. And that’s the recipe that most people today are trying to follow, but now at much larger scales, and not just for one task like summarization, but for, you know, multiple tasks. And the modern version of that recipe is that instead of having just summarization data, you now try to collect a large amount of what’s called instruction data. So, these are things like, “write me a recipe for an omelet,” “give me 10 things to do in Paris,” all these kind of very creative tasks that we have as humans, or, you know, “how do I write Python code for X?” And you train a model that is able to follow those instructions, but this model will always have this kind of problem that it may, you know, produce outputs that are a bit problematic, or it just veers off in the wrong direction. And so you do, again, that human preference step, the reinforcement learning step. And then if everything works, you should get something like, you know, ChatGPT. But no one has quite succeeded yet, and I think that’s where there’s a bit of an arms race at the moment in the open-source community to see who gets there first. 
Jon Krohn: 01:08:55
Yeah. Very cool. So, I’ll quickly try to summarize back to you what RLHF is, or kind of paraphrase it, and then let’s dig right into that exciting arms race. So, the idea with this Reinforcement Learning from Human Feedback, at a high level, is that humans provide feedback. So, probably most of our listeners have used ChatGPT, and if you haven’t, you have got to. A study actually recently came out showing that only something like 15% of Americans have used ChatGPT; hopefully in the data science community it’s above 90%. And if you’re listening to this right now and you haven’t used ChatGPT yet, you’ve got to. Or maybe, if you’re using, like, the GPT-4 API but haven’t used the ChatGPT interface, I will forgive you. But in the ChatGPT interface, you have the opportunity after every single output that you get from the model to give it a thumbs up or a thumbs down.
01:09:51
And so that thumbs up or thumbs down can then be used as training data for this RLHF. And Lewis outlined the steps as to how this happens in more detail. But the summary point is that it allows the model to have outputs that are more aligned with the kind of thing that you would like to see. So, going way back to earlier in our conversation, this means that these state-of-the-art generative models like GPT-4 are more than just a sophisticated auto-complete, because there’s this additional layer of sophistication, at least this one, maybe even more that we don’t know about. That means that the outputs are more like what you expect in a conversation, maybe with another human, or maybe not even with another human, but just the kind of output that you want when you provide the kind of input that you do. 
01:10:55
And because of how popular ChatGPT is, there’s a huge amount of this training data, presumably, that allows OpenAI to be building a moat around what they’ve done. We have, however, seen a lot of open-source groups. So, there are lots of open-source models that have come out in recent months that have built on things like the LLaMA architecture that I talked about back in episode 670. And so they’re doing things like taking that LLaMA architecture, which was like just a sophisticated auto-complete, and then using instruction fine-tuning afterward, using open-source versions of these kinds of thumbs up, thumbs down human data, in order to fine-tune LLaMA to be more like GPT-4. So, some of these models are Alpaca, and Vicuna is one that is really popular. And there are ones that also have permissive commercial-use terms, so things like GPT4All-J are completely suitable for commercial use. But anyway, the main point is that with RLHF, yeah, we get way better models, and it’s really cool that there are folks out there, with the relatively limited open-source data, relative to what someone like OpenAI probably has, doing the best that we can to approximate the way that GPT-4 performs. And that brings me to my next question, Lewis, which is that, as we talked about, these are really exciting times. We have people like you, like everyone at Hugging Face, and thousands of other people around the world racing to build open-source tools that are as good as GPT-4. Maybe it’s even conceivable, and this isn’t actually something that I’ve thought out loud before, so I’d love to hear your input on this, maybe it’s even conceivable that the next big breakthrough in these conversational agents, or in generative AI, or in machine learning in general, will be open-source as opposed to coming from a commercial entity like OpenAI. 
Lewis Tunstall: 01:13:20
Yeah, I think that’s definitely possible. And we already see a wide variety of directions that the community has taken to tackle some of the engineering challenges. So, for example, we talked briefly about LoRA, these Low-Rank Adaptation methods. This is kind of the driving trend at the moment in all of these instruction fine-tuning experiments that the community’s doing, because, for example, if you want to try to fine-tune LLaMA 65B, so 65 billion parameters, you’re gonna need several hundred gigs of GPU memory, and for the average person, right, that’s kind of out of reach. And just recently Tim Dettmers and his collaborators, he’s a really impressive PhD student at the University of Washington, they wrote a paper called QLoRA, so this is quantized LoRA. And they showed that, you know, with a 4-bit quantization you can run and even train, you know, LLaMA 65B on a kind of, you know, consumer-grade GPU. And I think those kinds of innovations are things that you wouldn’t see from a private company, because it would be your competitive advantage, right? Why would you share that knowledge with the community? And it just shows that when you’ve got a tough problem, which is, like, how do you train large models with limited resources, people get very creative. 
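The QLoRA idea, loading the base model in 4-bit and training only small low-rank adapters on top, looks roughly like this sketch; the checkpoint and hyperparameters are placeholders, and the exact arguments vary by library version.

# Rough sketch of 4-bit loading plus LoRA adapters with transformers and peft.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",           # placeholder checkpoint
    quantization_config=bnb,
    device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],  # illustrative choice
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only a tiny fraction of weights train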
01:14:47
The other thing that I think has been quite interesting is that the evaluation of these models, especially these chat models, is gradually growing in maturity. So, a lot of the early evaluation was done using something called the Vicuna benchmark. The idea here was: let’s get GPT-4 to write a bunch of questions, for example, you know, “how do I solve this coding puzzle?” And then you give that question to the models that you’re interested in rating, and then you get GPT-4 to act as a judge and compare which model is better than the other. And in the early days this, you know, showed, “oh, Vicuna is like 90% as good as ChatGPT” according to that benchmark. But most people who then interacted with Vicuna versus ChatGPT could see a fairly big capability gap. I mean, you can see that Vicuna can’t hold conversations over many turns effectively, whereas ChatGPT can do things like, for example, you just dump a stack trace into it and it will debug it for you, like, unprompted. 
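A toy sketch of that LLM-as-judge protocol is just a prompt template; how the judge model is actually called (for example, via the OpenAI API) is left out here, and the wording is purely illustrative.

# Build a pairwise-comparison prompt for a judge model.
JUDGE_TEMPLATE = """You are an impartial judge. Given a question and two
answers, say which answer is better and briefly explain why.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Verdict (A, B, or tie):"""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    return JUDGE_TEMPLATE.format(question=question,
                                 answer_a=answer_a,
                                 answer_b=answer_b)

# As Lewis notes, verbose answers tend to win such comparisons regardless of
# factual accuracy, which is exactly the failure mode being discussed.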
01:16:01
And so these models were lacking in certain areas. And the community has now kind of realized that a lot of these evaluations are often measuring style. So, basically, GPT-4 as a judge will often prefer outputs that are just very wordy, because, you know, ChatGPT is always a kind of wordy chatbot, rather than preferring outputs that are factually correct. And even humans fall for this. There’s a very nice paper from Berkeley where they showed that, you know, even human evaluators would get tricked by essentially ChatGPT. And I think that’s a general challenge today in the community: how do we know if the models are actually very good? And again, it’s something that I suspect OpenAI has cracked internally, but of course that’s their competitive advantage, right? So, the community is gonna have to make the innovation there. 
01:16:57
And yeah, I mean, we can talk about other things. Like, I think one thing that’s kind of been an open question is: do you even need reinforcement learning in the first place? You know, we have this kind of existence proof from OpenAI, but there are other researchers who are sort of skeptical that you truly need reinforcement learning, which has its own finicky problems. And it’s kind of exciting to think that, you know, we already have a few candidate alternatives on arXiv which, you know, may prove to be more efficient and also simpler ways to achieve the same objective. 
Jon Krohn: 01:17:36
Yeah, everyone’s always trying to squeeze out reinforcement learning. It’s like, a few years ago it was, oh, deep reinforcement learning is gonna be fundamental to artificial general intelligence. And then it’s kind of having this renaissance right now, where we’re like, okay, in order for these LLMs to be really well aligned with the responses we want, we’re gonna need reinforcement learning. Finally, it’s back, and then you’re like, no, actually we might not need it; we might be able to use simpler approaches. And yeah, it seems like a lot of these instruction-tuning approaches are just supervised learning; they don’t require any reinforcement learning. And yeah, I can personally vouch that we’re getting amazing results without reinforcement learning. So very cool. If people are listening out there that haven’t done open-source before and they want to get involved with it after they hear the kinds of cool things that you’re working on in particular, you know, they want to get involved with the Hugging Face Transformers library or some other library like PyTorch, how do you recommend they get started? 
Lewis Tunstall: 01:18:46
Yeah, that’s a really great question, and a common one that I often get. I would say there are a few different ways you can contribute; it depends a bit on your background. So, if you are already, like, a very proficient, say, PyTorch developer, then reading the Transformers source code is relatively, you know, straightforward. So, you can immediately, for example, pick up open issues on GitHub or look for open bugs that haven’t been tackled yet and work on those. But for people who are a bit more like myself, well, I started off being a non-coder. I was quite a late bloomer. I think I was like 28, 29 when I started learning how to code. 
Jon Krohn: 01:19:26
Oh really?
Lewis Tunstall: 01:19:26
Yeah. Yeah. I’m like very, very late. And for me, the thing was, I was looking at this stuff and I’m like, I have no idea how to contribute. And actually, starting off with just trying to read the documentation and improve the docs was often the sort of gateway drug to then actually writing code, because often when you’re trying to understand something, you realize, oh, there’s a gap in the way it’s explained. So, I would say those are the sort of two main routes. One is to go through the docs, which is more, kind of, high level, and the other one is to just pick up issues that are on GitHub. 
01:19:59
But the open-source landscape for machine learning is also more diverse than just code, right? So, some of the most impactful things that we’ve seen on the Hugging Face Hub have been from community members who, for example, created a translation of a popular dataset, or curated their own dataset, which turned out to be, you know, very useful. So an example of this, right, is the Alpaca dataset, which was the dataset that sort of launched this whole revolution in LLaMA instruction models. It was like three grad students at Stanford who basically used, you know, ChatGPT, or I think it was ChatGPT, to generate a dataset of instructions, and they trained a model on that. And, you know, it was like $300 and I think probably a few days’ work. So, there are different ways you can contribute. And the other one that maybe is worth mentioning is we often have a lot of events at Hugging Face. So, we have hackathons where people, for example, can get access to things like Google TPUs and train, you know, very cool projects. And so if you want to be part of the community itself, that’s another way of, you know, getting your hands dirty and seeing, you know, the excitement. 
Jon Krohn: 01:21:19
Very cool. Great tips for getting started in open-source: reading and improving the docs, picking up GitHub issues, and things like dataset curation. Very cool. All right, Lewis, so I actually had a ton more questions that I could have gone over with you, but I also want to get to some of the audience questions that we had. So, when I posted a week before recording that I was going to have you on the show, the post got an extreme amount of engagement: at the time of recording, over 36,000 impressions, almost 400 reactions, 23 comments, 13 reposts. It’s crazy. And some of these questions are really cool. All right, so the first question that I’m gonna go over is from Sangeetha. She’s an NLP Engineer, and she is interested in hearing your views on the notion of synthesizing a dataset using an enterprise model and then using it to fine-tune an open-source LLM. So, you and I, Lewis, we did talk about this earlier, but she mentions that your recent blog post on LLM evaluation was amazing. I wasn’t aware of that; I’m gonna have to make sure to include it in the show notes. And yeah, so I don’t know if there’s anything else that you want to add for Sangeetha on this concept, which we talked about a bit earlier, but you might have more to add for her. 
Lewis Tunstall: 01:22:45
Yeah, so the basic situation here is that getting a very good instruction dataset is quite a costly endeavor, because if you use humans, you need to get people to sit down and come up with creative ideas like, you know, “give me a recipe for pasta,” and then actually give the recipe, right? So, you have this kind of very arduous task, and you can pay companies to do that, but it will cost you quite a lot of money. So, the shortcut today that most people take is they say, well, let’s just try to derive this from GPT-4 or from ChatGPT. But as you mentioned earlier, OpenAI have this thing in their terms of service which says, you know, you can’t use the outputs from our models to train competitors to, you know, our stuff.
01:23:34
So, even though, and I’m not a lawyer, so of course don’t take this as legal advice, I’m not sure how enforceable terms of service really are; I mean, who knows, that has to be tested in court, I think. But of course, if you’re a company, you don’t want to go near that; I think that’s too high a risk to take today. So, the alternative I would suggest is to maybe see if some of the newer models like Falcon or LLaMA 65B can get you maybe half the way there. So, I’ve lately been prompting StarCoder, which is a code-generative model, and it’s quite okay at generating some of this synthetic data for coding applications. So I think we’re probably only a few months away from being able to do something analogous to that sort of ChatGPT-style generation using a permissively licensed model, and then those issues will no longer be with us. 
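A hedged sketch of that kind of synthetic data generation with a permissively licensed model might look like the following; the checkpoint, prompt, and sampling settings are placeholders, and output quality will vary a lot by model.

# Generate candidate synthetic instruction data with an open code model.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="bigcode/starcoderbase-1b")  # placeholder checkpoint

seed_prompt = ("Write a short programming exercise about Python lists, "
               "followed by a correct solution.\n\nExercise:")
sample = generator(seed_prompt, max_new_tokens=200,
                   do_sample=True, temperature=0.8)
print(sample[0]["generated_text"])
# Repeating this over many seed prompts and then filtering the outputs yields
# a rough synthetic instruction dataset free of restrictive terms of service.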
Jon Krohn: 01:24:39
Nice. Great answer. Thank you for elaborating some more for us, Lewis, on that point. The next question here comes from Murilo Gustineli, who is a data scientist at a firm called Insight. And this also builds upon some of our conversation; it seems like you and I hit on a lot of the topics that are most interesting to the audience at large, because this again will build upon something that we already talked about. So, Murilo points out how Hugging Face has undoubtedly played a significant role in lowering barriers to open-source ML. With the emergence of LLMs and the increasing complexity and cost of deep learning models, how critical do you believe the continued democratization of ML models will be in the near future? So, yeah, what are the challenges and opportunities associated with better democratization of ML models, particularly these really large ones? You already made a great point shortly before we started tackling these audience questions on techniques like QLoRA, Quantized Low-Rank Adaptation. I’m not sure if you have anything else that you’d like to add. 
Lewis Tunstall: 01:25:59
Yeah, I think probably for me personally, the biggest reason to try and make sure that we still have open models is that, as we’ve seen in the last few months, the community, or the collective intelligence of humanity, is able to learn and discover a wide range of impressive, cool things. So, QLoRA is one, but also this whole thing about evaluation and trying to deeply understand, you know, how these language models actually work. A lot of this would be much, much harder if we only had an API from, you know, a small number of companies to work with. So, I think from a purely scientific perspective it’s really important that we’re able to continue making such models and releasing them. 
01:26:51
But there are of course several risks, and one of the main risks I see is that at some point we will have a model that is, you know, fairly capable, and someone’s gonna do something bad with it. I think that’s a bit of an inevitability, and bad could be, you know, some large misinformation campaign or even worse. And then the question will be, you know, who bears responsibility for that? Is it the organization that open-sourced the model? Is it the company that hosts the model? Is it the individual? And I feel like society doesn’t quite have the mental model for dealing with that yet. And so what I suspect will probably happen is that techniques like RLHF will become progressively more important in the release of new open models, because if you can make some level of guarantee that, okay, this model has, you know, some guardrails, then you’re at least able to partially limit the downstream risks. 
01:28:01
But I mean, as you’ve probably heard, this is a super big topic in legal and political circles; there are the Congress hearings and the EU AI Act. And so I think all of this stuff is really being, you know, negotiated at the societal level. But I feel fundamentally that we would want to have a future with open models, because it reminds me of, you know, a lot of parallels to science. Like, when I used to be a physicist, there were eras in the Cold War where people didn’t share any information, and that’s kind of to the detriment of humanity. We’ll see how it plays out, but yeah, exciting times either way. 
Jon Krohn: 01:28:46
Yeah, exciting times for sure. And there certainly are risks too. There are pros and cons to these two different schools of thought. On one side: should companies like OpenAI be keeping their secrets to themselves, where, you know, maybe some government auditing body gets to access what they’re doing, but we don’t want any actor to be able to have access to a system that approaches artificial general intelligence, or even systems like we have today that could be used for misinformation. And then, yeah, in the other camp, it’s this idea that if everything’s open-source, then you can really get in there and understand exactly what’s going on. Yeah, it’s complex, but it’s nice to see that, unlike some of the other issues that we’ve had in recent history with digital platforms, things like social media feeds polarizing politics, which isn’t something that politicians got ahead of, and I think they’re realizing the mistake there, people are trying to make sure that the same kind of issue doesn’t show up. Well, I mean, some issues are going to show up, but there are a lot of people in government, in commerce, and in the open-source community who are trying to get ahead of these issues that we have in AI. 
01:30:06
And so I’m personally optimistic that, while it is inevitable that some bad things will happen, the worst things hopefully won’t, and even some of the bad things will be mitigated. So very cool. There were lots of other questions, but I’ve gone over all of them now, and it seems like we tackled all of them, you know, or the main points of these questions, throughout our episode- 
Lewis Tunstall: 01:30:35
We’ve answered it. 
Jon Krohn: 01:30:36
And then, so I’ve got to apologize to our listeners. I promised someone on LinkedIn, just yesterday at the time of recording, there’s a listener named Jonathan Baun out there, and he said, I just got to the end of your most recent episode, and at the end of that episode you mentioned a book giveaway, and he was like, I guess I’m too late. Because the way that I run these book giveaways is when the episode comes out, so your episode, Lewis, will be out on a Tuesday morning New York time, I’ll make a post around 8:00 AM New York time announcing the episode, and I’ll say the first five people that respond that they’d like a copy of Lewis’s book will get a copy, generously provided by O’Reilly. So that is happening again. And I’m supposed to mention it at the beginning of episodes, obviously, so that the eager listeners … anyway, I promised Jonathan Baun, he was like, ah, I just got to the end of this episode and you have this O’Reilly deal, I got to your post, but I’m a week late, so I’m sure all the books are given away. And in fact, they are. And so, apologies again to those of you who are hearing this too late. 
Lewis Tunstall: 01:32:10
On the other hand, it rewards the person who listens, right? 
Jon Krohn: 01:32:14
Yeah. Who listens right to the end, right when the episode comes out. It does reward that behavior. So yes, there’s a book giveaway again. Thank you very much to O’Reilly for offering it. So yeah, the five people that write on my LinkedIn post announcing Lewis’s episode, and I’ll mention it in that post as well, will get a book, and it’s a fantastic book. I’ve got my copy here, and it’s been invaluable to me as my company Nebula has forayed more and more into generative AI, particularly with open-source approaches. So, thank you, Lewis, and thank you, Hugging Face, for everything that you’ve done for us. Now, Lewis, before I let you go, I always ask for a book recommendation other than your own book. Do you have one for us? 
Lewis Tunstall: 01:33:09
Yeah, so I’ve been reading, and this has nothing to do with, you know, transformers or machine learning, a book called The Making of the Atomic Bomb. And- 
Jon Krohn: 01:33:19
Maybe there’s more in common than you think. 
Lewis Tunstall: 01:33:21
Well, in fact, it was recommended to me by a friend who was saying, “Hey, you know, I’ve been thinking about existential risks and stuff, and there are some parallels.” And what’s really interesting in the book, which is a really in-depth history running from basically pre-World War I all the way through to the bomb, is the extreme amount of government-level coordination that was required, first of all to build the technology, but then later to figure out how to regulate it. And I think the encouraging part is that, at least, we managed to more or less survive that and figure out how to live in a world with very, very, you know, scary weapons. And so I’m still optimistic, like you said, that, you know, we’ll find a way through the next few years of AI development. I think the book is very nice, and if you like, you know, technical physics and stuff, it’s got a lot of that in there too. So, I recommend it. 
Jon Krohn: 01:34:28
Nice. Cool recommendation. Lewis, thank you so much for being generous with your time today. I know we’ve run over the allocated recording slot; I really appreciate it. It’s been a fascinating episode and I’ve learned a ton, and I’m sure our audience has as well. Lewis, before I let you go, how can people follow you after this episode if they would like to hear more from you? 
Lewis Tunstall: 01:34:52
Sure. So, these days I’m mostly on LinkedIn, so just look up Lewis Tunstall; you can see my face, and I don’t think there are too many of us on there. And until recently I used to be on Twitter as _lewtun, so Lewtun, but yeah, my phone broke and I got locked out, and unfortunately Elon seems to have fired all of the support staff, so I can’t get back in. But one day I’ll get back in, and then, you know, you might see me on Twitter again. 
Jon Krohn: 01:35:23
Nice. All right. Good luck. Maybe someone at Twitter is listening and can fix this situation; statistically speaking, there probably isn’t. Okay. All right. Lewis, thank you so much, awesome to have you on the show. Amazing to be able to go full circle with this amazing book of yours that I’m reading. And yeah, best of luck to you, and maybe we can catch up with you again someday in the future and hear how your journey’s coming along. 
Lewis Tunstall: 01:35:56
Excellent, Jon. It’s been a pleasure. 
Jon Krohn: 01:36:03
Boom, what a sensational guest who made for a sensational episode. In today’s episode, Lewis filled us in on how transformers are all you need for state-of-the-art NLP models, how we can efficiently label data using few-shot prompts to the APIs of cutting-edge models like GPT-4, how we can distill, quantize, and/or prune LLMs to make them affordable and fast in production, how RLHF uses human-labeled data to align LLM outputs with what users are hoping for, and how you can get involved in open-source yourself by improving GitHub documentation, resolving GitHub issues, or curating datasets. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Lewis’s social media profiles, as well as my own social media profiles, at www.superdatascience.com/695. That’s www.superdatascience.com/695. If you too would like to ask questions of future guests of the show, like several audience members did during today’s episode, then consider following me on LinkedIn or Twitter, as that’s where I post who upcoming guests are and ask you to provide your inquiries for them. And if you enjoyed this episode, nothing’s more valuable to me than if you take a few seconds to rate the show on your favorite podcasting app or give it a thumbs up on the SuperDataScience YouTube channel. And of course, if you have friends or colleagues that would love the show, let them know. 
01:37:17
All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another fantastic episode for us today. For enabling that super team to create this free podcast for you we are deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors’ links, which you can find in the show notes. And finally, thanks of course to you for listening. I’m so grateful to have you tuning in and I hope I can continue to make episodes you love for years and years to come. Well, until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 