SDS 687: Generative Deep Learning, with David Foster

Podcast Guest: David Foster

June 13, 2023

Nothing stays "latent" in this terminology-rich episode on generative AI! Learn the elements of generative AI, from autoencoders to latent space, and hear what data scientist David Foster has to say about the potential for generative music as well as his stance on the debate over AI's existential threat to humanity.

Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About David Foster
David Foster is a data scientist, entrepreneur, and educator specialising in AI applications within creative domains. He authored the acclaimed O'Reilly textbook Generative Deep Learning: Teaching Machines To Paint, Write, Compose and Play, with the second edition released in May 2023. As co-founder of Applied Data Science Partners (ADSP), he inspires and empowers organisations to harness the transformative power of data and AI. He holds an MA in Mathematics from Trinity College, Cambridge, and an MSc in Operational Research from the University of Warwick, and is a faculty member of the Machine Learning Institute, with a focus on the practical applications of AI and real-world problem-solving. His research interests include enhancing the transparency and interpretability of AI algorithms, and he has published literature on explainable machine learning within healthcare.
Overview
Most of machine learning's short history has concerned discriminative modeling. Its business applications are broad, turning everything from image classification to recommender systems into viable products. David Foster explains that discriminative modeling is suited to predicting labels, whereas generative AI is about creating diverse content. Classifying dog breeds or recommending books – the domain of discriminative modeling – is much easier than producing a convincing image of a dog (it stands to reason; this is as true for humans as it is for AI!). Nevertheless, David recommends that those interested in developing AI models start with discriminative modeling, because the fundamentals of generative AI still lie there.
Music is a recurrent theme for Jon and David throughout the show—both host and guest are musicians, and they discuss the proliferation of generative music on Spotify in recent months. Despite this stream of content, music generation still has a long way to go to catch up with the vast text-generation capabilities of chatbots like ChatGPT. Creating music is a complex problem, not least because, on top of generating a single stream of notes, you must keep track of multiple compositional elements (such as pitch and duration). Nevertheless, there are ways forward. David suggests applying the same principles used in text modeling, such as coding the duration as its own "token" (a term more commonly used for a unit of text, but which here denotes a musical element).
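To make the "token" idea concrete, here is a minimal Python sketch – our own illustration, not code from David or his book – of how a melody might be serialized into interleaved pitch and duration tokens, the same way text models serialize words:

```python
# A toy melody: each note is a (pitch, duration-in-beats) pair.
melody = [("C4", 1.0), ("E4", 0.5), ("G4", 0.5), ("C5", 2.0)]

# Interleave pitch tokens and duration tokens into one stream,
# so a sequence model can treat music much like it treats text.
tokens = []
for pitch, duration in melody:
    tokens.append(f"NOTE_{pitch}")
    tokens.append(f"DUR_{duration}")

print(tokens)
# ['NOTE_C4', 'DUR_1.0', 'NOTE_E4', 'DUR_0.5',
#  'NOTE_G4', 'DUR_0.5', 'NOTE_C5', 'DUR_2.0']
```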
This interview is a mine of information for anyone who wants to get to grips with the terminology of generative AI, from (variational) autoencoders, decoders and reward functions to diffusion, CLIP models and latent space. We won't take the words out of David's mouth by defining them all in the show notes, but the term "transformer" is worth explaining here with respect to the conversation about music tokens. Let's say an AI tool is asked to complete a composition. Without a transformer, a neural network uses its latent knowledge of prior tokens (here, musical elements) to reach the end of the musical section. What a transformer architecture does is "pay attention" to the tokens it deems most important in order to decide where to focus its efforts in generating content, thus helping the score maintain its compositional integrity. (As an aside, David explains all this in the show with a vibrant example involving pink elephants!)
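For the technically curious, the "paying attention" described above boils down to scaled dot-product attention. Below is a minimal NumPy sketch – our own illustration with toy dimensions, not code from the episode or the book:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Compare each token's query against every token's key...
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # ...turn the scores into attention weights with a softmax...
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # ...and mix the value vectors according to those weights.
    return weights @ V

# Four tokens (say, four musical events), each an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape)  # (4, 8)
```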
David also weighs in on the debate surrounding the potential for misusing AI, as well as the existential threat that AI could pose to humanity. He says that his concerns lie in the immediate future and the proliferation of deepfakes. The ability to deceive, and the ease of spreading that deception on social media channels, is dangerous for democracy, and David believes education will be necessary to ensure that social media users take care over what they trust and (re)share. He adds that strict regulation may not be the best way forward, noting that it can end up stifling innovation; he would rather see attribution and detection methods developed to ensure AI-generated content is monitored responsibly.
Much of what David discusses with Jon is also detailed in the second edition of his book, Generative Deep Learning: Teaching Machines To Paint, Write, Compose and Play, which comes highly recommended by SuperDataScience (and listeners will get a discount code for the book)!
In this episode you will learn: 
  • Generative modeling vs discriminative modeling [04:21]
  • Generative AI for Music [13:12]
  • On the threats of AI [23:15]
  • Autoencoders Explained [38:36]
  • Noise in Generative AI [48:11]
  • What CLIP models are (Contrastive Language-Image Pre-training) [54:07]
  • What World Models are [1:00:40]
  • What a Transformer is [1:11:14]
  • How to use transformers for music generation [1:19:50]
Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 687 with David Foster, author of the book, Generative Deep Learning. Today’s episode is brought to you by Posit, the open-source data science company, by Anaconda, the world’s most popular Python distribution, and by WithFeeling.ai, the company bringing humanity into AI. 
00:00:22
Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:00:54
Welcome back to the SuperDataScience podcast. Today I'm joined by the brilliant and eloquent author David Foster. David wrote the O'Reilly book called "Generative Deep Learning". The first edition from back in 2019 was a bestseller, while the immaculate second edition, which was released just last week, is poised to be an even bigger hit. He's also a founding partner of Applied Data Science Partners, a London-based consultancy specializing in end-to-end data science solutions. He holds a Master's in Mathematics from the University of Cambridge and a Master's in Management Science and Operational Research from the University of Warwick, both in the UK. Today's episode is deep in the weeds on generative deep learning pretty much from beginning to end, and so will appeal most to technical practitioners like data scientists and machine learning engineers. In the episode, David details how generative modeling is different from the discriminative modeling that dominated machine learning until just the past few months. 
00:01:50
He talks about the range of application areas of generative AI, how autoencoders work, and why variational autoencoders are particularly effective for generating content. He talks about what diffusion models are and how latent diffusion in particular results in photorealistic images and video. He tells us what contrastive learning is, why world models might be the most transformative concept in AI today, and lots on transformers: what transformers are, how variants of them empower different classes of generative models such as BERT architectures and GPT architectures, and how blending generative adversarial networks with transformers supercharges multimodal models. All right, you ready for this profoundly interesting episode? Let's go.
00:02:36
David, welcome to the SuperDataScience podcast. It’s great to have you here. I understand that you’re actually a listener of the show.
David Foster: 00:02:44
Yeah, massive longtime listener. Thanks for having me on, Jon. Really appreciate it. And really looking forward to getting into an in-depth conversation with you about generative AI. Pleased to be here. 
Jon Krohn: 00:02:53
Yeah, I'm glad that you reached out to us about having an episode because you have this amazing book that just came out. It's really exceptional. Like, I wish somehow I could have written this book. It's so timely and so comprehensive around generative AI models, which are obviously the hottest thing right now in the world. Like, there's nothing that people are talking about more than generative AI. Whether or not they call it that, when people are talking about platforms like ChatGPT or Midjourney, they are talking about generative AI. And so, I was delighted that you reached out as a listener to be on the show, you're like a celebrity listener out there. Thanks, David. Where are you calling in from today? 
David Foster: 00:03:35
I'm based in London. This is our office here in London in Old Street. Yeah, it's actually sunny here in the UK, which is a first; finally, summer's dawning on us. So, yeah, I'm- 
Jon Krohn: 00:03:44
Nice. 
David Foster: 00:03:44
Excited to be talking. 
Jon Krohn: 00:03:46
All right, well, let's rock and roll and get right into the content that we have planned for you. There's so much to cover today because I know I'm gonna learn a ton filming this episode. And no doubt our listeners are going to learn a lot about generative AI as well. So, you just released the second edition of your popular book. It's called "Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play." The first edition came out in 2019. It did very well, and I know that this one is gonna be huge. Can you explain the differences between generative modeling, which is the focus of your book, and discriminative modeling, which up until recently was probably the much more common type of machine learning? 
David Foster: 00:04:37
Yeah, you're absolutely right. It was, and I think the reason for that is, first of all, it's just a lot easier than generative AI. If you look back at the history of machine learning, the field has been driven by discriminative modeling, primarily because, first of all, it's really useful in business. It's really useful in a ton of applications where you've got a labeled data set and there's a very clear outcome that you want to drive: you want to drive predictive accuracy against that label. With generative AI, first of all, the application isn't perhaps as clear, or at least it wasn't when the field was in its infancy. But also, secondly, it's really difficult to determine how well you're doing, right? Because it's kind of subjective as to whether a piece of text or a piece of art is good.
00:05:19
There's no such label that you can use to determine that. So, in terms of the differences, discriminative modeling is all about being able to predict a specific label that you are given about an input. And typically you are moving from something that is high dimensional, like an image or a block of text or some structured data, for example, through to something that's low dimensional, like a label or a continuous variable, maybe a house price or something like this. Now, generative AI moves in the other direction. It's saying, can we start with the label and move back to the data? And so, it really focuses on whether the model has understood what patterns are present in this data set, so that it not only can do something like collapse the dimensionality from an image to a label, but it can say: here's a label – dog, cat, boat, ship – go and find me the data that would produce this label, i.e. produce me an image that looks like a ship.
00:06:20
And why is this difficult? Well, the reason is because when you're moving to this higher dimensionality space of, say, pixels or word tokens, there's a lot more that can go wrong. The human eye is very, very good at detecting something in an image that looks off, or something in a paragraph that just grammatically doesn't make sense. And so, we really have to try hard when we're building models like the ones that we've seen, such as GPT or the diffusion models that I'm sure we'll come onto later, to make them good enough to be plausible. And so, it's like finding a needle in a haystack, right? To find that one image of a boat that looks real.
00:07:01
We are working in maybe like a thousand-dimensional space, whereas when we are collapsing stuff down in terms of discriminative modeling, we've got to collapse to maybe just one dimension, and that's a lot easier. So, yeah, I would encourage anyone who's getting started with machine learning to start with discriminative modeling, because even though generative AI is the hype, you've got to know the fundamentals. And a lot of the techniques that come up in generative AI are still fundamentally based in good old-fashioned discriminative modeling, but they often have within them a slant that means you are predicting something in a higher dimensional space – you're still using the same concepts, like loss functions, like modeling a density function, for example. And so, discriminative modeling will give you that basis. If you just want to get started, start there. But you can move pretty swiftly onto generative AI, which is the current hype. 
Jon Krohn: 00:07:55
Yeah. And speaking of swiftly, I mentioned how your first edition came out in 2019, which is just four years ago. The field has changed dramatically since. So, yeah, like, run down for us how different the content is from your first edition to the second edition that’s newly released. 
David Foster: 00:08:14
It is a totally new book. I've got to be honest with you, Jon. Like, I sat down with the publisher and they said, do you want to write a second edition? And this was basically this time last year, so maybe slightly earlier. So, this was before DALL·E 2, it was before anything with Stability. And I kind of sat down and I thought, yeah, actually I think this is about the right time to write a second edition. Like, a lot has changed, but ultimately, I thought, I can move some chapters around, I can update, refresh the examples, refresh the content. And the moment I signed that contract to say I'd write the second edition, it all went nuts. DALL·E 2 was released, and then suddenly there was just this explosion of large language models and text-to-image models, which is, first of all, incredibly terrifying if you've just agreed to write a second edition. 
00:09:03
And I realized through the writing process that I needed to completely rewrite the whole book. So, there is so little content that is the same – I think basically none of it is the same. It's a new book, effectively. And I'm proud of that because it means it's current, it means it's up to date. And I can honestly say I'm really proud of it. It's something I think takes you from beginner through to understanding the entire landscape of generative AI models. It doesn't just focus on one model type or whatever's currently in vogue. It tries to take you on the journey from laying down the fundamentals and the foundations through to, okay, now let's talk about Stability and Stable Diffusion or DALL·E 2 or Midjourney – let's really get to grips with what these models are doing. And obviously GPT and the OpenAI series. So, yeah, I'm really proud of it, and I feel privileged to be in the position where I can write this book. Hopefully lots of people will get a lot out of it. And I'm really excited to see it in the market. 
Jon Krohn: 00:10:09
I wouldn't be surprised if this edition became like a standard in the field, based on what's covered in here and how well you covered it – it's so comprehensive. And the kind of praise that you got on the outside of the book kind of backs me up. You've got François Chollet, the creator of Keras, writing about how great he thinks the book is. You've got the head of strategy at Stability AI, the company behind Stable Diffusion. You've got senior people at Microsoft Azure. You've got people from EleutherAI – in recent Five-Minute-Friday episodes I've been talking a lot about the open-source large language models that Eleuther have made available. You've got Aishwarya Srinivasan, who is this extremely famous content creator who works at Google Cloud. I mean, yeah. I'm just kind of backing myself up quantitatively now. So, yeah, I think your book has a lot going for it. You were saying something – I just spoke over you. 
David Foster: 00:11:11
No, yeah. I feel really privileged that these people have taken the time to leave a review and to actually read the book and say, you know, that they think it's a useful addition to the library. I think when I look back, I'm standing on the shoulders of these giants, really. I mean, I'm just reporting on their incredible work. So, you know, I wouldn't be able to write this book without what they've done. Particularly someone like François Chollet, who basically created the library that I'm using throughout the book to build practical examples of generative models. So, yeah, I really do feel privileged. 
Jon Krohn: 00:11:43
And then you used open-source LLMs from Eleuther to just write all the content.
David Foster: 00:11:48
Oh man, I'll tell you what, yeah. I missed the ChatGPT stuff [crosstalk 00:11:52] by a year. Like, if I was starting to write the book now, maybe it'd be a bit easier.
Jon Krohn: 00:11:58
Yeah, well, I joke. But, you know, there was actually a really interesting discussion on the Last Week in AI podcast, which is hosted by Jeremie Harris and- 
David Foster: 00:12:09
Yeah, I know that one. Yeah. 
Jon Krohn: 00:12:10
Yeah. And Andre – I can't remember his last name – but Jeremie's been a guest on the show a number of times. And they were recently talking about how, for online content that's like listicles, BuzzFeed-type stuff, that's very easy to automate. But a New York Times journalist, where you have to be doing investigative reporting – you could be working on one story for months, really digging into things and interviewing people – that kind of job isn't going to go away. You can be augmented a bit: it can help with making sure that you're doing everything grammatically correctly, and suggest some small parts of what you're doing. But with a book like yours, that is so technical, so advanced, so cutting edge, while these tools could absolutely be augmenting your writing, they can't actually be generating all the content. Not yet. 
David Foster: 00:13:09
Yeah, exactly. 
Jon Krohn: 00:13:11
Yeah. So, language generation – text as well as audio – these are some of the examples of generative AI. Images you talked about, with DALL·E 2. What other application areas are there? 
David Foster: 00:13:25
Yeah, we cover lots in the book. So, for example, music is a field that I find particularly fascinating. I'm a musician myself – I can see you've got your guitar there in the background on the YouTube video. I'm really surprised, actually, that music generation hasn't really taken off in the same way that language generation has, because in many ways you'd think it's perhaps a little bit easier: there are so many genres of music, and we've just got to arrange these audio waves in such a way that's pleasant to the ear, whereas words have a grammatical structure and there are very strict rules about what we want to see. But, you know, I sort of think to myself, why is that? And I wonder if it's in part because of the lack of data that's available. There's obviously a ton of text data available on the web; it's perhaps not as easy to find music data in such quantity. Perhaps it's also driven in part by necessity – large language models are also extremely useful. So, yeah. We cover it in the book though. So, music generation. 
Jon Krohn: 00:14:22
So, I'll just quickly interrupt you on the music thing. I think that you're absolutely right. I don't think it's the [inaudible 00:14:28] of data – although there is obviously a lot of language data out there, there is a lot of music data as well. I think that you hit on it right at the end there, which is that very few people are employed in creating music, but for almost all white-collar workers, the lingua franca – the medium that we take in as well as output – is text. And this became even more obvious through the pandemic, when you saw so many jobs could be done remotely, where it's just emails and Slack messages. And so, it's text in, text out for a lot of what we do. 
00:15:08
So, I think that's why it's something that's talked about more, but it is interesting. There has been an explosion in generative music. Spotify apparently has a hundred thousand tracks uploaded to it every day. A hundred thousand tracks a day. And almost all of that is AI-generated music. And the reason that happens – because you think, well, what's the point? Why are people wasting server time uploading that? – is that they also have bots that listen to those fake tracks, which brings in royalties for these people [crosstalk 00:15:45]. But Spotify's starting to crack down on that. Anyway. 
David Foster: 00:15:49
Yeah, I can imagine. I think it's interesting to see where this goes, because I'm acquainted, I guess, with the VP of Audio at Stability AI, and he is first and foremost a composer, so he's not someone coming at this from a machine learning perspective. He's someone who's come at this as a composer, so he cares deeply about the rights and the authenticity of the music that's being generated, but he sees the potential for a different kind of music that we'll be listening to in future. So, yeah, it's been exciting to see how platforms like Spotify jump on the bandwagon here.
Jon Krohn: 00:16:26
Absolutely. 
00:17:01
This episode is brought to you by Posit: the open-source data science company. Posit makes the best tools for data scientists who love open source. Period. No matter which language they prefer. Posit’s popular RStudio IDE and enterprise products, like Posit Workbench, Connect, and Package Manager, help individuals, teams, and organizations scale R & Python development easily and securely. Produce higher-quality analysis faster with great data science tools. Visit Posit.co—that’s P-O-S-I-T dot co—to learn more.
00:17:01
All right. So, I interrupted you a while ago when you were going to transition away from music to another application area for generative AI. 
David Foster: 00:17:12
Yeah, sure. So, we cover music in the book, but also other modalities, especially cross-modalities. So, you know, we're talking about things like text-to-image, and also, interestingly, things like text-to-code, which I guess is another kind of language model, but a very specific type of language model. But also, in the final chapters, how reinforcement learning plays a part when we're talking about things like world models, where there's a generative model at the heart of the agent, which is simply trying to understand how its environment evolves over time. And then layered onto that is the ability for the agent to use this generative model to understand what its future might look like, and therefore hallucinate different trajectories through its action space. So, yeah. We might come onto this in a bit more detail later. 
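As a taste of the world-model idea David sketches here – an agent "dreaming" trajectories through a learned model of its environment – here is a heavily simplified sketch. Every component below is a stand-in of our own invention; in a real world model the dynamics would be a trained neural network:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "learned" dynamics: next_state = f(state, action) + noise.
W_s = rng.normal(scale=0.1, size=(4, 4))  # state transition weights
W_a = rng.normal(scale=0.1, size=(2, 4))  # action influence weights

def dream_step(state, action):
    # One imagined step of environment evolution.
    return state + state @ W_s + action @ W_a + rng.normal(scale=0.01, size=4)

def rollout(state, policy, horizon=10):
    # Hallucinate a trajectory without ever touching the real environment.
    trajectory = [state]
    for _ in range(horizon):
        state = dream_step(state, policy(state))
        trajectory.append(state)
    return trajectory

random_policy = lambda s: rng.normal(size=2)
imagined = rollout(np.zeros(4), random_policy)
print(len(imagined), "imagined states")  # 11 imagined states
```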
Jon Krohn: 00:17:57
Oh yeah, we will be. 
David Foster: 00:17:58
It’s safe to say, yeah, we, we got it all covered in the book. 
Jon Krohn: 00:18:00
Awesome. Yeah, there are so many exciting topics to cover in this episode. So, application areas that I've now, I think, jotted down relatively comprehensively: you've got text generation, voice, music, images, video, code, multimodal models – tons of different areas. Really exciting times. So, in what way do density functions serve to distinguish these different generative AI techniques from each other? 
David Foster: 00:18:30
Yeah, that's a great question. So, let me just briefly talk about how we cover this in the book. The first section of the book is what we call methods. And this is where I'm laying out the six fundamental families of generative AI model. The second half is based on applications – like, what can you do with them? Now, the six families of model are differentiated by how they handle the density function. So, let me give you an example. The first split that you can make is between those that implicitly model the density function and those that explicitly model it. And what I mean by that is: imagine the density function is basically like a landscape over which you are trying to move to find images that are more likely to be real than others. 
00:19:18
And the images that are most likely to be real are, say, at the bottom of the valleys, and the least likely to be real are at the top of the mountains. So, you're always trying to move downhill in this model. And you are trying to come up with a landscape that truly reflects how real images are produced. So, we are kind of postulating that this landscape really does exist and that we need to find a model – an abstraction, if you like, of reality – that captures the true nature of this. If you imagine the different dimensions of this landscape are the pixels in an image, then there are some configurations of pixels that are in the valleys, i.e. they produce very realistic images, and there are some configurations of pixels that are on the mountains, and they aren't very realistic. So, the question always becomes, firstly, how do you model this landscape? What does it actually look like in this really high dimensional space? And secondly, how do we navigate it? How do we move downhill to find images that look real? So, implicitly you can model this with something like a GAN, where you don't actually write down an equation of what this model looks like, but you play a game between what's known as the generator and the discriminator. And the generator- 
Jon Krohn: 00:20:29
To quickly jump in for audience members that don't know that term: GAN is Generative Adversarial Network. 
David Foster: 00:20:34
Yeah, exactly. Generative Adversarial Network – GAN. And you're basically playing a game here between the generator, that's trying to create images that look real, and the discriminator, that's trying to pick between those that are real and those that are not. And so, at no point in that process do you write down an equation that says, yeah, this is what I believe the density function to be. You're implicitly modeling it through this game. And that is in contrast to pretty much every other kind of model, which does in some way try to create this density function, which we usually call p(x). So, in this other set of models, there are different ways of dividing it up. And one of the ways, for example, is to say, okay, we can approximate it in some way. We're not gonna try and find it perfectly, but we're gonna approximate it.
00:21:21
So, variational autoencoders do this, and some other model types as well. On the other side, you can also find models that try to model it really explicitly, such as your autoregressive models, where you basically place some constraints on how the generation is produced. So, autoregressive models always look to produce one sequence step ahead. Something like GPT is a good example of this, where you're just predicting the next word or token at a time. And you can write down an equation that says, this is what I believe the landscape to look like, because I'm restricting it to just predicting the next word. The equation would be huge if you wrote it out, but you can write down what that looks like. 
00:22:04
And then you've got some other types, like normalizing flows, where you enact a change of variables on the landscape and you try to morph the landscape into something that is easier to sample from. You've got energy-based models, which are the fundamental root of diffusion models, which again we can talk about later. Again, this is basically saying, how can I come up with a function that tells me how to move downhill in this landscape? And then, yeah, I think that covers it. That's our six kinds of model. So, they all try to model this density function slightly differently. But ultimately, a fundamental part of generative AI is understanding what we mean by a density function. And we cover that in the first chapter of the book. 
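To ground the "explicit density" idea, here is a toy sketch – ours, not the book's – of how an autoregressive model writes p(x) down as a product of next-token probabilities, exactly the constraint David describes for models like GPT:

```python
import math

# A hand-written toy "model" of next-token probabilities;
# GPT learns tables like this implicitly with a transformer.
p_next = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"boat": 0.5, "dog": 0.5},
    ("the", "boat"): {"floats": 0.9, "barks": 0.1},
}

def log_density(tokens):
    # p(x) = p(x1) * p(x2 | x1) * p(x3 | x1, x2) * ...
    logp = 0.0
    for i, tok in enumerate(tokens):
        logp += math.log(p_next[tuple(tokens[:i])][tok])
    return logp

print(math.exp(log_density(["the", "boat", "floats"])))  # 0.6*0.5*0.9 = 0.27
```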
Jon Krohn: 00:22:43
Sweet. Which is why we've kicked off with that here. So, something that's happened recently at the time of recording is that Geoff Hinton, who is perhaps the single most important person in the history of deep learning – and deep learning is essential to all of these generative techniques that you've just been describing. Indeed, your book is called Generative Deep Learning. I'm not really aware of contemporary generative approaches that don't use deep learning. 
David Foster: 00:23:11
Correct, yeah. It would be pretty much nil. 
Jon Krohn: 00:23:13
So, Geoff Hinton is sometimes called the Godfather of Artificial Intelligence, but probably more accurately the Godfather of Deep Learning. And he won the Turing Award with Yoshua Bengio and Yann LeCun, so this is like the equivalent of a Nobel Prize for computer science. And he was at Google for a very long time. He recently left, at the time of recording this at least. And he cited significant concerns about the misuse of generative AI as the key reason for him leaving. He wanted to be able to express himself more clearly. He's actually clarified since that he doesn't think Google is doing a bad job, but that there are pressing concerns here and that he needs to be able to speak freely about them. So, do you agree with Geoff Hinton? What do you think about this whole situation? 
David Foster: 00:24:09
Okay. There's a few things to unpack here. So, first of all, I massively respect Geoff Hinton's work. I think a lot of us wouldn't be doing what we're doing without his fundamental breakthroughs in the field around things like backpropagation, obviously, in the early days of deep learning. So, yeah, it's worth listening to what he says, first of all, because I think he's got valid points and he puts them across very eloquently in his interviews. First of all, I would say it's important to note here that the difficult position to take in this is that we're gonna be fine. And the reason I say that is because it's very hard to prove somebody wrong who says AI is an existential threat, because if it hasn't happened yet, then they can just say, well, it hasn't happened yet, but it will happen. 
00:24:51
So, you're kind of always in this position of, well, how do I show that I don't agree with this argument? How do I show that I don't think it's an existential threat – that we can put things in place to prevent the threat from happening, or that it's just not a viable threat in the first place? So, you've got to, first of all, think really hard if you're gonna come out and say, I think AI isn't an existential threat. And I have been doing a lot of thinking about this, you know, listening to arguments on both sides, and I think there are hugely valid points to be made, but ultimately I've come down on the side of not thinking it's as great a threat as perhaps the likes of Geoff Hinton are putting out there.
00:25:29
And I think one of the criticisms I perhaps would make of the argument that it is, is that I don't like the idea of just waving the hands and saying that the AI will want to take control. I think there's a huge leap here from saying that we have large language models which now predict the next word very, very accurately, and of course can be chained with tools and all of those things, to then saying that those same language models will have wants and desires and long-term aspirations to achieve a particular goal. I really don't think that a model which is ultimately interpolative will do that. These models, whilst they look as if they're doing very, very clever extrapolation, I believe are ultimately still confined by the dataset that they are trained on. 
00:26:19
I don't think – and, you know, I might be wrong about this – but I just don't think that they're gonna have the capacity to want to eliminate us. And that is ultimately what he's saying. And to be clear, you know, this is very separate from saying bad people will do bad things with AI. And I think they will, there's no question. I mean, we see it with every technology: bad people, if they want to do bad things with the technology, they will. And I agree with him there, that we need to be extraordinarily cautious that we don't let that happen and put the regulation in place to ensure that it doesn't. But there's a huge leap to then say the AI itself is gonna want to dominate us just because it's more intelligent than us, or apparently more intelligent than us. 
00:26:58
I think we are downplaying our own capabilities here. The example I would make as a counter-example, perhaps, is: if you trained a large language model on all scientific data, or just all data, up until say 1910, would it come up with general relativity? And I just don't think it could. I don't think it can make that extrapolative leap that says, given the data I have available to me at the time, I can run this thought experiment myself – and want to run the thought experiment – to come up with something as profound as relativity. I can't see that happening. And therefore, it leads me to the conclusion that we've got something worth fighting for against this AI. And we shouldn't just lay down and say, yep, we're now on the path to existential annihilation because we've built something that can predict the next word very, very well. I'm optimistic, basically. 
Jon Krohn: 00:28:34
Did you know that Anaconda is the world’s most popular platform for developing and deploying secure Python solutions faster? Anaconda’s solutions enable practitioners and institutions around the world to securely harness the power of open source. And their cloud platform is a place where you can learn and share within the Python community. Master your Python skills with on-demand courses, cloud-hosted notebooks, webinars and so much more! See why over 35 million users trust Anaconda by heading to www.superdatascience.com/anaconda — you’ll find the page pre-populated with our special code “SDS” so you’ll get your first 30 days free. Yep, that’s 30 days of free Python training at www.superdatascience.com/anaconda. 
00:28:37
Yeah, there's a lot of different ways we could go with this. I mean, we could literally spend this entire episode talking about this stuff, but we have a lot of technical stuff that I'd like to get into with the generative AI that you specialize in. So, I'm not gonna drag this out too long, but there are interesting things here. So, yes, today models like GPT-4 are predicting the next word. They're not in and of themselves a risk. But, you know, we have tools like AutoGPT that were built on it, where AutoGPT could potentially be given a large amount of resources, including a lot of its own GPT-4 agents, and we could give that AutoGPT a broad task, like: here's a million dollars, increase the amount of money. 
00:29:22
And one person might say, increase the amount of money, but also don't break any laws. Whereas another person might not give that qualifier. Or even without breaking any laws, it might figure out a way that takes advantage of some people to generate more money in the bank account. And while AutoGPT today might not be too sinister, with how crazy things have become just in the last year – you talked about signing your book contract a year ago and the incredible progress that's happened over that year – if somebody had asked me a year ago whether I thought a model like GPT-4 could exist in our lifetimes, I might've said, I don't know. That's [crosstalk 00:30:09] So, I don't know. And even that huge innovation has come about through just scaling the same architecture, transformers. 
00:30:23
And so, you know, scaling that another 10 times or another 100 times before it gets prohibitively expensive to train – we can't do that many orders of magnitude before we're talking about like a hundred billion dollars to train a model. So, there are probably also going to be scientific breakthroughs beyond just the engineering breakthroughs that we're doing today on scaling bigger and bigger and bigger. So, anyway, I can get why people, including Geoff Hinton, are concerned about the existential risks. But tying more immediately into the kinds of concepts that are covered in your book, he also expresses concern about fake content and misinformation, which you alluded to there. That is the immediate risk: with the tools we have today, anybody who wants to misuse them can, and they can do things incredibly powerfully. 
00:31:18
You know, just tying lawyers up with – a specific example I read yesterday, I think in The Economist: they gave this example of how a NIMBYist – someone who doesn't want something built (so, Not In My Backyard – NIMBY) – could create a thousand-page proposal for a government official to read about why they don't want, you know, electrical wires visible from their back window. And a human then probably is gonna have to read that and respond to it. And that's not even really a misuse of the technology. But the scale at which we can now create language is going to cause problems. And so, it doesn't seem like you are too concerned about it. So, I guess, yeah, why aren't you that concerned about the immediate risks? Or do you already have in your mind ways that we can overcome these risks, perhaps with AI itself? 
David Foster: 00:32:24
Yeah, so, I would say it's the immediate risks that I'm slightly more worried about; it's the existential risk that is perhaps overplayed. The immediate risks of disinformation, and the ability for large language models to create a huge amount of noise in our world – whether that's creating work for people like, you know, the lawyers reading the document that you just mentioned, or just the fact that it might nullify the power of things like social media platforms if we can't really determine what's real and what's fake, as well as democracy itself, if we are now influenced by media content that isn't correct or isn't real – I think that is more of a risk. And the way I would like to see this handled is, first of all, education. I think we are gonna have to get used to a world where we need to be a lot more vigilant about what's real and what isn't real. 
00:33:12
I think we've been extraordinarily privileged, actually, to live through the start of the internet era being relatively free of fake content. And I think that has generated a huge amount of worth in things like, for example, programming, where before I would have had to go and buy the book on Python if I wanted to learn how to do something. Now I can just go online and I know I'm gonna find an article written by a human that tells me exactly how to do what I want to do. So, there's a huge amount of value that's been created by that. And I think that value is now being condensed into models such as GPT-4, which is gonna be even more powerful than me trawling through hours and hours of Stack Overflow content to find out how to do something in Pandas, which is what I usually end up doing. 
00:33:55
So, you know, on the one hand it's gonna actually improve efficiency like this. But also, like you say, I think we just need to be extraordinarily careful that we don't let this thing run away with itself. Humans are incredibly slow to react to new technologies like this. We often need some sort of event before we go, "Ah, yeah, we don't want that happening anymore." And nobody really knows what this event is gonna be. I was talking to an AI IP lawyer earlier, and she's along the same lines: it's very hard to get people to take notice or listen before something happens that makes us go, "Yeah, that's the thing we didn't want to happen." So, I think this is in line with some of Sam Altman's comments, and also Yann LeCun's comments, around how can we start legislating against something that we don't know. And you don't want to stifle innovation, you don't want to stifle research, just because you are worried that something might happen – otherwise you'd just legislate everything. So, look, I don't have all the answers, but I'm optimistic, and I would like to see more people optimistically trying to come up with solutions rather than just pointing out that there's an annihilation around the corner, which I just don't think is credible at the moment. 
Jon Krohn: 00:35:09
Yeah. And I think that AI itself can be used to solve a lot of these issues. So, Jeremie Harris, whom we've already talked about – he has a lot on his show Last Week in AI about ways that we can be mitigating some of these risks. And one thing that he talks about regularly, and that he talked about even on our show – we had him on for a GPT-4 risks episode, it's episode number 668 – is how we can be using AI to monitor AI, because it's much faster than us. So, we can't have people monitoring for slight aberrations, but we could train AI to be trying to keep it in line. So, that's the existential risk thing. That's, I guess, a leading approach today for how we deal with that. 
00:35:58
And even with the misinformation stuff, I mean, we can have misinformation detectors that are automated. And I'm usually pretty skeptical about the crypto hype and blockchain in general, but a real-life application of the blockchain, first brought to my attention by Sadie St. Lawrence, I believe in episode number 537 of this show, was that you can use the blockchain to verify that a document is real. So, if there's a source that you trust, like the New York Times or The Economist or whatever, then an image or a video can be tagged – I don't know the terminology very well, but you can verify on some blockchain that, okay, this actually really came from that trusted source. 
David Foster: 00:36:45
Yeah. Attribution is gonna be a critical thing that we have to care about going forward. And I think what's important is that we don't make it black and white – this is AI-generated, this is not AI-generated – because ultimately it is a gray zone. If I use an AI tool to generate the structure of a document, but then I fill in the blanks, or I extrapolate, I don't really want to have to label that as AI-generated, because ultimately it's had my eyes on it, I've overseen the process. It's a bit like if I use a tool like a spell checker: I don't have to declare, yes, I've spell-checked this document. I just put it out there, because it's had my eyes on it.
00:37:21
But what I think we need to label is anything that is AI-generated that has had no human eye look over it. And that's where I think we might need to start saying, okay, if this content has been produced and no human has had any part in the production of that content, I think people should know about that. And I think it's important that we can distinguish, or at least label in some way, any content that has gone out unverified, because that's where you might start to see the problems. And I go back to my example there of Stack Overflow. If there's content on there that is an answer to a question and it has been AI-generated, I kind of want to know, if I was reading it, to take it with a pinch of salt, because it might not be something that someone has actually produced. It might still be useful, but it's not human-generated; it's, say, AI-generated. 
Jon Krohn: 00:38:12
Nice. Yeah. So, there are risks, but we can mitigate most of the risks. And I think it's good that people are calling these risks to our attention. And so, hopefully we can get ahead of them to some extent, and the most audacious issues can be tackled upfront. So, let's move back to technical stuff. One of the fundamentals of generative AI is autoencoders. We talked about density functions earlier; let's talk about autoencoders. These are a really key concept in generative AI. So, there's this idea of encoding information. Let's take the example of a text-to-text model. This is like the ChatGPT experience: we provide text to the model, it encodes that text into something called a latent space, and then there's a decoder that takes that latent space information and converts it into new text – which, in this example that I'm giving, ChatGPT provides back out to us. So, encoding text, latent space representation, decoding – I think altogether this describes an autoencoder. So, yeah, maybe fill us in a bit more on what these terms mean and what important role they play in generative AI systems.
David Foster: 00:39:32
Yeah. Cool. Great question. I'll take it back, actually, to an example with images, 'cause I think it's slightly easier to visualize for your listeners. So, let's imagine we've got an image and it's, like, 1024 pixels, so a high dimensional space. Every single one of those pixels has three color channels. So, you've got a lot of numbers, basically, that describe that picture. And as we previously mentioned, there is some density function that describes why that image is very likely to be a true image and other noisy images aren't. Now, what autoencoders look to do is say: can we map this high dimensional space of the image domain to what is known as a latent space, of a lower dimension? You could even map to a latent space of two dimensions. Then it's very, very easy to visualize.
00:40:18
You're just imagining a plane. And then on that plane there are mountains and valleys, and that determines whether some of those points in the latent space are likely to be generated and some aren't. Now, the reason why this is useful is because it forces the model to make generalizations over the pixel space, so that it can compress that information into the latent space. It's a bit like – I always give the example of cookie jars or biscuit tins, which are cylindrical – if someone said to you: how many numbers do you need to describe the shape of that biscuit tin? The answer is two. You need to know the height, and you need to know the diameter of the cross-sectional circle. If you know those two numbers, you could reproduce the biscuit tin.
00:41:04
So, even though this thing exists in three dimensions, and you could view it from different angles and come up with different pixel pictures of it, actually you can describe it using two. And in that latent space you could basically move around to produce different kinds of cylinder. And exactly the same thing is true with models like diffusion models or even variational autoencoders. You're basically saying to the model: find me a low dimensional latent space where, if I choose any point within that latent space over some distribution – like a normal distribution centered on the origin – I am pretty likely to find something that is truly a real image. And then what the decoder is trying to do is move back from the latent space to the pixel space: can you take these two numbers and recreate the biscuit tin? So, if you join those two things together, you've got what's known as an autoencoder, because it's trying to compress the information down to something small and then expand it back out again to the original image. It's autoencoding itself.
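A minimal Keras autoencoder along the lines David describes might look like the sketch below – our own illustration with made-up layer sizes, not the book's exact code:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Encoder: squash a 32x32 RGB image down to a 2-D latent point.
encoder = keras.Sequential([
    layers.Flatten(input_shape=(32, 32, 3)),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, name="latent"),  # the "two numbers for the biscuit tin"
])

# Decoder: expand the 2-D point back out to pixel space.
decoder = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(2,)),
    layers.Dense(32 * 32 * 3, activation="sigmoid"),
    layers.Reshape((32, 32, 3)),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct the input itself: the inputs are also the targets.
dummy_images = np.random.rand(64, 32, 32, 3).astype("float32")
autoencoder.fit(dummy_images, dummy_images, epochs=1, verbose=0)
```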
Jon Krohn: 00:42:12
Nice. Great explanation. I love that three-dimensional biscuit tin cylinder represented in two dimensions. That's such a crisp way of describing how this latent space can contain information like that. Awesome. So, there are different kinds of autoencoders. We've got variational ones, for example, which are more popular today. So, how do variational autoencoders differ from traditional ones? What unique capabilities do they offer?
David Foster: 00:42:40
Yeah, so, the problem with vanilla – let's call them vanilla autoencoders, so not variational – is that there are a few problems. First of all, if you just let the model map to any old latent space – like you just say, take the pixel space and I just want you to find two numbers that kind of represent what that image is, so that you can decode it – the problem is it's very hard to sample from that two-dimensional space. Okay, so, let's say the point (100, 100) – is that a valid image? What about (200, 200)? Or (2 million, 2 million)? Like, where in this still pretty vast space should I be sampling? And so, what you end up with is a latent space that is firstly very difficult to sample from, and without much structure. It's got no incentive to pull similar concepts together, because ultimately it's unconstrained.
00:43:32
What a variational autoencoder does is it makes a very slight change to the loss function. It effectively says: you've got to include a term which makes sure that the points, when you map them into this latent space, are as close to a standard normal distribution as possible. By a standard normal distribution, what I mean is a normal distribution with a mean of zero and a covariance, or standard deviation, of 1. So, we know how to sample from this object. It's really common and we know exactly how it works. And by doing that, what happens is, first of all, everything gets compressed to something that looks like a normal distribution, and that helps us in two ways. First of all, it means that there is a degree of continuity in the latent space. So, you can move around it and be pretty sure that anything within this normal distribution is gonna be something that's likely to be a real image. And if you move to the extremes, then you're gonna find something that's less likely. 
00:44:30
But we understand what this distribution means, so we can sample from it really easily and make sure that, if we're choosing random points from a standard normal distribution, we are gonna be able to decode these points to a real-looking image. So, the variational autoencoder is a bit like the glue that holds everything together and makes it a true generative model that we can sample from – and not just this abstract autoencoder object, which isn't very easy to work with, basically.
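In code, the key change is that the encoder outputs a mean and a log-variance rather than a single point, and the latent is sampled with the reparameterization trick. A small sketch of that sampling step (our simplification, with toy numbers):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * epsilon,
    # with epsilon drawn from a standard normal.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy encoder output for one image: a 2-D latent distribution.
mu = np.array([0.3, -1.2])
log_var = np.array([-0.5, 0.1])
z = sample_latent(mu, log_var, np.random.default_rng(0))
print(z)  # a point near mu, jittered by the learned spread
```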
Jon Krohn: 00:44:59
Nice. The future of AI shouldn’t be just about productivity. An AI agent with a capacity to grow alongside you long-term could become a companion that supports your emotional well-being. Paradot, an AI companion app developed by WithFeeling AI, reimagines the way humans interact with AI today. Using their proprietary Large Language Models, Paradot A.I. agents store your likes and dislikes in a long-term memory system, enabling them to recall important details about you and incorporate those details into dialog without LLMs’ typical context-window limitations. Explore what the future of human-A.I. interactions could be like this very day by downloading the Paradot app via the Apple App Store or Google Play, or by visiting paradot.ai on the web.
00:45:44
Great explanation. Crystal clear. So, variational autoencoders allow us to constrain the latent distribution to a standard normal, which leads to better behavior in the autoencoder. We get better results. 
David Foster: 00:45:57
Precisely. Yeah. And the term in the loss function that does that is called the KL divergence – Kullback–Leibler divergence. And it's the glue that makes it the first kind of generative model I would recommend everyone starts with.
Jon Krohn: 00:46:08
Nice. Yeah. That KL divergence, that’s big in information theory, right?
David Foster: 00:46:12
Yeah, it's a way of measuring the difference between two distributions. So, if you've got your distribution of points and you want to compare it to the standard normal, you could use the KL divergence to do that. The beautiful thing about this is that it's actually got a closed-form solution for a standard normal, which means that you don't actually need to do any sampling to work out what this value is. You can just write down the answer, which is super powerful. 
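To illustrate that closed form: for a diagonal Gaussian measured against a standard normal, the KL divergence per dimension is 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1), which you can write straight down. A quick NumPy check – our own illustration – with a Monte Carlo estimate included to show that sampling converges to the same answer:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed form: KL( N(mu, sigma^2) || N(0, 1) )
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)

mu = np.array([0.3, -1.2])
log_var = np.array([-0.5, 0.1])
print(kl_to_standard_normal(mu, log_var))  # exact answer, no sampling needed

# Sanity check by Monte Carlo sampling -- converges to the same value.
rng = np.random.default_rng(0)
z = mu + np.exp(0.5 * log_var) * rng.normal(size=(100_000, 2))
log_q = -0.5 * ((z - mu)**2 / np.exp(log_var) + log_var + np.log(2 * np.pi)).sum(axis=1)
log_p = -0.5 * (z**2 + np.log(2 * np.pi)).sum(axis=1)
print(np.mean(log_q - log_p))  # approximately the same number
```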
Jon Krohn: 00:46:34
Cool. All right, so, we've got kind of the key terms now under our belts for generative AI. We know about density functions, we know about the application areas, we know about autoencoders. So, let's talk now about that big breakthrough that captured the public's imagination. Even before you signed your book deal a year ago, there was already a lot of hype around DALL·E. This was released by OpenAI, the same company that released ChatGPT. It's a text-to-image generator, and the original DALL·E – while miles behind the DALL·E 2 that came out shortly after you signed your book deal – even the original DALL·E, for some kinds of requests, created stunning imagery. On their website, for example, there were examples of being able to say, I want a shark in a tutu walking a crocodile, or whatever, and it could create a cartoon of that.
00:47:37
And, you know, compared to DALL·E 2 or Midjourney, it wasn't that many pixels. It was definitely better at cartoony-type stuff relative to photorealistic stuff. But still, this was the first time that I, and probably most people, were able to have this unbelievable creative outlet of being able to take any text that comes to your head and automatically generate an image. So, that DALL·E model leverages diffusion, and your book has an entire chapter devoted to diffusion. So, can you explain what diffusion is and how noise can be employed in the generation process? 
David Foster: 00:48:17
Yeah, sure. Let's start with diffusion then. So, DALL·E actually is made up of a few components, and diffusion is used in a few of them. It's definitely a core component of DALL·E 2. So, yeah, a great place to start is to first of all explain what diffusion is. The best way I can describe it, using kind of a metaphor, is: imagine that you've got a set of TV sets all linked together in a long line, and the first TV shows just random noise – complete random static. And the last TV in that sequence shows an image from your data set. Now, if you want to move from the image in your data set on that television to the random noise, it's very simple. You can just add tiny, tiny bits of random noise to that image in tiny steps – Gaussian noise, which basically means noise sampled from a normal distribution – and eventually, over enough time steps, you won't be able to tell what that image was. It's basically as good as random noise.
00:49:16
So, you’ve moved from the image domain of your data set through to the noise domain, which we can sample from. And with generative AI, where you’re always trying to get to is: can I sample from this thing? Because if you can sample from it, that means you’ve got a random point that you now just need to decode. We talked just now about encoders and decoders — adding noise is a bit like encoding. It’s not quite the same, because it’s not a learned model; it’s just noise addition. But the beauty of a diffusion model is that it learns the reverse process. It learns how to undo the noise and get back to the original image. Now you might say, well, how on earth does it do that? How do you find an image out of random noise?
00:50:01
But you can think to yourself, well, if you do this in small enough steps, then it’s kind of possible. Let’s imagine your data set was just images of houses outdoors. Most of the time the upper pixels will be blue, because they’re the sky, and you’re gonna have some maybe greeny pixels down the bottom. So, to get from random noise to an image, you might train a model to say, well, let’s try to keep some of the green pixels at the bottom — I think they’re the ones that need to be adjusted in such a way that they’re slightly more green — and the pixels at the top, I want you to adjust those in such a way that they stay roughly more blue than the pixels in other parts of the image.
00:50:47
And it turns out that if you do this over enough time steps, in small enough steps, the model — by taking what it already has and making a slight adjustment that makes it slightly more like an image — can turn random noise, almost magically before your eyes, back into something in the image domain. As for how the diffusion model actually works, the nuts and bolts of it, it’s something called a U-Net model. Unlike a variational autoencoder, which tries to move from, say, the latent space back to the original pixel space in the decoder, this U-Net model simply maps the image to another variation of the image with slightly less noise. That’s what it’s trying to do. And if you do this over enough time steps, it turns out you can train a pretty good model to learn how to decode noise back into the original image domain. So, that’s how they work. Diffusion models are all about U-Nets, and they’re all about adding noise through a forward step and then trying to remove the noise through a backward step.
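As a concrete aside, the forward (noising) step David describes can be sampled in closed form, in the style of DDPM-type diffusion models. This is a minimal sketch; the noise-schedule values and image stand-in are illustrative:

```python
import numpy as np

# Illustrative noise schedule over T time steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # variance added at each step
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal retained by step t

def add_noise(x0, t):
    """Forward diffusion: sample x_t directly from x_0 via the closed form
    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    noise = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise  # the U-Net is trained to predict `noise` given (xt, t)

x0 = np.random.rand(64, 64, 3)          # stand-in for a training image
xt, eps = add_noise(x0, t=500)          # halfway to pure static
```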
Jon Krohn: 00:52:00
Nice. And so, I guess that’s how Stable Diffusion works as well — that’s what’s behind Midjourney. At the time of recording, Midjourney version 5 is the state of the art; it creates amazing photorealistic images. So, this same kind of approach is behind there, probably just scaled up, right? A larger model architecture and more training data.
David Foster: 00:52:23
Yeah. And the beauty of Stable Diffusion is in an advancement they made called latent diffusion. This is where all the ideas we’ve talked about tie together, because what latent diffusion does is work in the latent space. There’s actually an initial part of the model that tries to compress the image down to something that isn’t pixels anymore — effectively, a latent space of concepts. The diffusion model works just on this latent space. And then there’s a decoder, effectively, that sits after it and takes the denoised latents back into pixel space. What they realized was that you don’t need to work on the pixel space itself, because it has a lot of redundant information. You can work in a much smaller and faster latent space. That’s the beauty of it. That’s why it’s so good.
Jon Krohn: 00:53:15
Nice. That makes perfect sense. So, the distinction between latent diffusion — this newer technique that powers, say, Midjourney version 5 — relative to the diffusion that’s been around for all these years, all these months, is that it allows for diffusion on the latent space. As we talked about earlier, in our discussion of autoencoders, an encoder maps into a latent space that we then need to decode later — the latent space there being like your 3D biscuit tin that can be represented with just two pieces of information. Similarly here, when we’re doing diffusion on the latent space, we’re doing diffusion on more compressed information. And so, it’s more computationally efficient, easier to scale up, and we get better results.
David Foster: 00:54:03
Yeah. Perfect. Exactly that. 
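For orientation, here is a rough sketch of the inference flow Jon and David just summarized — diffuse in latent space, decode once at the end. The `denoise` and `decoder` functions are hypothetical stand-ins for the trained components, and the shapes are illustrative:

```python
import numpy as np

def generate_image(denoise, decoder, prompt_embedding, T=50):
    """Latent diffusion at inference time: reverse diffusion happens in the
    compressed latent space; pixels only appear in the final decode."""
    latent = np.random.randn(64, 64, 4)            # noise in latent space
    for t in reversed(range(T)):                   # step-by-step denoising
        latent = denoise(latent, t, prompt_embedding)
    return decoder(latent)                         # latents -> pixel space
```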
Jon Krohn: 00:54:04
Nice. Cool. And a related topic is CLIP models. So, what are CLIP models, and how are they leveraged in these kinds of text-to-image tasks we’ve been talking about, like DALL·E and Stable Diffusion?
David Foster: 00:54:19
Yeah. Cool. So, a CLIP model is one part of DALL·E 2, and I’ll come onto exactly which part and how it’s used, because CLIP itself isn’t a generative model. CLIP actually uses a technique called contrastive learning to effectively map pairs of text and images. So, imagine you’ve got a data set with loads of pairs of images and their corresponding descriptions — say, a picture of a field with a tractor in it, and a text description that says this is a field with a tractor in it on a sunny day. What CLIP does is try to learn a model that can match the image to its matching text description. And the way it does that is by training two different kinds of transformer, which we can come onto the details of. A transformer for the text side basically says, can you encode this text description into a vector? And the transformer on the image side says, can you encode this image into a vector?
00:55:21
And then what it’s doing is taking these two vectors and quite simply calculating the cosine similarity between them. You want true pairs to have a very high cosine similarity score, and you want mismatched pairs to have a very low similarity score. That is what the CLIP training process does. If you imagine the images on the rows and the texts on the columns, it tries to produce something like an identity matrix: along the diagonal you get very high scores, because these are the matching pairs, and on the off-diagonal you want the scores to be as small as possible, because you don’t want those things to be regarded as similar. So, it’s a bit like a recommendation algorithm, you know — is this image recommended to be with this text? And so, this isn’t generative, right? We’re not gonna be producing more images through this.
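The diagonal-versus-off-diagonal picture David paints can be sketched in a few lines. This is an illustrative fragment, not OpenAI’s implementation; the batch size and embedding dimension are made up:

```python
import numpy as np

def clip_similarity_matrix(image_embs, text_embs):
    """Cosine similarity of every image (rows) against every text (columns).
    True pairs sit on the diagonal; mismatches sit off-diagonal."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return image_embs @ text_embs.T

# A batch of 4 matched (image, text) pairs embedded in a shared 512-d space
sims = clip_similarity_matrix(np.random.randn(4, 512), np.random.randn(4, 512))
# Training pushes the diagonal of `sims` up and the off-diagonal down,
# typically via a symmetric cross-entropy loss over rows and columns.
```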
Jon Krohn: 00:56:18
One of the cool things about this — and OpenAI released the CLIP model standalone as well — follows on from what you were just describing: it allows you to have an image classification algorithm that didn’t necessarily have the label you’d like to extract in the training data. Ten years ago, and up until very recently, the state of the art in image classification was models like AlexNet — the model Geoff Hinton co-authored that came out in 2012 — trained on the ImageNet dataset, which had tens of thousands of different labeled categories: cats, horses, tons of different kinds of dogs. They wanted the model to demonstrate not only that it was good at classifying a wide range of images, but also that, within a specific category of images, it could distinguish fine details — say, a Yorkshire Terrier from an Australian Silky Terrier, even though these are extremely similar-looking dogs. And so the state of the art — going back to one of our first topics in this conversation, discriminative models, where we discriminate down to specific class labels — was that even with tens of thousands of labels, you could not use a model trained in that discriminative approach to guess a label outside of the labels it’s been trained on. But with CLIP, we get exactly that. With CLIP, you can just ask it to label images it’s never seen before, in class categories it’s never seen before, using the approach you just described to map to any natural language.
David Foster: 00:58:17
Yeah, precisely. And the reason it can work is that it’s encoding everything into the same latent space. It doesn’t matter if it’s not a label in the data set — you can make it a label by pushing it through the encoder, whether it’s an image or a text.
Jon Krohn: 00:58:29
Right. So, the meaning that is embedded in this latent space, we can extract either visually or linguistically.
David Foster: 00:58:40
Exactly. And that’s what DALL·E 2 excels at. It basically takes the text embedding from your input — say you’ve written something like, I want to see a cat riding a skateboard — and tries to predict what the corresponding image embedding looks like. That’s called the prior. And the final step takes the image embedding and uses diffusion to generate the image. So, it’s a three-step process. Text goes through the text encoder to create the text embedding — that’s just the CLIP text embedding. You’ve then got a prior, which sits in the middle and says, now go and predict what the equivalent image embedding looks like in the latent space of the image model. And then just decode it. I mean, I say “just” — there’s a lot of work that’s gone into that — but that is how DALL·E 2 works.
Jon Krohn: 00:59:24
Nice. Okay. Super cool. So, this CLIP approach is great not only for associating natural language that wasn’t in the labeled training data, but also for allowing DALL·E 2 to be so much more effective than its predecessor, DALL·E. And I guess we’ve already talked about it — I was going to ask how CLIP can be used for zero-shot prediction, but I think we’ve covered that. This idea of zero-shot prediction is using a machine learning model, typically a large language model, to do some task it wasn’t trained on, without any training examples at all. You take the model weights as they were trained, and you say, “do this task” — you know, is there a skateboard in this image? — and it can answer that question even if it’s never been trained to do that.
David Foster: 01:00:24
Precisely. Yeah, that’s exactly it. Even if you’ve never given it that task before, it can have a good go at it.
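In code, the zero-shot trick Jon describes might look something like the sketch below, where `text_encoder` and `image_emb` are hypothetical stand-ins for CLIP’s text encoder and an already-encoded image:

```python
import numpy as np

def zero_shot_classify(image_emb, candidate_labels, text_encoder):
    """Score an image against label prompts it was never trained on by
    comparing embeddings in the shared latent space."""
    prompts = [f"a photo of a {label}" for label in candidate_labels]
    text_embs = np.stack([text_encoder(p) for p in prompts])
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    scores = text_embs @ image_emb                 # cosine similarity per label
    return candidate_labels[int(np.argmax(scores))]

# e.g. zero_shot_classify(image_emb,
#                         ["Yorkshire Terrier", "Australian Silky Terrier"],
#                         text_encoder)  # labels need not appear in training data
```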
Jon Krohn: 01:00:31
Sweet. All right, we’ve got lots of great foundational generative AI knowledge now under our belts. A really cool topic that we alluded to earlier in the episode is world models. You’ve got a chapter in your book dedicated to it. What are world models, and how can a model learn inside its own dream environment?
David Foster: 01:00:55
Yeah, I love this topic. It’s so fascinating to me. The reason I started writing the book in the first place was actually a 2018 paper by David Ha and Jürgen Schmidhuber, simply called World Models. It’s like a collision between two of my favorite fields: generative AI and reinforcement learning. In the paper they describe how you can build an agent — in reinforcement learning, an agent is something that takes actions within an environment. And the agent has within it the variational autoencoder we’ve just talked about. What that’s doing is trying to collapse down what it’s seeing — in the paper, it was a car racing around a track — into a latent space, which it can predict chronologically.
01:01:44
So, it’s now trying to model how its future looks, given its latent understanding of what it’s seeing and the action it’s just taken. And this is where everything collides for me: you’ve got the variational autoencoder creating the latent space of the environment and understanding what it’s seeing. You’ve then got an autoregressive model — they used an RNN, a recurrent neural network, in the paper — which tries to predict autoregressively how that latent space will evolve over time given the agent’s actions. And then you’ve got reinforcement learning, an entirely different field, which says: how do you take actions that maximize the reward, when the environment you’re in is your own hallucination of how this latent space evolves? And the latent space, of course, includes how the reward evolves over time and what kind of episode reward you’re gonna get.
01:02:39
So, I love this field because a world model, for me, encapsulates everything about machine learning that we’ve learned so far. There’s discriminative stuff involved, but also a generative component and a reinforcement learning component. I think this is a really powerful concept for teaching agents to behave in an environment with their own sort of generative understanding of how that world operates. It feels very close to how we do it as humans. When we’re learning a new topic, it’s not really something where we expect the environment to give us a nicely packaged-up reward function; we seem to have an inherent understanding of how the world operates, and then we layer our actions on top of this understanding. So, if I’m shooting a basketball through a hoop, I kind of know what’s gonna happen, because I can imagine what the action is gonna do to my latent interpretation of what I’m seeing. And that makes me learn — I mean, I’m still terrible at it, but in theory it should make me learn a lot faster, because I have an internal representation. I’m not just operating on the pixel space of my eyes. So, world models are the reason I wrote the book, really. I’ve got a lot to owe to them.
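For orientation, here is a minimal sketch of the division of labour in the Ha & Schmidhuber architecture David describes. The names `vae`, `rnn`, and `controller` are hypothetical stand-ins for the trained vision, memory, and policy components; only the wiring is the point:

```python
def world_model_step(vae, rnn, controller, observation, hidden_state):
    """One agent step: compress pixels to latents, act from the latent state,
    and advance the 'dream' model that predicts how the latents will evolve."""
    z = vae.encode(observation)                       # V: pixels -> latents
    action = controller(z, hidden_state)              # C: choose an action
    hidden_state = rnn.step(z, action, hidden_state)  # M: predict next latents
    return action, hidden_state
```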
Jon Krohn: 01:03:55
Super cool. All right. So, world models blend variational autoencoders, autoregression, and deep reinforcement learning to allow machines to visualize — to imagine, to dream — some time steps into the future, as to what the most likely outcomes are given the current state. And this allows them, with the deep reinforcement learning component, to then take actions that achieve some objective. Just to break down a few of the terms you used there from reinforcement learning: you talked about a reward function, and you also talked about agents. Reinforcement learning has been around for decades, and it’s a class of machine learning problem where you want an agent — which could be a person or a machine — to take a series of actions.
01:04:55
A really big example of deep reinforcement learning in recent years is the AlphaGo algorithm by Google DeepMind, which was able to beat the world’s best Go player. This is the kind of thing where you have a board game with a sequence of actions, and you want the agent to predict which actions are likely to lead to winning the game of Go — or winning a video game; Atari video games were a very popular choice a few years ago for training these deep reinforcement learning algorithms. And I should say that a reinforcement learning algorithm is a deep reinforcement learning algorithm when we use deep learning to solve the reinforcement learning problem.
David Foster: 01:05:39
Exactly. 
Jon Krohn: 01:05:39
And I think that ties together all the terms. Oh — a reward was the last one. In reinforcement learning, let’s say we have the agent playing a video game. We provide it with the pixels on the screen, and that’s the state of play. But in addition, we have a reward function, which in video games is often really easy — which is why Atari games were such a popular choice for tackling deep reinforcement learning problems: they have an inbuilt score. Like, all Atari games have a point score that we’re trying to maximize. So, we feed that reward to the algorithm, and it learns: okay, if I take this action — if I press right on the joystick or left on the joystick — is that likely to increase my reward in the future, or decrease it, or keep it flat?
01:06:28
And so, reinforcement learning algorithms are trying to maximize their reward. Your point there was that with most reinforcement learning approaches — in fact, as far as I was aware until this conversation, all reinforcement learning approaches — we had to make explicit the reward function that the algorithm is trying to maximize. So, if we go outside of the video game scenario — say we’re teaching an algorithm to drive a car — we’d have to manufacture some function. Like: you get one extra point for every meter traveled toward a destination, but you lose a thousand points if you hit a pedestrian.
David Foster: 01:07:13
Yeah, exactly. 
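Jon’s hand-crafted driving reward might be written like this — a purely illustrative sketch, with made-up point values:

```python
def driving_reward(meters_toward_destination, hit_pedestrian):
    """An explicit, manufactured reward function of the kind Jon describes."""
    reward = 1.0 * meters_toward_destination  # +1 per meter of progress
    if hit_pedestrian:
        reward -= 1000.0                      # large penalty for a collision
    return reward
```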
Jon Krohn: 01:07:13
And so, what you were just saying is fascinating to me, because I think you said that with these world models, we can have a deep reinforcement learning model learn real-world problems without needing to specify explicitly what that reward function is.
David Foster: 01:07:32
Yeah. It’s a case of the world model itself not needing the reward function. The world model is simply trying to understand how its actions can be used to model and predict how the environment will move in future. The power of it is that you can then layer a particular task on top. And of course, that task would have to have a reward function, but this is a lot faster than learning a reinforcement learning task from scratch with a reward function. It’s almost like the world model gets you 80% of the way there, because you have an inherent understanding of the physics of your environment before you say to it, now try and drive the car fast. In the paper, for example, what they do is train the world model completely task-independently. There’s no reward. They just say: take some actions, observe what happens. Drive the car forward, drive the car left, drive the car right, brake, and see what happens to your observations. Don’t worry about going fast — just drive and see what happens, randomly. Which feels like what a baby does when it’s crawling around on the floor. My eight-month-old is doing this, hopefully more and more every day — up until the point where we definitely don’t want her to do it.
Jon Krohn: 01:08:48
You’ve been raising a newborn baby the entire time you’ve been writing this book?
David Foster: 01:08:53
Yeah, it’s been mad. It’s sort of a blur, to be honest with you. I think the thing that’s been sacrificed is sleep. So, that’s how it is. But I’m delighted to have a new daughter, and actually the book’s dedicated to her. So [crosstalk 01:09:14]
Jon Krohn: 01:09:14
Yeah, the loveliest noise vector of them all. Alena.
David Foster: 01:09:20
Yeah, that’s the one. Exactly. So, she’ll be embarrassed by that in about 16 years’ time, I think — but hopefully by then the generative AI hype will have died down.
Jon Krohn: 01:09:32
Cool. All right. So, fascinating area. Now, the final topic area I want to get into, at least related to your book, is GPT. To some extent [crosstalk 01:09:45] with this, I was like, should we even just have started the episode with the GPT stuff? But I think that by going through these foundational concepts, it allows us to get more into the weeds on GPT and how it relates to generative AI — generative deep learning — than we could have if we’d just started with it. So, GPT: generative pre-trained transformers. They have become by far the most widely known transformer model. In fact, I recently learned that OpenAI is trying to trademark the term GPT.
01:10:21
So, these three letters: generative pre-trained transformer. Generative — obviously, like everything we’ve been talking about in this episode so far — means it generates something. In this case, it generates text; at least for now, that’s all it does, and I’m sure that’ll change soon. Pre-trained means it can do the kind of zero-shot learning we described: it can perform lots of kinds of tasks. It’s trained on so much data and has such rich encodings of meaning that we can ask it to do something it’s never encountered before — something nobody has ever thought to ask a machine or a person to do before — and it can do it, at least in the case of GPT-4, magnificently. So, that’s G, generative, and P, pre-trained. Then, David: T — what is a transformer?
David Foster: 01:11:16
Yeah. So, transformers came into the world in 2017. It seems like a lifetime ago, but when you think about it, it’s only five or six years. And they’re based around this concept called attention. To understand transformers, you first of all have to understand what attention is, because attention is at the heart of the whole transformer architecture — the large majority of it is just how these attention mechanisms work and how you build them up together into what’s called multi-head attention. So, let’s talk about attention first. Attention is basically a different way of modeling sequential data that’s the complete opposite of the way recurrent neural networks do it. A recurrent neural network says: okay, I’ll take each token one at a time, in sequence, and update my latent understanding of what this sentence or stream of tokens means so far.
01:12:11
Then I’ll get to the end of the sequence and try to use that latent understanding to predict the next token, because I’ve built up enough understanding in this vector to do so. Attention takes a different approach. It says: what you need to do instead is have access to all previous tokens within your context window. Don’t try to maintain a hidden state — there’s a ton of problems associated with that, as I’ll explain in a minute. Instead, look at those previous tokens and, first of all, make a decision about where you think the information you need lies. So, instead of trying to incorporate all information from all tokens, the first step is simply to ask: where do you want to look? Part of this model is about building up an understanding of where it needs to look for information.
01:13:01
So, an example would be: “the elephant tried to get into the car, but it was too…” — big, right? The missing word is something to do with its size. Now, what are we using to fill that in? The word “elephant” is clearly important. “Car” is important, because we need to understand what it’s trying to get into. But say it was “the pink elephant” — the color pink is just irrelevant to this whole scenario. Having said that, if we change the context slightly and say “the pink elephant was trying to hide”, then suddenly the color becomes all-important: a pink elephant is probably harder to hide than an elephant that’s a different, darker color. So, the attention mechanism says: first of all, come up with a way of combining what you’re trying to do — which is known as the query — with all previous context tokens, which are known as the keys.
01:13:53
And a little bit like the CLIP model we just talked about, it’s constantly comparing the key and the query, then pulling through a certain amount of information from that token — which is called the value — and combining it in a clever way, through matrix multiplication with learned weights, into the next latent understanding, which is passed to the next layer, and so on. Build enough of these layers and you get such depth of understanding of the entire context of the sentence that, it turns out, you can mimic intelligence. That’s what GPT-4 does. So, that’s basically how attention works. And with transformers, you really don’t need to know much more than that. There are a few extra layer types, like layer normalization and positional encoding — which basically tells the model where in the sentence a particular word sits — but ultimately, what you’ve got to know is that it’s all about attention.
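A minimal sketch of the query-key-value mechanics David just walked through — standard scaled dot-product attention, with illustrative shapes:

```python
import numpy as np

def attention(queries, keys, values):
    """Compare each query against all keys, softmax the scores into weights,
    and pull through a weighted mix of the values."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # where should each token look?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ values                    # weighted sum of information

# 5 tokens with 64-dimensional query/key/value vectors (illustrative shapes)
q, k, v = (np.random.randn(5, 64) for _ in range(3))
out = attention(q, k, v)  # each output row mixes information from all 5 tokens
```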
Jon Krohn: 01:14:46
Attention is all you need, you might say. 
David Foster: 01:14:49
Precisely. Yeah. That punchy title is still one of the biggest memes in all of deep learning, I think, which is cool.
Jon Krohn: 01:14:56
So, there are different kinds of architectures that rely on transformers in different ways. GPT relies heavily on the decoder part of a transformer. Earlier in today’s episode, we talked about encoding and decoding with a latent space. Encoding takes, say, text — or in your analogies, you prefer images, but whatever — tokens of natural language or pixels of images, and encodes that into the latent space. And then the decoder part of an autoencoder decodes that lower-dimensional representation into some desired output, which could again be text, or an image, or video, or code. Similarly, transformers encode and decode, but some architectures like GPT rely more on the decoder part, whereas other architectures like BERT — which came out a few years earlier but is still enormously useful for a lot of applications — only have an encoder. So, why would somebody want to encode only? What are the key differences between these encoder-based transformers and GPT?
David Foster: 01:16:14
Right. This is the biggest misunderstanding I come across when people talk about transformers: this differentiation between encoders, decoders, and everything in between. There are some architectures that have both — the very first transformer, actually, and I think this is where the confusion comes from, was an encoder-decoder architecture, which means it had both. And so people now think that all transformers are still based around this initial architecture, and they’re not. As you rightly pointed out, GPT is decoder-only; they dropped the encoder. So, what’s the difference? Well, there’s basically one difference you need to know about, and that’s something called masking. An encoder like BERT doesn’t care where it pulls information from in the sentence to build a contextual understanding of a particular word. It can look forwards in the sentence, and it can look backwards.
01:17:05
So, let’s say I wanted to come up with an embedding for the token “elephant” in that previous example. An encoder can look into the future of the sentence and pull information from future context in order to come up with a realistic embedding for that word. A decoder can’t do that, because if you want your model to produce, to generate, to go into the future, you can’t rely on future information — it doesn’t exist yet. So, the only difference is that a decoder says: “Mask future information at every step of the process. Don’t ever pull information from the future. Only use where you currently are to determine the next token.” And that’s why you can use a decoder model like GPT for generation, but you can’t use an encoder model like BERT. BERT is for natural language understanding; it’s not for natural language generation. That’s the difference.
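The masking difference David describes comes down to one matrix added to the attention scores. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Decoder-style masking: position i may attend only to positions <= i.
    Future positions get -inf so they vanish after the softmax."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

# In a GPT-style decoder this is added to the scores before the softmax:
#   scores = q @ k.T / np.sqrt(d_k) + causal_mask(seq_len)
# An encoder like BERT simply omits the mask and attends in both directions.
print(causal_mask(4))
```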
Jon Krohn: 01:18:01
Cool. Yeah. So, NLU versus NLG. BERT, an encoder-only architecture, we use for natural language understanding: we can take natural language, encode it into this space, and then do useful things with that. We could train a discriminative model on top of the encoder — we could use it to classify text: does this have a positive sentiment or a negative sentiment, that kind of thing. Whereas these decoder-only architectures like GPT specialize in sequence generation.
David Foster: 01:18:36
Yeah. And the thing you should use to determine which one you need is: if you want to build something on top of it, like a discriminative model, as you say, then you should be looking at encoder architectures. If you want to produce a word — like the next word in a sentence — look at decoders. Now, there are some examples like GPT-4 where you can actually do pretty good discriminative stuff using a decoder model, because you can just get it to output the predicted token. So, decoders are kind of ruling and dominating at the moment, because they’re just incredibly powerful generalist learners.
Jon Krohn: 01:19:07
Yeah. But if you want to encode language to do a classification task, you could probably be more computationally efficient using an encoder-only architecture like BERT.
David Foster: 01:19:22
Definitely. And there are small versions of these things, like DistilBERT, which you can fit on smaller hardware. So, our first port of call when we’re approaching this kind of stuff is always to go for the encoder models first and see how they do. You’re in sort of dangerous territory with decoders, because you don’t actually know what they’re going to produce next, whereas with an encoder you’ve got the vector, so you can do what you want with that.
Jon Krohn: 01:19:46
Nice. And so, we talked earlier about music and how that’s one of the more exciting areas for you. We do have some isolated cases of well-known music generation by AI — like there was a song, which candidly I haven’t listened to, not really my genre, featuring Drake and The Weeknd, two of Canada’s best-known musical artists. It’s actually wild to me, as a Canadian, how — in comedy, in acting in general, and in music — Drake was the most dominant person globally in music for years, and he’s Canadian, and then he’s replaced by The Weeknd, who’s also Canadian. But anyway, somebody took it upon themselves to create an AI-generated track where Drake and The Weeknd appear together. And if my memory serves me, they sing about being in love with Selena Gomez.
David Foster: 01:20:48
Yeah, something like that. I’m in your boat — I haven’t listened to it either; it’s not my genre. But I think that’s correct; obviously I’ve heard the story. And briefly, on the Canadian thing: there’s a guy called David Foster who’s really famous as well, a musician. Every time you Google me, you just come up with the Canadian David Foster — which I quite like, to be honest. I can hide behind him.
Jon Krohn: 01:21:08
All the musicians, and all the David Fosters, you need to track are Canadian. So, yeah — how can we use transformers for music generation? I think they can play a key role in doing it well, right?
David Foster: 01:21:21
Yeah, definitely. I think, you know, the first part [inaudible 01:21:23] for everybody doing any sort of generative task these days is transformers, and music’s no exception. We cover this in my book: we go through the process of single-track music, where you’re looking to generate a single stream of notes. That in itself has problems, because you have to care not only about pitch but also about duration. Unlike text tokens, where you’re just dealing with a single integer and words come in discrete units, one at a time — there’s no such thing as duration — in music you’ve got to care about not only where the note is pitched harmonically, but also how long it is. So, there’s a modeling choice to be made about how you do that, and there are a few ways of doing it.
01:22:03
You can code up the duration as its own token, or you can model both streams in parallel, treating it almost like a dual stream of tokens. But ultimately, you use the same ideas as in text modeling. You’ve still got attention, which looks back at previous notes and decides which note should come next — and it makes sense harmonically: if you’re in the key of D, you’ve got notes that also follow in the key of D. So, it’s the same idea; there’s a grammar to music, just like language. But then we also talk about polyphonic music, which means music with more than one track at once. You’ve got a ton of challenges there, like: what do you do about parts that just drop out for a few bars?
01:22:45
How do you model it if two parts continue and two other parts drop out? It’s no longer one stream of tokens — you’ve got maybe a four-stream token if you’ve got a quartet of musicians. So, there are different ways of approaching it. One of the first attempts was something called MuseGAN. This was back in the day, I think, when GANs were all the rage, and it looked at how you can model polyphonic music as a picture. Imagine something called a piano roll, which is basically where you draw the notes out — like one of those music boxes with a punch card, where you can almost see the music being fed in as a picture, and then you spin the little crank and it makes a ballerina dance on top. You can see that the way the music is being generated is effectively a picture of the music being fed through. So, you can model it that way. But obviously transformers are now making waves in music generation, even polyphonic music. So, there are lots of different options, but there’s always a modeling choice you need to make up front about how you approach it.
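One of the modeling choices David mentions — coding duration as its own token — might look like the sketch below. The vocabulary and note values are illustrative, not from any real tokenizer:

```python
def tokenize_melody(notes):
    """Interleave pitch and duration tokens so a text-style transformer can
    model both aspects of a single-track melody."""
    tokens = []
    for pitch, duration in notes:
        tokens.append(f"PITCH:{pitch}")
        tokens.append(f"DUR:{duration}")
    return tokens

melody = [("D4", 1.0), ("F#4", 0.5), ("A4", 0.5), ("D5", 2.0)]  # key of D
print(tokenize_melody(melody))
# ['PITCH:D4', 'DUR:1.0', 'PITCH:F#4', 'DUR:0.5', ...]
```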
Jon Krohn: 01:23:52
Super cool. I love hearing about all this music stuff; I’m really excited about it. Something that — and I think this is the first time I’m saying it publicly — I’m really excited about doing is generating music where I’ll be involved. Like you pointed out, there’s a guitar behind me, which people who have watched the video version of the show will have seen; the guitar’s always there. And I actually can play guitar and sing. There was an episode — the year-end episode a couple of years ago, episode number 536 — where I ended with a song I played on the episode. I can’t play guitar very well, but I am competent at rhythm guitar to accompany my voice. And so, I have this idea for attracting really big-name guests.
01:24:40
Like, I’d love to have Geoff Hinton on the show. We’ve had emails back and forth with him, but he’s always too busy. And so, something that I’m hoping would get his attention — or if it doesn’t get his attention, I think it’ll at least get lots of other people’s attention — is performing a song about Geoff Hinton. And I could use these kinds of tools. I haven’t yet experimented much with the generative AI tools for music, but I have this idea that they could enrich the songwriting process, because I could have drums and a bass in the background, AI-generated. So, anyway.
David Foster: 01:25:17
Cool idea. You’ve got to find something that rhymes with Hinton, I think [crosstalk 01:25:20] you’re struggling. Oh God-
Jon Krohn: 01:25:25
GPT-4 will help me out, I’m sure.
David Foster: 01:25:27
Oh, that’s true. Yeah. Good shout. 
Jon Krohn: 01:25:30
And then the very last technical topic for you is GANs — Generative Adversarial Networks. We talked about them really early on in the episode. A few years ago, they were the way to generate things. I haven’t read your first edition, but I suspect it was really heavy on GANs.
David Foster: 01:25:52
Yeah, definitely, it is. I see GANs in many ways as the trailblazer, because a lot of the techniques and the way we approached generative AI were founded through the GAN movement. There was like a GAN a week at one point, and it was kind of a running joke: which GAN are we gonna see this week, to do some niche thing? And I hear people now saying GANs are dead — “I don’t know why you’ve included them in the second edition; it should all just be about diffusion models and transformers.” But first of all, the GAN discriminator is still used in so many really powerful models. The concept of a discriminator constantly operating over the top of whatever you’re using as the generator, to distinguish real from fake, and using that in the loss function, is something that’s still very much alive today.
01:26:43
Take a model called VQ-GAN — Vector Quantized GAN. Variations of it are still among the most powerful image generation models out there. It’s not the case that diffusion models are ruling the world just yet. And StyleGAN-XL, for example, is still incredibly powerful and dominating a lot of the leaderboards. So, look, I never like to chase the latest thing and say this is it, all innovation has stopped. What I hope people take from the book, and into their own learning, is that it’s good to have a general understanding of what’s come before, because you never know what’s gonna come next and what might come back into fashion. GANs are a super interesting idea that I think is gonna be around a long while.
Jon Krohn: 01:27:31
And in addition you have a bit in your book about combining GANs with transformers. 
David Foster: 01:27:36
Yes. Basically, what I would say to anyone looking to get into generative AI is: look for the crossover between these fields. Whilst we bucket them up in the book — now we’re doing the GAN chapter, now we’re doing the transformer chapter — in general, a lot of the powerful models out there actually have components of all of them. Like you mentioned, there is a type of model in the book that effectively has a transformer within it to do part of the encoding of a piece of text, but there’s a GAN discriminator in there as well. And when you look at a lot of these multimodal models, they’ve got diffusion in there, they’ve got GANs in there, they’ve got transformers in there — they’re using the right tool for the job. They’re not saying, well, I’m gonna use one model type and that’s all I’m gonna use. Because a transformer is brilliant at modeling sequences, a GAN is brilliant at telling fake from real, and diffusion models are fantastic at working with very rich latent spaces that you can sample from. The best models out there use all of these techniques, and I think they will in the future as well.
Jon Krohn: 01:28:42
Nice. Really exciting. Briefly — as I imagine this could go on for a long time — what do you see as the future of generative AI?
David Foster: 01:28:53
Yeah, that’s a huge question. Maybe I’ll break it up briefly into technological and societal. Technologically, I think we’ll continue to see the field accelerate, and I don’t see any need for, nor I guess any feasible application of, a pause — I just don’t think it’s feasible to run something like that. So, the field will continue to evolve, but I think we’ll see more emphasis on the alignment side and less on raw power. I think GPT-4 is plenty powerful for us for the time being, and I don’t think GPT-5 is gonna be a huge technological improvement over 4. But I think we’ll see it improve in terms of alignment, and in terms of customer visibility.
01:29:37
And just the stuff that goes around productionizing a model like this — user management, and GPT for business, which I know is coming out — all the things that make it a viable product in the real world that we have control over. So, that’s one thing. And then societally, I think what we’re gonna see is wide-scale adoption of these tools. Like all good technology, they will be baked in to the point where you don’t really know you’re using them. I don’t think people are gonna keep going into ChatGPT to type in their prompts for long; it will be baked into other tools in the market. We’re already seeing this with ChatGPT integrations into different applications, or wrappers around it. So, I think the future is bright. I’m really optimistic. I’m excited by it, and I hope everyone else is too, because it’s just the best thing ever to happen to the machine learning field, I think.
Jon Krohn: 01:30:29
Yeah, as a regular listener, you’ll know that I’m a techno-optimist. Certainly there are issues we need to stamp out, as with any new technology, but the last few months have been mind-blowing for me. GPT-4 has completely — like, every day I do something new with it where I’m like, I can’t believe how well you do this thing.
David Foster: 01:30:51
Yeah. I don’t know how you feel, but I’m amazed at the number of people who haven’t heard of it yet. I know I live in my little bubble of generative AI and data science, and yet you talk to a lot of people who just go, oh yeah, I think that was that thing I saw in a BBC news article. They haven’t even tried it. I feel like I’m living in a really privileged position, having access to this incredible technology before the rest of the world gets to see it. It’s amazing.
Jon Krohn: 01:31:16
Yeah. I was just at the Open Data Science Conference East in Boston last week at the time of recording — so about a month ago at the time of this episode being published — and I gave a brand-new half-day training on NLP with LLMs, and I focused a lot on GPT-4: how you can use GPT-4 to automate parts of your machine learning model development, including things like labeling, but also just how you should be using it all the time in your life. Like, it’s insane not to be paying the $20 a month. I saved so much time — I was able to have so many more coding examples in that half-day training, because whenever I ran into an error, I was like, oh man, tell me why I’m getting this error, and just fix the code. And it does, perfectly.
01:32:07
But I had this really surprising conversation with somebody after I gave that training. He came up to me at a drinks session and said, you know, what would it take for me to train something like GPT-4, but that works in Arabic? And I was like, it does that. You know, whatever you’re looking for, you can translate into Arabic. He’s like, yeah. I’m like, yeah, it does that out of the box. And then he reached out to me later on LinkedIn and said, “Sorry, I didn’t ask that question right. I get that it can translate into Arabic, but what if I want to train an Arabic version of GPT-4 that can do everything?” And I was like, it is that — why are you messaging me? Just try it! Everything you want to do in English, you can just ask it in Arabic, and it’ll output in Arabic. No problem.
David Foster: 01:33:11
Yeah, it’s amazing, isn’t it? And like you say, the barrier to entry is so low — just set up an account and you’re away. It’s even free if you’re running with 3.5 or whatever. Just give it a try. I feel like we’re watching everyone walking around with candles while I’m holding a light bulb, going: this thing’s really useful.
Jon Krohn: 01:33:29
Yeah, yeah, yeah. So, really quickly: beyond raising your newborn daughter and writing this book, you’re also the founder of a consultancy called Applied Data Science Partners, which you run. So, really briefly, tell us what the consultancy does. And I understand that you’re hiring. There are probably listeners out there who’ve been blown away by your impressive depth of knowledge and your clear ability to explain things — no doubt you have a thriving consulting practice — so there are probably people who would love to work with you. Let us know what roles are open and what you look for in your hires.
David Foster: 01:34:11
Sure, yeah. So, our consultancy, Applied Data Science Partners — myself and my amazing Co-Founder, Ross Witeszczak, started it six years ago with the vision of delivering AI and data science in a way that’s practical and sustainable for businesses. Because at the time, we found a lot of the practices were still very throwaway and proof-of-concepty. So, we set up the consultancy to base data science and AI practices around best-practice software engineering — containerization, continuous integration, and all the things you expect from software engineering, built around data science. In terms of our client base, we work with large private institutions all around the world, but also the public sector. We have such a broad range of work — it’s something different every month — which makes us, I think, a really interesting place to work.
01:34:57
Yeah, and you’re right to say we’re hiring — we’re always actively hiring and looking for the best people. There are a few different roles we hire for. Our bread and butter is data scientists: everything from people who are just finishing their degree — we look for people who are hungry to learn, hungry to get stuck in, and who don’t shy away from difficult problems, because we solve difficult problems every day for our clients — right up to leads, people who can lead projects and conceptually understand what a client wants. We’ve got data engineers as well; that’s a different track of our business. They work closely with our data scientists to deliver solutions in a best-practice software engineering way.
01:35:41
And then our analysts as well. We hire people who don’t necessarily have a background in what is traditionally called data science, but who are very, very good at explaining concepts to senior stakeholders. We also hire software engineers — web developers, people who can build applications. As I say, because our consultancy is growing so rapidly, we’re hiring for all of these roles. So, if you have any of those particular talents, we definitely want to hear from you. Tell us why you think you’d be a great fit for the company.
01:36:15
Because what we look for above everything is, first of all, people who are hungry to learn. Secondly, attention to detail is absolutely paramount for our business. We like people who can dive deep into a problem and not get scared by the weeds. Not everything is rosy in business consultancy: you get messy data, you get stuff that doesn’t work, you have to fix problems quickly. So, we’re looking for people who care about the detail. And thirdly, just be a nice person. It’s really easy: just be friendly, be optimistic, be positive, and you’ll find that at ADSP you meet like-minded people with the same attitude.
Jon Krohn: 01:36:54
Nice. Sounds awesome. So, you’re looking for people who are hungry to learn, who don’t shy away from difficult problems, who have great attention to detail, and who are nice. And lots of great data roles out there: data analysts, data scientists, data engineers. And what kind of stack do you use? Python, I guess?
David Foster: 01:37:12
Yeah, Python for pretty much everything we’re building. To take you through some tools and technologies: VS Code is our IDE of choice. In terms of cloud, we’re fairly agnostic — we work with what the client wants, but our recommendation would always be Azure; we work pretty heavily in that stack, so we’re particularly looking for engineers who have that on their CV. In terms of machine learning models, we like to say we use the tool that’s right for the job, so we’re not always gonna go down the neural network route — XGBoost does the job most of the time, or any of the variants like GBM, etc. But obviously for some projects — and especially now we’re getting a lot of work through on generative AI, particularly GenAI strategy —
01:37:53
— we’re obviously using a lot more deep learning than we ever have done, especially if we’re fine-tuning open-source models. So, in terms of tech stack, we don’t use anything out of the ordinary, but we’re also very much aligned with what clients want, and we’re tech-agnostic in terms of platforms. We use Tableau and Power BI, for example — if the client wants Tableau, we’re not gonna say you have to use Power BI, and vice versa.
Jon Krohn: 01:38:21
That makes sense. Awesome. David, this has been a sensational episode. I have learned so much. It’s been so nice to get deep in the weeds with you and hear so much about your deep generative learning — whoops, Generative Deep Learning — book. Such a fantastic book; I couldn’t recommend it strongly enough. And oh, I can’t believe I didn’t mention this at the beginning. Our listeners who listen to the very end get a bonus treat, which is something we’ve done with O’Reilly authors on the show before: it happens when I post on LinkedIn from my personal account. We’ve had some people posting in YouTube comments, or on posts from the SuperDataScience account — no, this is on my account on LinkedIn, because for this to work fairly it has to be just one place, and that’s the post that gets the most engagement each week when I announce these episodes.
01:39:16
So, when I announce this episode — which will be in the morning, New York time, usually around 8:00 AM Eastern, from my personal account on LinkedIn — the first five people who comment will get a free digital version of David’s book. So, Generative Deep Learning can soon be yours for free. And — I don’t know if I’ve mentioned this enough recently — if you happen not to win that contest, that race, you can still get a 30-day free trial of the O’Reilly platform using my special code SDSPOD23. That’s SDSPOD23. So, either way, you can access the book; you just don’t have it forever with the 30-day free trial. Nice. All right. And then, beyond your own book, David, do you have a recommendation for us?
David Foster: 01:40:16
Yeah, actually, something that has really caught my eye recently is something called Active Inference. It’s a concept that was originally laid down by Karl Friston, one of my absolute heroes in generative modeling. And it’s the idea that-
Jon Krohn: 01:40:30
He wrote the foreword for your book, 
David Foster: 01:40:31
He wrote the foreword for my book, which I’m absolutely privileged and honored to say. So, Active Inference, very briefly, is a way of describing how agents learn that addresses action and perception as two sides of the same coin. It’s a very elegant idea, and at the heart of it is a generative model. I’ll leave that as a dangling carrot for anyone who’s interested, because Friston, along with his associates, has written a book called Active Inference, with the subtitle The Free Energy Principle in Mind, Brain, and Behavior. It’s one of my absolute favorites — a very complex topic explained extremely eloquently. And it’s a very recent book; it was only published last year. It’s basically the book you need on active inference if you’re gonna start learning about this fascinating concept. It’s something I’d recommend you read once you start getting into generative modeling, because it’s a really interesting kind of theory of everything for intelligence and the mind. It puts the action into perception, if you like.
Jon Krohn: 01:41:38
Wow. Very exciting. And so, for people who want to hear more about your brilliant thoughts, David, what are the best ways to follow you after this episode? 
David Foster: 01:41:48
There are a few ways. You can follow me on LinkedIn — that’s probably the best way. You can find me on Twitter; I’m davidADSP. And by all means, follow our company as well. We post loads of interesting stuff about data and AI, so if you’re interested in general updates, feel free to follow Applied Data Science Partners on LinkedIn.
Jon Krohn: 01:42:09
Nice. And you also have a podcast coming out soon, don’t you? 
David Foster: 01:42:12
Yeah, that’s right. We’re launching into the space of podcasts. We can’t pretend we’re gonna be anywhere near your quality of podcast initially, but we’re gonna be learning as we go. The podcast is called the AI Canvas, and it focuses primarily on generative AI and its application to people. So, if you want to know how generative AI is gonna impact different professions — law, teaching, music, the creative arts, the performance arts — we’ve got interviews lined up with people from a ton of different professions, about their fears and also their great hopes for the technology in the future. Because I think it’s really important we talk with everybody across the spectrum — not just those involved on the technical side, but especially those who are gonna be impacted by the technology. We’ve had a few of the conversations already, and it’s blown my mind how eloquently these people are able to talk about the topic. We’ve had some fascinating conversations already. So, do follow us — it’s podcast.adsp.ai.
Jon Krohn: 01:43:18
Nice. All right, David, thank you so much for taking the time today. A brilliant episode, and I look forward to catching up with you again in the future — hopefully on air, so that we can sample your wisdom, which has very little noise in it.
David Foster: 01:43:34
Oh, thank you, Jon. It’s been an absolute pleasure talking to you. I’ve had such fun, so thanks, and talk to you again.
Jon Krohn: 01:43:44
Boom. What a gripping and educational conversation. In today’s episode, David filled us in on how discriminative modeling predicts some specific label from data, while generative modeling does the inverse: it predicts data from a label. He talked about how generative modeling can output text, voices, music, images, video, software, and combinations of all of the above; how autoencoders encode information into a low-dimensional latent space and then decode it back to its full dimensionality; and how variational autoencoders constrain distributions to produce better outputs than the vanilla variety. He talked about how diffusion converts noise into a desired output, while latent diffusion — which operates on dense latent representations — is particularly effective for producing stunning photorealism, such as in Midjourney v5. He talked about how world models, a super cool concept, blend variational autoencoders together with autoregression and deep reinforcement learning to enable agents to anticipate how their actions will impact their environment.
01:44:46
He talked about how transformers facilitate attention over long sequences, enabling them to be the powerful technique behind both natural language understanding models like the BERT architectures and natural language generation models like the GPT architectures. Finally, he talked about how GANs such as StyleGAN-XL still produce state-of-the-art generated images, and how GANs show particular effectiveness when combined with transformers in multimodal generative models. All right. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for David’s social media profiles, as well as my own, at www.superdatascience.com/687. That’s www.superdatascience.com/687. And if you like book recommendations, like the awesome ones we heard in today’s episode, check out the organized, tallied spreadsheet of all the book recs we’ve had in the nearly 700 episodes of this podcast by making your way to www.superdatascience.com/books.
01:45:44
All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another profoundly interesting episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. Please consider supporting this free show by checking out our sponsors’ links, which you can find in the show notes. Finally, thanks, of course, to you for listening all the way to the very end of the show. I hope I can continue to make episodes you enjoy for years to come. Until next time, my friend, keep on rocking it out there, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.