86 minutes
SDS 711: Image, Video and 3D-Model Generation from Natural Language, with Dr. Ajay Jain
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
In this episode, host Jon Krohn explores with his guest Ajay Jain, Co-Founder and CTO of Genmo.ai, how creative general intelligence could take the video industry by storm. They also discuss the models that got Genmo to this point, the applications of NeRF, and how understanding human psychology is so essential to developing models that output high-fidelity video.
Thanks to our Sponsors:
Interested in sponsoring a Super Data Science Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Ajay Jain
Ajay Jain is the co-founder and CTO of Genmo.ai, a San Francisco-based artificial intelligence company developing high-fidelity visual generative models that help people express their creativity. Genmo's tools help people create videos, 3D assets and images. Previously, he completed his PhD in AI at UC Berkeley, where he co-authored some of the most foundational research papers in diffusion models, the neural networks that enabled Stable Diffusion, Midjourney and DALL-E 2. At Google Brain, Ajay also worked on the first models capable of generating 3D assets like meshes from simple descriptions, including a project called "DreamFusion."
Overview
Genmo.ai is a creative general intelligence platform and research lab that builds multimodal models to help users express themselves through high-fidelity videos. Ajay Jain tells Jon that producing creative content is complex, involving several steps, tools and people. Genmo facilitates creative exploration, helping people create visual content however little technical knowledge they have and enabling them to realize their ideas via simple prompts.
Ajay notes that multimodal models are fundamental to developing powerful creative outputs, largely because we as humans interact with and experience the world through multiple sensory touchpoints. It made sense for Ajay and the Genmo team to develop a product that simulates this experience. He highlights one current capability of Genmo’s video synthesis models that can learn and animate geometrically consistent objects, allowing those who may not have the necessary skills to hold the creative reins while the AI takes care of all the technical layers underpinning the final product.
Jon and Ajay also discuss the capabilities of multimodal models, using the example of a model trained on both video and text describing visual scenes better than a model trained on text alone. Ajay says this awareness of synergy across formats (text, image, video, etc.) has helped the team at Genmo create such a powerful model. They consider how far AI-generated video has come, from simple animated snippets that bring parts of a still image to life to Genmo's high-fidelity video synthesis, which sets the current standard for AI-generated video. Ajay also explains that his present research focus is on ensuring consistency across multiple individual clips that can be stitched together into scenes for longer films. His aim is for Genmo's models to condition on previously generated clips so that the result works seamlessly as a whole creative production.
Ajay's work on 3D modelling and object generation grew partly out of his research into neural radiance fields (NeRF). He explains that a NeRF represents a 3D scene in a neural network's weights, essentially overfitting to a given 3D environment so that it can be rendered from multiple perspectives rather than a single one. He also notes how his internship at Uber, where he was responsible for forecasting pedestrian behaviors, showed him the importance of understanding human psychology, which became critical to his research and development of creative general intelligence models.
Listen to this episode to get to grips with the technology behind Genmo, as well as Ajay’s tips on how to save money and energy while training models, how to succeed as an AI entrepreneur, and what Genmo looks for when hiring data scientists.
In this episode you will learn:
- About Genmo.ai and the term “creative general intelligence” [03:47]
- Why Ajay started Genmo.ai [09:26]
- The increased performance of multimodal models [21:12]
- All about Denoising Diffusion Probabilistic Models (DDPMs) [31:03]
- The application of Neural Radiance Fields (NeRF) [55:26]
- Predicting pedestrian behavior at Uber [1:01:50]
- How to save money in the process of training models [1:12:42]
Items mentioned in this podcast:
- Zerve
- Grafbase
- Modelbit
- Genmo.ai
- r/Cinemagraph
- DALL-E 2
- Denoising Diffusion Probabilistic Models
- DDPM GitHub
- Contrastive Code Representation Learning
- Checkmate: Breaking The Memory Wall With Optimal Tensor Rematerialization
- Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis
- DietNeRF GitHub
- SDS 707: Vicuña, Gorilla, Chatbot Arena and Socially Beneficial LLMs, with Prof. Joey Gonzalez
- Probabilistic Machine Learning: An Introduction by Kevin Murphy
- The Code Breaker by Walter Isaacson
Podcast Transcript
Jon Krohn: 00:00:05
This is episode number 711 with Dr. Ajay Jain, Co-Founder at Genmo AI. Today's episode is brought to you by the Zerve data science dev environment, by Grafbase, the unified data layer, and by Modelbit for deploying models in seconds.
00:00:20
Welcome to the Super Data Science podcast, the most listened-to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple.
00:00:51
Welcome back to the Super Data Science podcast. Today we've got the brilliant machine learning researcher and entrepreneur, Dr. Ajay Jain on the show. Ajay is Co-Founder of Genmo.ai, a platform for using natural language to generate stunning state-of-the-art images, videos, and 3D models. Prior to Genmo, he worked as a researcher on the Google Brain team in California, in the Uber Advanced Technologies group in Toronto, and on the Applied Machine Learning team at Facebook. He holds a degree in computer science and engineering from MIT and did his PhD within the world-class Berkeley A.I. Research (BAIR) lab, where he specialized in deep generative models. He's published highly influential papers at all of the most prestigious machine learning conferences, including NeurIPS, ICML, and CVPR. Today's episode is on the technical side, so it'll likely appeal primarily to hands-on practitioners, but we did our best to explain concepts so that anyone who'd like to understand the state-of-the-art in image, video, and 3D model generation can get up to speed.
00:01:50
In this episode, Ajay details how the creative general intelligence he's developing will allow humans to express anything in natural language and get it. He also talks about how feature-length films could be created today using generative AI alone, how the stable diffusion approach to text-to-image generation differs from the generative adversarial network approach, how neural nets can represent all the aspects of a visual scene so that the scene can be rendered as desired from any perspective, why a self-driving vehicle forecasting pedestrian behavior requires similar modeling capabilities to text-to-video generation, and what he looks for in the engineers and researchers that he hires. All right, you ready for this horizon-expanding episode? Let's go.
00:02:38
Ajay, welcome to the Super Data Science podcast. Delighted to have you here. Where are you calling in from today?
Ajay Jain: 00:02:43
I'm calling in from San Francisco, California.
Jon Krohn: 00:02:46
Nice. A popular choice for our AI entrepreneurs for sure, and particularly unsurprising, given that you were suggested to me by an amazing recent guest, Joey Gonzalez in episode 707. Wow, what an episode. That guy knows an unbelievable amount and acts so quickly, talks so quickly and in such detail. Yeah, really A+ episode and we're grateful to him for suggesting you as a guest as well.
Ajay Jain: 00:03:17
Joey's incredible. He and I have worked on some projects together while I was at Berkeley.
Jon Krohn: 00:03:22
And so we'll get into some of that Berkeley AI research shortly. But let's start off with what you're doing today at Genmo. So, you've co-founded a generative AI startup, Genmo.ai, and in that company, you provide an approach to creative general intelligence, which isn't a term that I have used before, I don't think so. Can you explain the vision behind Genmo and its role in advancing this creative general intelligence term?
Ajay Jain: 00:03:53
Yeah, absolutely. Happy to talk about that. So, Genmo is a startup I've been working on since December. We wrote the first line of code on Christmas and shipped it right after my PhD. Genmo is a platform and research lab where we build the best visual generative models across different modalities, whether that's imagery, video, or 3D synthesis. And the goal of Genmo is to allow anybody to express themselves and create anything. Now that's a huge, huge lofty vision. How are we going to break it down? How are we going to get there? We see creativity and creative content production as a very detailed, actually really reasoning deep process, which involves many steps, many tools, many people working together. It also involves a ton of creativity and inspiration.
00:04:44
And so we build tools and models that kind of get out of people's way and allow them to express themselves in an intuitive interface. And so what I mean by that is that we try to make the software not the bottleneck in your content production. So, let's say you want to make a feature-length film or you want to make a movie, we want to produce tools that will allow you to produce clips in a controllable fashion. We also make software for 3D synthesis, so if you want to get down into the weeds and get a little bit more control in your pipeline, actually manipulate the underlying assets going into the product, whether it's a game or a video, you can get that out of the platform too. And so Genmo is kind of, it's an all-in-one place where you can create all this visual content. We are also primarily in a research and development phase and doing a lot of work at the cutting edge of visual generative media.
Jon Krohn: 00:05:32
Very cool. So, yeah, you gave an example there that you'd like to be able to create, say, film clips, probably in the medium term. And I've got some questions for you about kind of big-picture things and feature-length films later, but just to give our listeners a sense of what this creative general intelligence means, I guess. So, the idea is that, as you expressed, people should be able to ask for anything and you just provide it. I know that a lot of your initial work is with visual things, particularly 3D renderings, but do you envision that eventually the creative general intelligence that you're working towards would be able to create other kinds of modalities as well? Maybe there would be natural language stuff or audio stuff?
Ajay Jain: 00:06:20
Yeah, yeah, absolutely. So, let me break down what we mean by creative general intelligence a little bit. We see a really big opportunity in creating software and AI that can understand any type of media, whether it's visual imagery, video, audio, and text. The reason this AI needs to understand and consume that media is that then it can understand the user's intent, it can understand the visual world that we as humans live in and kind of level up there. So, I think a lot of the AI models that are available today, like ChatGPT, are a little bit deprived of sensory inputs. They're text-to-text models. They can observe this really signal-dense, meaning-rich textual format and produce it as an output. But we people are multimodal creatures, right? Our sense of touch, first of all, is underappreciated and incredibly dense, then there's sight and sound, taste, smell.
00:07:24
At Genmo we focus a little bit more on the visual, the sight side, for consumption because there's a lot we can contribute there. Long-term, these models should be able to understand many different modalities. And I think one of the interesting things is that we know how to build AI that can be very general purpose. So, with similar substrate, with similar architectures, similar learning algorithms, and similar pipelines, we can build software that can, once it's learned one modality, expand into another one and start to consume data from there and learn how to reason over it. That's on the input side. So, one part of creative general intelligence is building artificial intelligence that can take in any modality and understand it correctly. The next piece is to be able to cooperate with the user in order to create visual content, and that's where the creative element comes in here. So, you've already seen that language models have a lot of capability and creativity. They can write poems, and stories, and jokes, they can even understand them as well. A lot of the work we do is around really high-fidelity visual creation and I'm sure we'll get into that.
Jon Krohn: 00:08:26
Yeah, exactly, yeah. That fidelity is key to it. And that's something there's been an enormous explosion in over the past year. It was in spring of last year, the northern hemisphere's spring, that I was giving a TED-format talk, and we'd had a few OpenAI guests at that time. And so I was able to get them to create some images using DALL·E 2. So, there wasn't public access yet, but I could provide them with a prompt, they would create an image and I would put it in my talk. And I had them pretty small on my slides and some of them looked pretty compelling at that small size. But now just a little over a year later, we have stunning high-resolution photorealistic images thanks to the kinds of approaches that you've helped bring forward. So, we'll get into that in a moment as well when we're talking about your research, this high-fidelity stuff. But yeah, let's kind of focus on Genmo a little bit more right now. What caused you, given the tremendous research experience that you have, the tremendously impactful papers that you've had, what was it on Christmas that caused you to say, now's the time for me to be jumping in and creating my own AI startup?
Ajay Jain: 00:09:44
Absolutely. So, I've wanted to get products out to the public for a long time. One of my original motivations for working on AI was to kind of enable new products that weren't possible before where really the model enables a new kind of interface, a new capability. And the work we were doing in creative visual models felt extremely ripe for distribution. And actually, so we got started because it felt like it was taking too long for technology to get out of the lab and into the industry. The second reason was in order to integrate across different modalities more effectively than we could do in an academic setting. And so in terms of that first bit about it takes too long to get technology out. While it seems like generative AI is this very recent half year, one year revolution, taking the world by storm, a lot of these advances have been developing for many years, decades really.
00:10:45
And what I was observing in academia was that it would sometimes take six months to three years before advances we made in the lab started to get into the hands of creators, where they could start actually using these models in their pipelines. At Genmo, we want to tightly integrate the model development in the research lab with product, enabling people to access that very quickly, in a matter of weeks. So, in terms of the Christmas timeline specifically, I mean, what better time to get to hack and to build a product? So, we launched with a video generation product that we had queued up, but there was a ton of engineering to do to actually scale up the systems.
Jon Krohn: 00:11:30
Yeah, and then so what are the kinds of applications for this? So, with Genmo, you're able to create, and people can, our listeners can go to Genmo.ai right now and try it out, right?
Ajay Jain: 00:11:38
Yes, absolutely. You can go to Genmo.ai, sign up for a free account and start creating images, videos, and 3D assets right away.
Jon Krohn: 00:11:46
And so what kinds of people would want to do this? Creatives, I guess. I mean, I guess it would be anybody that wants to be creating images. So, in any kind of scenario that you could be using Midjourney, you could equally be using Genmo, but then Genmo is also going to be useful tapping into that general aspect of this creative intelligence. It isn't just text-to-2D image, it's text-to-3D image. So, I guess it's probably relatively easy for people to imagine the kinds of applications, like for us on the podcast, we sometimes want to have a YouTube thumbnail for an episode topic. Like, we recently had one at the time of recording on Llama 2, and so we used Midjourney to create this llama that's at a computer. And so that kind of thing is fun. But yeah, you already talked about video creation as well, which is something that is just starting.
00:12:45
So, in the same way that I was describing 18 months ago when DALL·E 2 came out, it was a little bit cartoony. It was in most cases obvious that this was generated, particularly if you looked at it in a higher resolution and now 18 months later everything's photorealistic. How, I guess, how much further are we away from that being the case with video where it goes from being a little bit cartoony, a little bit obvious, to being really slick, and then yeah, so that's one question, but then beyond that, the kind of the 3D renderings that you're doing, who is that useful for?
Ajay Jain: 00:13:28
Yeah, absolutely. It's been amazing how quickly we've been able to expand the quality in the visual space. And a lot of the methods haven't actually changed to be honest. It's the stuff that data scientists are used to doing, data processing, data cleaning, really careful analysis of metrics and the iteration loop that every ML engineer is familiar with. And in terms of the integration of video and 3D, I actually see this as very naturally coupled. We as humans take in raw visual sensory media and then we are the ones who do the decomposition into different assets. Like a mesh is not a natural thing. A mesh is a common graphics representation of a 3D asset. It's a representation we use in the computer to store an object. But at the end of the day, that representation is used in different media forms, whether you're using it for computer-generated imagery in a movie like Dune, whether you're using it in a game, whether you're using it for industrial design like in SOLIDWORKS. So, we all have these different representations of the physical world that humans have created in order to be able to manipulate them and to be able to efficiently render them into compelling visual media. A video is in some sense the rawest form of visual experience. It's easy to capture. We constantly consume it and it's kind of like a rendering of the real world, if you will.
00:14:51
And so there are many approaches that we could take to directly generate video and we work on a lot of those at Genmo and that will advance very quickly over the next few months. But there's also along that pathway to high fidelity video generation, there's opportunities to give people a bunch of value by synthesizing out these interpretable assets that they can inject into their pipelines.
Jon Krohn: 00:15:47
Tired of hearing about citizen data scientists? Fed up with enterprise salespeople peddling overpriced, mouse-driven data science tools? Have you ever wanted an application that lets you explore your data with code without having to switch environments from notebooks to IDEs? Want to use the cloud without having to become a Dev/Ops engineer? Now you can have it all with Zerve, the world’s first data science development environment that was designed for coders to do visual, interactive data analysis AND produce production stable code. Start building in Zerve for free at zerve.ai. That’s z-e-r-v-e dot a-i.
00:15:51
That makes a lot of sense. So, basically the kind of idea here is probably most people have seen this kind of thing where yeah, you have these meshes so you can imagine some shape like a vase. And that vase has this particular, a vase is actually kind of simple because often it's going to be, you could imagine maybe like a pitcher is a better example. Because the pitcher has a handle maybe just on one side, and so you could use Genmo to, you know, say, create me a vase with one handle or with two handles and then it can create this mesh and I kind of imagine it, I don't know why, as black and green, green lines around this black object. And so it gives you this sense even though you're looking at it in 2D, if it's rotating, it becomes very clear that this is a 3D shape and you can see okay, when it spins around that one handle comes out and pops out at me and then it goes back around to the left and then, oh, here it is on the left, here it's on the right. It's just like it spins around. So, you can create these 3D assets like the vase and then a production company can use that in their pipeline for video production where you have this, the Dune example you gave, you know, tons of films today, especially action films, have these 3D renderings. So, yeah, so it can be there, characters can walk around it and it will have this sense of being quite real because it's been 3D generated. It's not this video asset, it's this 3D object that is downstream converted into a video asset.
Ajay Jain: 00:17:30
Yeah, absolutely. And it's one of the things that we face a lot in machine learning is choosing the right representation to express the world. And so when you choose these representations, a lot of it's due to what's easy to process and what's easy to store and learn over. I tried to take a tack during my PhD, and also at Genmo to not just build the representations that are convenient for the modeler but also the representations that people actually want. And so oftentimes for many people, they want these assets that they can import into their existing software and manipulate, even if we're still early in generative AI.
Jon Krohn: 00:18:07
And then what you were saying is that these kinds of 3D structures make it easier to create photorealistic video, video-realistic video, because you could have a collection of these objects and then they are combined together to create a particular kind of scene, and I can see why that's like a really nice pipeline, but it sounds like you are also working on skipping that intermediate step and eventually I guess maybe through [inaudible 00:18:41], I guess the model will kind of have these latent representations of those just inherently, where you don't need to specifically dictate, you know, that you want this vase of this particular look to be in this particular part of the scene and do that separately; you can ask in natural language for a scene with a vase in the corner, whatever, next to the couch, and all of those objects, they don't need to be rendered as this discrete 3D step because somehow the latent representation of the model just handles that seamlessly.
Ajay Jain: 00:19:14
So, absolutely we're going to be moving towards a future in which as the model grows, as the model improves in its capabilities, it's able to learn these representations of the world itself. And so we can already see this with some of the video models that Genmo's working on. We have some things coming out which should be out by the time this podcast airs where these video synthesis models will learn objects which are geometrically consistent. What I mean by that is it's synthesizing pixels directly, pixels straight out of a stack of frames, but the objects observed in those pixels appears like it's the same 3D consistent object across the course of that video. So, imagine it's a person rotating. As they rotate you would expect certain physical properties to be preserved such as conservation of mass, right? The amount of stuff in the frame should be about the same.
00:20:07
If you had taken the approach today of an interpretable pipeline, which synthesizes the 3D assets, you import that 3D character asset made on Genmo into Blender, rig them, animate them, you get that conservation of mass for free. They obey those physical properties that we expect because they've been built into the pipeline. And this is extremely useful because it allows you as a creator to directly take the reins. The AI is just there to help you in steps of the process. As we build models, there'll be people who don't know how to take that asset, that mesh, and take it into Blender. They don't know how to rig it and animate it, but they still want to express themselves. For them, we're building these end-to-end trained video models: text-in or images-in, video frames-out. Today you sacrifice a bit of control by doing that, but it's a lot easier to use and over time will become the highest fidelity option.
Jon Krohn: 00:21:00
Very cool. Yeah, that's exactly what I was imagining. Thank you for articulating it so much better than I could. And kind of a tangential question that came to me as I was thinking about this is there have been examples recently in literature of multimodal models being able to do better in one modality because they have information from another modality. So, an example of this would be a model that is trained on natural language data as well as visual data might be able to describe a visual scene better than one that doesn't have any of that visual data training. And so yeah, maybe talk a little bit about that and how that kind of feeds into this idea of creative general intelligence.
Ajay Jain: 00:21:46
Absolutely. I think this is the trend we see across machine learning as an industry, not even specific to visual. That let's let the model learn and let's let the model learn its internal structure from data and let's train on as much data as possible that is diverse. In particular for visual modalities, there's a lot of synergy across different formats. So, training a video model is extremely expensive. By exploiting image data, we've been able to improve the efficiency of training and sampling those models quite a lot, making it easy for us to offer it for free to the public to get started. Integrating text data is extremely useful. So, it's one mode of control, which is a big reason to use text. Your users like to communicate in natural language. Another reason is it actually helps the model learn the structure of the visual world because it's being described in the training data and categorized in all these different categories for you in that textual input.
00:22:51
So, I see a future in building towards creative general intelligence at Genmo where we are building a research lab that is working across all these different modalities. A lot of people ask us, why don't you focus on one? Why don't you just focus on 3D synthesis? Why don't you just focus on video generation? And I tell them we are, we are building it in the way we think will be the most scalable long term, which is building models which can actually understand and synthesize the full 4D world. Full 4D meaning 3D and time with natural language controls in addition to visual controls for when people want to get really down into the weeds.
Jon Krohn: 00:23:29
Very cool. Yeah, that makes perfect sense and is very exciting. So, going back to a question that we touched on earlier where we kind of touched on this idea, it seems like we're now not far off, thanks to technologies like you're developing at Genmo, from having consistency for video clips. I don't know, maybe you can give us a little bit of insight into, is it a few seconds, can we stretch? I don't know, what kind of time horizon can we get to now where we have consistent video and how realistic do you think it is that maybe in our lifetimes we'll be able to have feature-length films with dialogue, sound effects, music, all of that created at the click of a button with a natural language description?
Ajay Jain: 00:24:18
So, it's happening today. So, a few months ago when we got started, we had what I would call a flicker gram, something that it's more of a flickery style video, trippy and psychedelic and people love this for a lot of effects, perfect for water, gas, cloth rippling or hair flying. What we exposed as a product was that you would upload a picture, you'd select a region of the picture and you'd paint it in as a video. So, you'd select a region of your personal photo or your AI-generated photo and then animate that particular region of the photo and then people could do things like add butterflies or make the water ripple, make the waterfall flow. I think there's actually a whole subreddit called r/cinemagraphs that does things like this, these mostly still photos where parts of it are animating. And this got extremely popular and so this is one very low-level effect, subtle animation. We had some knobs where you could dial it up, dial up the chaos, but you lose coherence.
00:25:19
A couple of months after that we released a model for a coherent text-to-video generation where you put in a caption, you get a coherent clip out, but it would only be two seconds long, low frame rate, think about eight frames per second. And so this is interesting, but it wasn't yet at the quality that's actually really truly useful for people. So, they in fact preferred our first animation style, which could do subtle effects, but it could do it really, really well. We're continuing to improve that technology. This is really just a quality problem where people want coherent motion but they also want high fidelity and that means high resolution, high frame rate. The next model that is coming soon from Genmo does 24 frames per second or more really buttery smooth video, high resolution, full, higher than 1280p video synthesis. It's still short clips, so about four or five seconds, but we're moving that bar up over time. And really the bar is extremely high. Reality is a high bar to meet. In this industry, one of the things that I love, love about working on generative models is that the world just sets a really high standard for how high, how good we can get.
Jon Krohn: 00:26:30
Yeah, very cool then so what do you think about this idea of in our lifetimes having feature-length films that you could say you could just type in, I want a podcast episode of Jon Krohn and Ajay Jain, it's got to be 75 minutes long and they're going to talk about creative general intelligence, render it please.
Ajay Jain: 00:26:56
With a human in the loop we can get there, you know, arguably today, and probably in a couple of months much better. Because I was watching this TV show Dark, I don't know if you've heard of it, it's a slightly dystopian time travel-oriented TV show with just really beautiful cinematography. And I was just kind of carefully observing some of these cinematographic effects and I realized that almost all of the clips in each of these TV episodes are only a few seconds, max 10, maybe max, max 30 seconds in a single cut, and then all these little clips are stitched together, and that actually has benefits. It allows the director to have some control over the story, allows you to zoom in on aspects that are the focus of what's happening in the narrative.
00:27:48
30 seconds, it's quite conceivable that Genmo is going to get there, and there's another problem that's probably on your mind, which is, okay, yes, each of these clips is 10 to 30 seconds, but they're all of the same scene. The content needs to be consistent across that film. That is a little bit of a problem for these visual generative models today. It's also part of the reason why we go for this integrated, general type of architecture, because we can imagine having a model, our same model that produced a video clip, not just condition on the text that you've typed in, "I want a TV show about time travel," but also condition on the previous clips. And so at test time, just in the forward pass, this model can preserve the identities of the subjects. They know that it's Jon, they know that it's me, Ajay, in the past clip, so it should produce the same identity.
00:28:43
They know the style should be consistent and so on. I would say we're not there yet. It would take a lot of manual work and handholding by the user to produce that TV show. I think that in terms of using these tools to produce clips that a human puts together with their creative vision, that can happen by the end of the year. Now, I think that's an ambitious timeline, but I think it's going to get to the point where we can start to have YouTube-level quality. As for full text-in to a full movie, it's a little tough to say. I think it's actually something that I leave more to the users, to build those kinds of pipelines after we expose the individual tools.
Jon Krohn: 00:30:05
This episode is brought to you by Grafbase. Grafbase is the easiest way to unify, extend and cache all your data sources via a single GraphQL API deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that but the Grafbase command-line interface lets you build locally, and when deployed, each Git branch automatically creates a preview deployment API for easy testing and collaboration. That sure sounds great to me. Check Grafbase out yourself by signing up for a free account at grafbase.com, that's g-r-a-f-b-a-s-e dot com.
00:30:09
Yeah, that's obviously a much bigger stretch. There's that consistency over very long time windows and creating a good story and all the subparts, but I don't know, it's like if you'd asked me that a year ago, I'd say maybe it's not possible in our lifetime. Today you ask me that, I'm like, that's going to happen. I don't know how long it's going to take, but 10 years seems like a long time in AI terms now. So, yeah, very, very cool. So, a lot of your work today is inspired by research that you've done in the past. You're prolific for how young your career is and how young you are. Things like your h-index, so like this marker of how many citations you have across how many papers you have, is very high. And so perhaps the most well-known of all of your papers so far is on Denoising Diffusion Probabilistic Models, DDPM. That came out in 2020 and this paper laid the foundation for modern diffusion models including DALL·E 2 and the first text-to-3D models, like the kinds of things that you've been talking about so far at Genmo.
00:31:21
So, can you explain DDPM, these Denoising Diffusion Probabilistic Models, and how they relate to other kinds of approaches like Generative Adversarial Networks, GANs, which were the way that people were primarily generating images and videos until recently, and perhaps, I realize I'm asking lots of questions here, but perhaps most importantly, tying all of this into stable diffusion, which many of our listeners are probably aware of, and Midjourney, that's the approach that they use, and how it relates to your own work.
Ajay Jain: 00:31:59
Yeah, I'm happy to talk about the DDPM project. Let me give a little bit of context to lay the landscape. So, this was a paper that I worked on in 2019 and 2020 early in my PhD at Berkeley. I did my PhD with Pieter Abbeel, who I believe has been on this podcast before.
Jon Krohn: 00:32:19
He has indeed, yeah.
Ajay Jain: 00:32:21
And I was very excited about the promise of generative models, but it felt like we had an incomplete picture, an incomplete set of tools, where it would take a lot of handholding, a lot of really careful tuning and dataset curation, to even get something to barely work in visual media. Let me give an anecdote. So, there was a project I was working on for inpainting, and the idea here was taking models like GPT that generate one word at a time and instead training them to generate one pixel at a time in an image. If you did that kind of thing, these models would only work at low resolution, and secondly, they wouldn't be able to do edits; they wouldn't be able to do something like replace some object in an image. Let's say you have that vase with a handle on a table and you want to remove it, clean up the background.
00:33:11
So, even if we could train these visual models to generate one pixel at a time, they would work at, let's say, 32x32 pixels, little tiny, tiny images, but we wouldn't be able to use them for manipulations. So, I did a project in that space where we tried to scale it up and we tried to make them more flexible at editing. We succeeded to some extent at both of those, but there seemed to be a hard limit where, to get to the interesting levels of resolution like 256x256 pixel images, still pretty low resolution but you can make out what's in the image, the model would take many minutes to sample an image. It's going one pixel at a time. You can kind of imagine as you scale up the resolution, you're quadratically expanding the amount of time it takes to generate that image. So, compute became a problem. The second problem with using GPT-style models to generate images is that they would kind of become unstable and stop generating coherent content somewhere along the line. So, if you have [inaudible 00:34:10] like thousands, tens of thousands, hundreds of thousands of pixels, the errors would accumulate where you would s*** up one pixel, the model would sample something wrong, it's supposed to sample a blue pixel, but all of a sudden it samples a white pixel.
00:34:24
That error would start to propagate and the model would soon start losing the capability to understand what was going on and produce more of a blurry mess. In addressing that computational challenge of tens of thousands of pixels, we asked the very obvious question of can't we just generate multiple pixels simultaneously? GANs, Generative Adversarial Networks were able to do that. We started to work on Markov chain Monte Carlo or MCMC samplers that could in parallel synthesize all of these pixels. That project started to evolve over time into the Denoising Diffusion Probabilistic Models paper.
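As a concrete illustration of the sequential bottleneck Ajay describes, here is a minimal sketch in PyTorch, purely illustrative and not his actual model, of raster-scan, one-pixel-at-a-time sampling; `model` is a hypothetical autoregressive network assumed to return categorical logits of shape (1, C, 256, H, W) over 8-bit pixel values:

```python
# A minimal, hypothetical sketch of GPT-style pixel-by-pixel sampling.
# Every pixel requires its own forward pass, so sampling time grows with
# the total pixel count: 32x32 -> 1,024 steps, 256x256 -> 65,536 steps.
import torch

def sample_autoregressive(model, height, width, channels=3):
    img = torch.zeros(1, channels, height, width)
    for y in range(height):                 # raster-scan order
        for x in range(width):
            logits = model(img)             # full forward pass for pixel (y, x)
            probs = torch.softmax(logits[0, :, :, y, x], dim=-1)   # (C, 256)
            img[0, :, y, x] = torch.multinomial(probs, 1).squeeze(-1) / 255.0
    return img
```

A single wrong sample early in that loop feeds into every later step, which is the error accumulation Ajay mentions.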
Jon Krohn: 00:34:59
All right, nice. So, yeah, so you're tackling this problem. You're realizing that maybe by predicting multiple pixels at once, you're going to get better results than single pixels at a time. You realized that Markov chain Monte Carlo, MCMC, sampling could be effective for that. Yeah. So, how does that then go on? How does that lead to this kind of stable diffusion approach?
Ajay Jain: 00:35:18
Yeah, so once you start to sample multiple pixels at a time, the landscape is actually wide open for the architectures you can use. The GPT-style models, these ones that generate one token, one pixel at a time, have some constraints in them. They're only allowed to look at the past, because if the model could look at the future, the rest of the pixels, the rest of the sentence, it'd be able to cheat trivially, just repeat its input. And so you have to constrain the architecture. With DDPM, we realized that we could train a model which can see all the pixels simultaneously, but now there's another problem: how do we actually synthesize an image? So, with our [inaudible 00:36:00] model, it's clear you generate one pixel at a time. You look at the past ones, you generate the next one. In DDPM, we have a slightly different architecture where you start with pure noise.
00:36:08
So, the image is all noise, and the question we ask is, if we learn a denoising autoencoder, an autoencoder that can look at this noisy image and strip out a little bit of noise, people use this for video processing and fine-grained noise removal, is that denoising autoencoder actually capable of generating an image, not just removing a little bit of noise? Turns out the answer is yes. We found this, my collaborator, Jonathan Ho, found this project from Jascha Sohl-Dickstein from 2015 called the Diffusion Probabilistic Models paper that builds a Markov chain that maps from Gaussian noise and iteratively de-noises it in order to produce a clean sample.
00:36:51
There were a lot of architectural limitations in that diffusion probabilistic models paper, that early work from 2015. It didn't produce very high-quality samples. So, in the DDPM project, we ended up making a lot of improvements to that framework, greatly simplifying it, changing the way we parametrize the model, and coming up with a new neural network architecture that worked a lot better. That neural network architecture is actually extremely similar to the architecture that stable diffusion uses. One of the key things we did was re-weighting the loss function. And so, not to get into too much technical detail, but essentially when you do this denoising process, it turns out that most of the things people care about are high-level structure. What is the object? There's an object kind of in the middle of the image. As I'm looking at you right now, Jon, on the screen, there's a person in the middle of the image, there's a guitar in the background. These are high-level semantic concepts. Those are most salient to people.
00:37:49
And so part of what we did was re-weight our loss so that it would focus the model on learning these high-level things: denoising really high noise levels, so images that have extremely large amounts of noise in them, and learning how to remove that noise. You might ask, removing noise from an image, how does that allow you to synthesize an image? The answer is that at sufficiently high noise levels, you actually need to fill in the missing content in order to remove noise. So, let's say you have an image that is so noisy, you can barely make out what it is. You can kind of make out that there's a circle in the middle of the image. If you're able to actually strip out that noise, you're forced to learn what that circle might actually be, that it might be a face. So, this formed the foundation for DDPM, this insight that learning how to remove noise from an image could allow us to synthesize an image. A lot of that project was improving the architecture to make it actually happen, to allow the model to learn this. Because in order to learn to de-noise, you need to learn everything about what an image can look like, and that's hard to pack into a neural network.
00:38:51
One of the interesting things that happened with that project is that it turned out to be a very stable loss function in the end. All you do is you take an image during training, you add noise, and you learn how to remove that noise.
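For readers who want to see that training recipe in code, here is a minimal sketch assuming a hypothetical noise-prediction network `model(x_t, t)`, for example a U-Net; it follows the simplified objective described in the DDPM paper, but the names and schedule values here are illustrative:

```python
# Minimal sketch of the simplified DDPM training objective: corrupt a clean
# image with Gaussian noise at a random timestep, then train the network to
# predict the noise that was added.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # illustrative noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal fraction

def ddpm_loss(model, x0):
    """x0: batch of clean images, shape (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)          # random timestep
    eps = torch.randn_like(x0)                                # target noise
    ab = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps            # noisy image
    eps_hat = model(x_t, t)                                   # predict the noise
    # The "simplified" re-weighted loss: a plain MSE on the noise prediction,
    # which shifts emphasis toward the harder, high-noise denoising steps
    # that carry the global structure Ajay describes.
    return F.mse_loss(eps_hat, eps)
```

Sampling then runs this in reverse: start from pure Gaussian noise and repeatedly apply the trained denoiser to strip away a little noise at each step until a clean image remains.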
Jon Krohn: 00:39:01
Nice. Yeah, this makes a huge amount of sense. This idea of iteratively removing noise while using a cost function that prioritizes the kind of salient objects in the image. And does this kind of stability, I'm wondering, this is generally, this is something that I know nothing about. I'm mostly working in the natural language world, so my machine vision stuff is relatively weak. Does this kind of stability help reduce hallucinations or is that something that is a big issue in machine vision?
Ajay Jain: 00:39:38
Yeah, so I think hallucinations are definitely a problem in visual synthesis. They take a slightly different form. So, text and natural language is really dense on meaning. There are stylistic things: that it's coherent English, that it abides by the grammar. That's kind of how you might think about the low-level details visually. Objects aren't at the very ends of the color spectrum, super saturated. They change smoothly, images change smoothly. There are sharp gradients around object boundaries, and those gradients are smooth, like a curve or a line. These are low-level things akin to the grammar of language. The higher-level semantic concepts are what we start to notice when we talk about hallucinations. The model is saying things that are coherent English, it's perfect text, but it's incorrect and it's just made up, right? Well, with a visual generative model, we have exactly the same problem, and oftentimes it's actually a boon. A hallucination is a good thing because it allows people to create stuff that has never existed and never could exist in the real world. So, fantastical combinations of two characters. We've seen people do mashups of animals and vehicles, so like a furry car or a bat swimming or some things like that. People with their hair on fire to create incredible artistic effects.
00:41:07
And so these are hallucinations in some sense that they don't exist, but they are what people want a lot of the time. So, we need to build general-purpose models that are able to generalize to those different semantic concepts. Another form of what might be called hallucination is that the model is getting some of these semantic ideas wrong, getting some of that grammar wrong. So, let's say it generates, one thing we saw when we scaled up to high-resolution faces is that the model would generate an eye with one color like blue and another eye with a different color like green. And so this can happen in reality, but in the vast majority of the training data, people have the same color eyes, and so the model's actually under-fitting that distribution and hallucinating something that shouldn't be correct.
Jon Krohn: 00:41:54
Very cool. I like that analogy between the grammar of natural language and the details of visual imagery. That makes a lot of sense to me at an intuitive level. Awesome. Yeah, all the stable diffusion stuff, obviously making a huge impact in your own work as well as in these kinds of popular tools like Midjourney, and it's wild to see how this is going to continue to get refined with the kinds of things that you're doing. These 3D representations, longer video, really exciting, the area that you're working in. It must be really satisfying for you to kind of have a new architecture, and then you and your team probably hit your heads on the wall for weeks or months at a time, sometimes, trying to get something to work, and then you finally crack it and you're like, wow, you get this stunning, this whole new level of visual capability. You're really working in an exciting nexus of AI.
Ajay Jain: 00:42:50
Yeah, it's been fascinating. And seeing this explosion of creativity that's been enabled by these models has been extremely satisfying. I'm very happy about it.
Jon Krohn: 00:42:58
Yeah, for sure.
Ajay Jain: 00:42:59
I think, let me give one anecdote about this. So, once we had identified some of these architectural building blocks that were really scalable, things were actually very easy to extend and to build in new settings. So, here's an anecdote for that DDPM project. We built a model that could synthesize faces on CelebA, a celebrity face dataset. It generalized to more in-the-wild faces out of the box. These are things that GANs could do though. GANs like StyleGAN were able to synthesize high-fidelity faces, even higher fidelity than we were doing. However, what they couldn't do is be transferred out of the box to a new setting. They needed months of engineer time tuning the hyperparameters, calibrating it. They were extremely unstable. What we did is we took that same model, same architecture, same loss, same hyperparameters, and just swapped the dataset from 30,000 face images to 1.5 million images of different settings. Cats was one of the datasets. Another dataset was churches, and another really interesting one was a million-image dataset of bedrooms, LSUN bedrooms. These are photos of the interior of people's bedrooms.
Jon Krohn: 00:44:44
Deploying machine learning models into production doesn’t need to require hours of engineering effort or complex home-grown solutions. In fact, data scientists may now not need engineering help at all! With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy() in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com, that’s M-O-D-E-L-B-I-T.com
00:44:47
I've seen that even from years ago with relatively early GANs, in the first year or two of GANs. I remember that dataset as being a really mind-blowing moment for me because it showed you could move through the latent space of the GAN that was creating these bedrooms and it would slowly shift the perspective on the bedroom or slowly change the color of a bedroom wall. Yeah, and so that was actually one of the first, that dataset was really important for me in realizing how crazy things were becoming in AI.
Ajay Jain: 00:45:16
That's awesome. Yes, these latent space interpolations are so fascinating. They reveal some of the interior structure of the model and I think it's also very humanistic because an interpolation is the first step to a video in some sense. A video is a very particular kind of interpolation if you want to look at it like that.
Jon Krohn: 00:45:35
It's like an interpolation over time.
Ajay Jain: 00:45:37
Yeah, interpolation over time where each frame is similar to the last one, but they change in a particular structured way. And so those bedroom interpolations that you probably looked at, they don't change in these physically preserving ways. The conservation of mass is definitely not there. But one of the interesting things about the LSUN dataset is that it started to generate photos of art on the walls. And so we generate these pictures of someone's bedroom and people have art on their bedroom walls that they purchase online and hang up. And so you could actually see in the model's samples, there would be little pieces of artwork just hanging on the wall of someone's bedroom sampled by the model, and this bedroom doesn't exist, that art doesn't exist, but it would be there in the image nonetheless. And the second thing that's really interesting is that that just worked out of the box, right? Same hyperparameters. It worked well enough. I did a little extra tuning. I doubled the model parameter count after we submitted the paper. So, then in our rebuttal, when the reviewers came back with their critical feedback, we could say, oh, actually we are now better than the GANs. We forgot to include that experiment.
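The latent-space walks Jon and Ajay describe can be sketched in a few lines, assuming a hypothetical trained generator `G` that maps a latent vector to an image; everything here is illustrative:

```python
# Toy sketch of latent-space interpolation: walk between two latent codes and
# decode each point, so the bedroom (wall color, camera angle, ...) morphs
# smoothly from one frame to the next, like a short structured "video".
import torch

def interpolate_latents(G, z_start, z_end, steps=16):
    frames = []
    for i in range(steps):
        alpha = i / (steps - 1)
        z = (1.0 - alpha) * z_start + alpha * z_end   # linear interpolation
        frames.append(G(z))                           # one decoded image per step
    return torch.stack(frames)
```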
Jon Krohn: 00:46:46
Nice. Yeah, very cool. Yeah, that makes a lot of sense. That anecdote is super helpful for me being able to understand how this kind of technology allows you to move more quickly than folks who are working on GANs, where the hyperparameters are really tricky. Because, really quickly for our listeners, when you have a generative adversarial network, the adversaries are two different neural networks. One of them is this discriminator that's judging the quality of the work: you have your real image dataset and then you have this generative neural network that creates fake images, and that discriminator is tasked with trying to figure out which ones are the real images and which ones are the fake ones. And then you backpropagate through both of them together, where you only change the weights in the generator, so that your generator starts to be able to figure out what kinds of images it needs to create to dupe the discriminator into thinking that they're real. And getting the hyperparameters right, on the learning rates of that discriminator network versus the generator network, is really tricky.
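Here is a minimal sketch of the training step Jon is describing, with hypothetical generator `G` and discriminator `D` networks (D is assumed to output one logit per image); the names, shapes, and learning rates are illustrative rather than taken from any particular paper:

```python
# One alternating GAN training step: update D to separate real from fake,
# then update G (through D's gradients) to fool D, without touching D's weights.
import torch
import torch.nn.functional as F

def gan_training_step(G, D, real_images, opt_G, opt_D, latent_dim=128):
    b, device = real_images.shape[0], real_images.device
    ones = torch.ones(b, 1, device=device)
    zeros = torch.zeros(b, 1, device=device)

    # 1) Discriminator update: learn to tell real images from generated fakes.
    fake = G(torch.randn(b, latent_dim, device=device)).detach()  # freeze G here
    d_loss = (F.binary_cross_entropy_with_logits(D(real_images), ones)
              + F.binary_cross_entropy_with_logits(D(fake), zeros))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # 2) Generator update: gradients flow back through D, but only G's weights
    #    change, nudging G toward images D scores as "real".
    g_loss = F.binary_cross_entropy_with_logits(
        D(G(torch.randn(b, latent_dim, device=device))), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()

# The finicky part Jon mentions: the two optimizers are tuned separately, e.g.
#   opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.5, 0.999))
#   opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
# and small changes to those learning rates can destabilize training entirely.
```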
00:48:01
Yeah, I've certainly fought with those myself. I've done GAN stuff. I haven't done these more, yeah, these more stable approaches that you're working on. So, in the same year that you published this Denoising Diffusion Probabilistic Models paper, this DDPM paper, you and your brother Paras Jain co-authored an article called Contrastive Code Representation Learning, and that's another big paper of yours. How does that paper relate to everything that we've been talking about so far?
Ajay Jain: 00:48:34
Yeah, so this was hearkening back to some of my research origins where I used to work on compilers. By chance in undergrad, I found a professor working in compilers and started working in that area and got very interested in performance engineering. I got to Berkeley and I was working on these language models and visual synthesis models and I realized that code was a ripe area where, if we could learn neural networks that could understand code, we would be able to use them all over our developer pipeline, scratch our own itch: fix bugs for us, detect issues, write types, automatically summarize complicated, hairy codebases, because researchers don't write the cleanest code. If I could get an AI to summarize some of that, it would be helpful.
00:49:16
And so ContraCode was a step in this direction, a kind of new way to learn representations, neural representations of software. It was a very small community at the time. This area has exploded in popularity with the advent of Copilot and OpenAI Codex, models that can synthesize code, but it's a little bit of an old field with a lot of work. And so ContraCode, this project I worked on with Paras, who is also my co-founder by the way, we were working on a method that could represent code in a way that's more robust than past approaches. By robustness, what I mean is that let's say you take a function that needs to do something like sort an array, there are many different ways to implement that same function. That function can be implemented with different algorithms. It can be implemented with different variable names. It can be implemented with comments or without comments, but as far as the neural network is concerned, these all have the same functionality. At the end of the day, they implement the same functionality, and so ContraCode was a method structured to learn the same representation regardless of how you implemented the function, as long as it had the same functionality. In doing that, we were able to demonstrate a lot more robustness against little changes that people would make, little changes that would otherwise completely change the behavior of your learning algorithm.
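A sketch in the spirit of ContraCode, though not the paper's actual code: a hypothetical `encoder` maps source strings to embedding vectors, a toy augmentation stands in for the compiler-based transformations, and an InfoNCE-style contrastive loss pulls two views of the same program together while pushing different programs apart:

```python
# Contrastive learning over code: two augmented "views" of each program
# (e.g. comments stripped, a variable renamed) should land near each other
# in embedding space, while other programs act as negatives.
import re
import torch
import torch.nn.functional as F

def toy_augment(source: str) -> str:
    source = re.sub(r"#.*", "", source)          # strip comments
    return source.replace("result", "tmp_0")     # rename one variable (toy stand-in)

def contrastive_loss(encoder, programs, temperature=0.07):
    view_a = torch.stack([encoder(p) for p in programs])               # (N, d)
    view_b = torch.stack([encoder(toy_augment(p)) for p in programs])  # (N, d)
    view_a = F.normalize(view_a, dim=-1)
    view_b = F.normalize(view_b, dim=-1)
    logits = view_a @ view_b.t() / temperature    # pairwise similarities
    targets = torch.arange(len(programs), device=logits.device)  # diagonal positives
    return F.cross_entropy(logits, targets)
```

In the real method the augmentations come from compiler tooling that rewrites a program into many functionally equivalent forms, which is far stronger than this toy text substitution.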
Jon Krohn: 00:50:40
Very cool. And so yeah, an early effort at these kinds of code suggestion tools that now are abundant and make it so much easier and so much more fun to write code. It's been game-changing for me. I can do things so much more quickly, particularly with slightly older libraries that are already embedded into the model weights of GPT-4. Then it's just so easy to be like, I got an error, just please fix it. And actually you do get really helpful learning steps along the way so you can really now just dive into, you're like, "Oh, cool, there's this package that I want to learn how to use. I've got this project that would allow me to learn that package," and you can just dive right into it. You have this instructor that is able to really help you out and in a really friendly way. I love how friendly GPT-4 is with me when I make mistakes. It's so like, "Oh, I can see why you did that, good effort. If you just make this tiny change"-
Ajay Jain: 00:51:46
Working with a human, it's tough to work with people. And so I think OpenAI has done a great job calibrating some of the expectations of the model against people's expectations.
Jon Krohn: 00:51:59
Yeah, yeah, for sure. It's so friendly. But yeah, related to this, how do you think these kinds of code reading or I guess even natural language reading models, what are the implications for security analysis or the potential vulnerabilities in adversarial settings?
Ajay Jain: 00:52:20
There are tons of vulnerabilities. I mean, you can dramatically affect the behavior of a neural code or natural language model by changing the phrasing of your question or by introducing extra tokens and text into the context. And it's not always clear that scaling the model addresses all these vulnerabilities. Some of them are pretty fundamental and there always will be avenues to attack and affect the behavior of the model. If it wasn't possible to affect the behavior of the model, it wouldn't be able to understand what you write, because fundamentally, the rest of your code base or the rest of your text expresses some meaning that the model needs to understand. So, it has to be sensitive to that. However, there are certain classes of changes that we can make our models more robust to, things like different ways of expressing the same functionality, or subtle perturbations to the data. We can make our models more robust to that little by little. Some of this is through data, by training on more diverse data. Some of it is architectural, in cases where you can make it a little bit more robust.
00:53:25
I think a lot of the work we did in ContraCode on loss function changes that would make the model more robust turned out to not really be required at scale. At scale, by scaling up our data, a simpler loss function can learn similar things, because by seeing a hundred different implementations of merge sort, the model will automatically learn similar representations, even with the GPT objective. But if you have a smaller dataset, or you want to fine-tune your model and specialize it to a new circumstance like your enterprise's data, objectives like ContraCode's become pretty useful, in that you can handle having a much smaller dataset yet still retain that robustness.
Jon Krohn: 00:54:11
Yeah. Hearing you say it like that, are there people using this commercially now, or are there open-source implementations that are easy for people to use if they want to be doing that kind of fine-tuning on their own enterprise data with ContraCode? Can they do that today?
Ajay Jain: 00:54:24
So, we have an implementation of ContraCode. It's research-ware, so we haven't touched that repository in a couple of years, but it's on GitHub. It's possible that people are using some of these ideas, but I don't personally know. One of the core things there is a data augmentation approach for programming languages, where we do data augmentation by recompiling the code into a different format automatically. So, if anything is being used out of it, I would imagine something similar could be used in industry for code language models, where you take your dataset, which [inaudible 00:54:57] a small fine-tuning dataset, and there are off-the-shelf compiler tools which will rewrite that code in a hundred or a thousand different ways, and you could train your model on that augmented set.
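As a concrete illustration of that augmentation idea, here is a small, hypothetical sketch using Python's standard ast module to produce a behavior-preserving variant of a function. ContraCode itself used compiler tooling for JavaScript, so treat this as an analogy rather than their pipeline; the rewrites shown (renaming locals, dropping a docstring) are just two of the simplest possible transformations.

```python
# A minimal sketch of semantics-preserving code augmentation with Python's ast module.
# Real systems use compiler tooling (e.g., for JavaScript); this toy version just
# renames local variables and strips the docstring to produce an equivalent variant.
import ast

class RenameLocals(ast.NodeTransformer):
    def __init__(self, mapping):
        self.mapping = mapping           # old variable name -> new variable name

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

def augment(source: str) -> str:
    tree = ast.parse(source)
    func = tree.body[0]                  # assume a single function definition
    # Collect argument names and assigned locals, then give them fresh names.
    local_names = {a.arg for a in func.args.args}
    local_names |= {n.id for n in ast.walk(func)
                    if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    mapping = {name: f"v{i}" for i, name in enumerate(sorted(local_names))}
    RenameLocals(mapping).visit(tree)
    # Drop a leading docstring if present (another behavior-preserving change).
    if (func.body and isinstance(func.body[0], ast.Expr)
            and isinstance(func.body[0].value, ast.Constant)
            and isinstance(func.body[0].value.value, str)):
        func.body = func.body[1:] or [ast.Pass()]
    return ast.unparse(tree)             # requires Python 3.9+

original = '''
def bubble_sort(items):
    """Sort a list in place."""
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items
'''
print(augment(original))   # same behavior, different surface form
```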
Jon Krohn: 00:55:08
Very cool. It's amazing how many different areas of generative modeling you've touched on in just these relatively few years that you've been doing research. It's wild. Another area that is super fascinating to me, and this is something that you were doing before Berkeley, is that at Google Brain you worked on Generative Neural Radiance Fields. The short form for that is NeRF, like the NeRF guns: capital N, lowercase e, capital R, capital F. So, Neural Radiance Fields, and these generative neural radiance fields, generative NeRF, allow 3D scene representation, which obviously we've talked about a lot already in this episode. How did the NeRF stuff lead to the kind of 3D object generation stuff that you're doing today? Yeah, I'd love to hear a bit more about that.
Ajay Jain: 00:56:04
Let me give a little bit of background on NeRF. So, NeRF is a Berkeley homegrown paper that came out around the same time as our DDPM paper. What a NeRF is, is a representation of a 3D scene in a neural network's weights. With a normal neural network, you picture a model which can take in data, output some predictions, and generalize to new data. During training, it's trained on certain input-output pairs, and it can generalize at test time. A NeRF is actually extremely overfit. It's basically like a JPEG, where you take some neural network representation of a scene and you pack the visual content for a particular scene in the world into that network.
00:56:48
You can sort of imagine, let's say you have this 3D environment, a room, an indoor house. You can represent that with a bunch of photos. You can represent it with these meshes that I talked about, but those meshes are a human-created representation. A NeRF would automatically learn a representation of that scene that stores the colors, the wall is white, the guitar is yellow, but it would also store some elements of geometry: at this coordinate X, Y, Z, there's a lot of matter, there's something here that's absorbing light; in this other area, X, Y, Z is free space. So, really, it's a lookup function mapping from X, Y, Z coordinates in space to color and to density telling you the amount of matter in that space. That's what a NeRF is. I think of a NeRF kind of like that JPEG. It's just a representation of a 3D scene that allows you to interpolate really nicely to new camera poses. Unlike a JPEG that expresses a single perspective, with a NeRF, once you train it, you can render it from a different perspective and interpolate between the different input images. And so how does this connect to generative media?
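For a concrete picture of that lookup function, here is a minimal sketch of a coordinate network in the spirit of the NeRF paper. The layer widths are arbitrary, and it leaves out view-direction inputs, hierarchical sampling, and the volume renderer that turns these point queries into images.

```python
# A minimal sketch of the NeRF "lookup function": (x, y, z) -> (RGB color, density).
# This omits view directions, hierarchical sampling, and the volume renderer that
# turns these queries into images; layer widths here are arbitrary.
import torch
import torch.nn as nn

def positional_encoding(xyz, num_freqs=10):
    """Map coordinates to sines/cosines at increasing frequencies, as in the NeRF paper."""
    feats = [xyz]
    for i in range(num_freqs):
        feats.append(torch.sin((2.0 ** i) * torch.pi * xyz))
        feats.append(torch.cos((2.0 ** i) * torch.pi * xyz))
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    def __init__(self, num_freqs=10, hidden=256):
        super().__init__()
        in_dim = 3 + 3 * 2 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # 3 color channels + 1 density
        )

    def forward(self, xyz):                # xyz: (num_points, 3)
        out = self.mlp(positional_encoding(xyz))
        rgb = torch.sigmoid(out[:, :3])    # colors in [0, 1]
        sigma = torch.relu(out[:, 3:])     # non-negative density: "how much matter is here"
        return rgb, sigma

# Query the scene representation at a few 3D points.
model = TinyNeRF()
points = torch.rand(5, 3)                  # five (x, y, z) coordinates
rgb, sigma = model(points)
print(rgb.shape, sigma.shape)              # torch.Size([5, 3]) torch.Size([5, 1])
```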
Jon Krohn: 00:57:52
Yeah. How does that create, so, I mean it is obvious the kind of connection there. So, you are using a neural network to store information about a scene so that regardless of where in the scene or what angle in the scene you want to render, all the information that's needed is there in the NeRF representation. And so it's actually, it's pretty obvious to me how that's useful for the kinds of applications that we were talking about much earlier in the episode where if you want to be rendering scenes for a film or you want to be rendering 2D images of some 3D space, this is going to allow you to do that. I guess my question is how does that NeRF work relate to the kind of work you've been doing more recently at Genmo, for example, or at Berkeley? I suspect that there's some kind of connection, there's some kind of continuity and improvements over the years.
Ajay Jain: 00:59:02
So, I think all of this is connected to this vision of creative general intelligence. These are different instantiations of it. In what I see as that general-purpose creative model, part of the model learns how images work by denoising images; it learns the content of visual worlds, but it doesn't know anything about motion, and it doesn't explicitly know anything about 3D geometry. We also train models on video. Those models know more about geometry because they see objects moving and cameras moving; they know how objects move, so they learn some interesting things. Then we develop a lot of algorithms that can take these general-purpose visual priors, which have learned how the world looks and how the world works at an abstract level, and distill them down into something low-level like a NeRF. I call NeRF low-level because, again, it's just storing the contents of the scene. We need these really powerful generative models that learn how the world works, and then powerful algorithms that we develop at Genmo to distill these visual priors into this interpretable 3D representation. And so that's kind of like a post-processing step: take this foundation model we are developing and then extract out, not an image, but a 3D scene.
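To sketch what that kind of distillation can look like in code, here is a heavily simplified loop in the spirit of DreamFusion-style score distillation. The renderer and the "pretrained" diffusion prior below are toy stand-ins so the example runs end to end; Genmo's actual algorithms are not public, so every component here is an illustrative assumption.

```python
# A heavily simplified sketch of distilling a 2D diffusion prior into a 3D representation,
# in the spirit of score distillation (DreamFusion). The "renderer" and "diffusion model"
# below are toy stand-ins so the loop runs end to end; they are NOT the real components.
import torch
import torch.nn as nn

class ToyDiffusionPrior(nn.Module):
    """Stand-in for a pretrained, frozen text-to-image diffusion model's noise predictor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, noisy_image, t):
        return self.net(noisy_image)        # pretend this predicts the added noise

def render_toy_scene(scene_params):
    """Stand-in renderer: in reality this would volume-render a NeRF from a random camera."""
    return torch.sigmoid(scene_params)       # (1, 3, H, W) "image"

diffusion = ToyDiffusionPrior()
for p in diffusion.parameters():
    p.requires_grad_(False)                   # the prior stays frozen

scene_params = nn.Parameter(torch.randn(1, 3, 64, 64))   # stand-in for NeRF weights
optimizer = torch.optim.Adam([scene_params], lr=1e-2)

for step in range(100):
    image = render_toy_scene(scene_params)
    t = torch.randint(1, 1000, (1,))                      # random diffusion timestep
    noise = torch.randn_like(image)
    alpha = 1.0 - t.item() / 1000.0                       # crude noise schedule
    noisy = alpha ** 0.5 * image + (1 - alpha) ** 0.5 * noise
    eps_pred = diffusion(noisy, t)
    # Score distillation: push the rendering toward images the prior finds likely,
    # by backpropagating (eps_pred - noise) through the rendering, not through the prior.
    grad = (eps_pred - noise).detach()
    optimizer.zero_grad()
    image.backward(gradient=grad)
    optimizer.step()
```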
Jon Krohn: 01:00:17
Very cool. That's awesome. So, yeah, they're related in the sense that we're still talking about reconstruction of a real-world scene, but the NeRF stuff doesn't actually generate; it's a map for re-rendering something that has already been captured. The Genmo stuff that you're working on today, though, could equally output pixels or this NeRF representation, and that NeRF representation would have much more flexibility, in the sense that somebody could take it and render it however they like. It's actually kind of interesting; it ties in maybe with that idea you were talking about earlier. If we wanted to have a bunch of different shots of the same scene in a film or a TV show and we want that consistency shot over shot, this kind of NeRF representation could be perfect for that.
Ajay Jain: 01:01:12
Yes, yes, absolutely. By synthesizing a sample that's 3D consistent, you're 3D consistent by default when you go ahead and render a camera trajectory, and a camera trajectory rendered out is a video. Now, there's still a gap here when we come to video and motion, because these NeRFs are static; they don't actually express motion. But this is some of what we work on at Genmo. We built a foundation model, and then for different customers that want a particular format, we build these algorithms that can extract out that really high-fidelity version.
Jon Krohn: 01:01:46
Very cool. And so, going even further back into your career history, all the way back to five years ago, which seems like forever in AI time, you were working at Uber. I guess that was an internship, but you were there for a while, like a nine-month internship or something, working in their Advanced Technologies Group, ATG. You were a machine learning researcher there working on self-driving cars, and specifically you were forecasting pedestrian behaviors. So, this is super cool, and it's obvious to me why this is so important. If you want to have a self-driving car, you need to be able to predict: if there's somebody walking on the sidewalk, parallel to the road and in the same direction as the sidewalk, that's a very different kind of signal to somebody standing on the sidewalk and then walking into the road. Without having spent any time on this kind of self-driving problem myself, it seems intuitive to me that you need an AI system that can recognize that and notice it in advance, and say, okay, that pedestrian is 200 yards away, but that behavior of stepping into the road is a very different kind of signal from dozens of other pedestrians who are just walking along the sidewalk.
Ajay Jain: 01:03:17
Yeah, this is really a proving ground for developing these really capable foundation models that know how people behave, which is also what you need to synthesize video. At Uber ATG, we weren't interested in synthesizing pixels. What we were interested in was forecasting behavior, and the reason you need to forecast behavior is to be able to plan. There are multiple steps in the self-driving pipeline. There are the sensory inputs; there's perception of how the world is at this instant: where are all the objects, what is their spatial relationship, what is that object? Then there's the problem of forecasting: how are those objects going to interact and behave, whether those objects are static, like a traffic light or a parked car, or whether they're people interacting with each other. Once you have this complete picture of the past, the present, and the future, now you can run planning algorithms and robotics stacks to try to predict a safe trajectory for your vehicle to navigate that world.
01:04:17
But predicting the future is really critical, right? I think people do this inherently. When we walk down the street and you see someone coming towards you, you don't want to bump into them, so you need to be able to predict how they're going to behave; you can't just assume that they're going to stand still. We built generative models, honestly not that different from GPT, that could predict how people will behave over a span of time: whether they're going to stay on the sidewalk, whether they're going to turn to avoid you, whether they're going to cross the street. This was a really interesting problem space that got me hooked on this problem of forecasting the future and learning generative models of the world. It still remains to this day a very important research area and one of the core problems in self-driving.
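As a toy illustration of framing behavior forecasting as sequence modeling, here is a minimal sketch that trains a small recurrent network on synthetic 2D trajectories to predict the next displacement. It is a deterministic stand-in: the systems Ajay describes are generative and model distributions over possible futures, and nothing here reflects Uber ATG's actual stack.

```python
# A minimal sketch of trajectory forecasting as sequence modeling: given a pedestrian's
# past (x, y) positions, predict the next displacement at each step. This toy GRU on
# synthetic data is illustrative only; real systems predict distributions over futures.
import torch
import torch.nn as nn

class TrajectoryForecaster(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)       # predicted (dx, dy) for the next step

    def forward(self, past_xy):                # past_xy: (batch, T, 2)
        out, _ = self.gru(past_xy)
        return self.head(out)                  # (batch, T, 2): next-step displacement

model = TrajectoryForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake trajectories: (batch, T+1, 2) positions; train to predict the next displacement.
trajectories = torch.cumsum(0.1 * torch.randn(32, 13, 2), dim=1)
past, future = trajectories[:, :-1], trajectories[:, 1:]
target_displacements = future - past

for step in range(200):
    pred = model(past)
    loss = nn.functional.mse_loss(pred, target_displacements)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Roll the model forward a few steps from the last observed position to get a forecast.
with torch.no_grad():
    window = past[:1].clone()
    for _ in range(5):
        next_pos = window[:, -1:] + model(window)[:, -1:]
        window = torch.cat([window, next_pos], dim=1)
```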
Jon Krohn: 01:05:06
No doubt. So, when we think about this, when you're thinking about pedestrian behaviors, does that involve, yeah, I guess, actually, no, I just kind of answered my own question. I was thinking, does that include if they're in a car, I mean, I guess if they're stepping out of a car, that starts to then kind of become, it bridges this world, but I guess you'd otherwise you'd probably have some kinds of models of vehicle behavior where even though there's a human in the vehicle, they're not a pedestrian because pedestrian by definition is somebody walking. I think whatever the Greek or Latin root is right there in the "ped" part of the word. But yeah, there's this interesting transition state between being in a car and having that vehicular behavior and then becoming a pedestrian, which actually might be particularly tricky to bridge. But yeah, I don't know. I haven't really asked a question, but you might have something interesting to say anyway.
Ajay Jain: 01:05:59
Yeah, yeah, yeah. No, I think those are some great observations. It feels like an arbitrary distinction, right, between different categories of objects. Why are we having different interns working on different categories of objects? Does it make sense? Well, I would argue it does and it doesn't. It makes sense because the challenges are different. It makes sense because a pedestrian moves much slower, so you need architectures that are much finer-grained and look at a smaller region of the world. It makes sense because pedestrians have social interactions. They're also more agile, whereas vehicles have constraints on how they can move and operate over longer distances. So, there are some practical reasons why they're different. But conceptually, as a person, we share a lot of the same underlying neural machinery for understanding whether a biker is going to behave a certain way, whether a pedestrian is going to behave a certain way.
Jon Krohn: 01:06:47
Exactly.
Ajay Jain: 01:06:48
A swimmer, a kayaker, or a car. A lot of that machinery is shared substrate, and this gets back to creative general intelligence: we should be learning foundation models that can understand all these categories of objects and people and behaviors in one unified way. Then, if we need to extract a certain subset of that capability, like a really good pedestrian predictor, that can be read off of the same underlying model. And I think this is the direction the industry is moving towards, more general-purpose models, but the transition is slower because there are different practical constraints that need to be met.
Jon Krohn: 01:07:27
For sure. Yeah. There are safety issues in that kind of world that are different from your world: if a pixel gets rendered incorrectly, maybe you just don't use that sample and you generate another one. But in driving, you can't just generate another pedestrian.
Ajay Jain: 01:07:49
One chance.
Jon Krohn: 01:07:51
Cool. So, yeah, we've covered this really interesting arc, from the Uber ATG stuff, where you were forecasting pedestrian behavior and, even though you're not rendering pixels as an output, the model needs to be able to represent the world in a similar kind of way to when you're rendering video like you are today. From Uber, you have the NeRF stuff, where you have these neural representations of 3D scenes. At BAIR, you have the ContraCode project and the diffusion model stuff, where you're more stably rendering visuals with that kind of algorithm relative to GANs. And all of that together brings us nicely towards the creative general intelligence stuff, the truly groundbreaking, state-of-the-art work that you're doing at Genmo today. Over your entire research career, is there a paper that you're most excited about or most proud of? Is there some kind of research contribution that really stands out to you, maybe one that we haven't covered yet?
Ajay Jain: 01:08:56
That's a good question. In terms of impact metrics, the Denoising Diffusion Probabilistic Models paper was an extremely impactful project, and I'm proud that the community has taken notice and grown around it. That took a lot of time, and it made me realize that when something is really new, it's one thing to move a benchmark and another thing to change the way people work; it took quite a lot of time to change the types of methods people used in practice. So, that's obviously one of the most impactful things. Another thing I'm proud of is this 3D arc of work that we talked about. We talked about NeRF. That was a sequence of two years of trying to make 3D synthesis better and better, and it culminated in this project at Google Brain, DreamFusion, where you could put in a caption and get out a high-fidelity 3D object from diffusion models.
01:09:51
In terms of one project that I like and that I think is a little under-appreciated: I had a hard time staying away from compilers. That's partly why ContraCode happened. There was another compiler project in there called Checkmate. We came up with a flashy name: Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization. This is another project with my co-founder and brother Paras, which he led. The idea is that as all these neural networks we're training get bigger and bigger, learning all these things about the world, it's getting really hard to train them from a systems perspective. And so that project tried to make it so that researchers like us, who at the time, before the startup and in academia, had just a couple of GPUs, could actually train big models. How do we do that? We built a system that reduces the memory consumption of these neural networks a lot. So, you could take a model which would take, let's say, 80 gigabytes of GPU memory and train it with only 16 gigabytes of GPU memory. Now your lab can save tens of thousands, hundreds of thousands of dollars on hardware. That work has had some impact. It's now a very common technique to reduce memory consumption with some of these algorithmic ideas, but because it's a systems paper, I don't talk about it too much; I focus more on the ML side.
Jon Krohn: 01:11:19
But this is something that immediately seems super interesting to me. I mean, I'm at a small startup. We're trying to spend as little as possible but train the biggest models that we can. Something we think about constantly is model serving, which can be many orders of magnitude more expensive than your training, because you only need to train the model once, but once it's in production, hopefully you have a lot of users, they're going to be calling that API a lot, and you're going to end up spending tons of money on inference. So, 99% of your cost in training and running a model is going to be on the running part.
01:12:01
So, you're specifically talking about the training problem here. But even there, I would love not to have to rent as many beefy GPU servers in Google Cloud and to be able to train more cheaply, because that also allows you to iterate more quickly, iterate more recklessly. If you're not too worried about an experiment maybe being throwaway, you can do more experiments, and some percentage of those are going to end up being really high-reward risks. You mentioned that some people are implementing this kind of stuff. What can I be doing today to train models with a smaller memory footprint and save some money?
Ajay Jain: 01:12:50
Yeah, there are a lot of things. Some of this technology has trickled down into deep learning frameworks, so your deep learning framework of choice has some functionality for this. What we had in that project was a way to optimally select certain layers to compute twice. When you're training, you typically run each layer once during a forward pass, and you run each layer once during a backward pass to actually compute updates. We said that you can recompute some of these layers, run them twice, so it increases the computational cost, but in doing so, you don't need to store the output of that layer; you can delete it. And it turns out that GPUs have gotten a lot faster, but their memory capacity, the amount of memory they have, has increased more slowly. So, that's a perfect trade-off to make.
01:13:37
So, even if you do have a high-memory GPU, you might not be utilizing it fully. You might want to train with a bigger batch size, so it's still worth it to make this trade-off and recompute some layers. You can double your batch size, triple your batch size, use half the GPUs that way. If you want to get started there, PyTorch has a function called checkpoint. That's checkpointing not in the sense of storing a checkpoint of the model, but rather a function that you can wrap certain layers of your model with so their activations aren't stored and they get recomputed during the backward pass. The downside, though, is you have to manually select which layers to recompute, and that takes a little bit of black magic. Our project, again, was research-ware that we never ended up fully integrating with deep learning frameworks, but it would select those layers for you automatically, in the optimal way.
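Here is a small sketch of that built-in PyTorch functionality, torch.utils.checkpoint. Which blocks to wrap is a manual choice in this example, whereas a Checkmate-style tool would pick the rematerialization schedule for you; the model itself is just a toy stack of linear layers.

```python
# A small example of activation checkpointing in PyTorch: wrapped blocks do not store
# their intermediate activations; instead they are recomputed during the backward pass,
# trading extra compute for a smaller memory footprint.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def make_block(dim=1024):
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim), nn.ReLU())

class CheckpointedNet(nn.Module):
    def __init__(self, num_blocks=8, dim=1024):
        super().__init__()
        self.blocks = nn.ModuleList([make_block(dim) for _ in range(num_blocks)])
        self.head = nn.Linear(dim, 10)

    def forward(self, x):
        for block in self.blocks:
            # Which blocks to wrap is a manual choice here; Checkmate-style tools
            # would select the set of layers to rematerialize automatically.
            x = checkpoint(block, x, use_reentrant=False)
        return self.head(x)

model = CheckpointedNet()
x = torch.randn(64, 1024)
loss = model(x).sum()
loss.backward()   # activations inside each wrapped block are recomputed here, not stored
```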
Jon Krohn: 01:14:21
Oh, wow. Wow. It sounds like a big opportunity there. Maybe there's some listener out there who can take advantage of the open-source research-ware you've already made and incorporate that in. Yeah, that'd be really cool. Wow, awesome. So, yeah, those are some great practical tips: increasing your batch size to make sure you're taking advantage of the hardware that you have, and then this PyTorch layer checkpointing sounds like a great idea. More generally, getting into some general questions now: it is mind-blowing to me, the breadth and the impact of the achievements that you've had in your career already. What kinds of tips do you have for our listeners who would like to become tremendous AI researchers or AI entrepreneurs like you are? And maybe something that would help us frame your guidance on that would be: how did you end up deciding to do what you've been doing for a living? How did this all come about?
Ajay Jain: 01:15:28
Well, I wanted to be a designer for a little bit back in college, but I failed [crosstalk 01:15:34]
Jon Krohn: 01:15:34
Graphic designer?
Ajay Jain: 01:15:35
Sorry, yeah, like a graphic designer. There's this design studio called IDEO, a big consultancy. They had a presence in Massachusetts, where I was doing college, and I started to go to some of their events. I started to take drawing classes, architecture classes, media design. I really loved it. Then I tried to get a design internship at Figma, actually, in case there's a Figma listener out there, but I could only get a job as a frontend engineer. And I love that stuff, love doing it, but I thought, this isn't going to let me express myself visually as much as I would like. So, I ended up focusing more on research to make it easier to make graphics. I think what I always found helpful getting started in research was to have kind of a toy problem, a challenging problem, in my mind at any given time, whether that would be something like "How do we make high-fidelity imagery?" Let's say that's the problem.
01:16:38
I would go into the classes I was taking, or into conversations with other researchers, with that framing in mind, so I could recast the different tools I heard about and figure out how I could apply them to that problem of choice. And 99% of the time, the things you hear about aren't useful for your problem, but in the 1% of the time that something is useful, it can be the make or break. It also kept me motivated to wade through all the stuff that wasn't connected to my core problem, through all these classes and the PhD and beyond, by being anchored to a core problem, which was getting to human-level, high-fidelity synthesis. Another tip would be to actually implement things. The difference between something not working, completely failing, and then getting stunning results is sometimes a couple of implementation decisions and really well-tuned software. So, if you're an engineer and you're a little bit intimidated by the AI, the stats, and the math, do know that really good engineering skills are critical to making these kinds of systems work. It's worth investing in those and worth getting your hands dirty, because AI software is just another type of software.
Jon Krohn: 01:17:53
Awesome tips, for sure. Being able to really get your hands dirty with the software is going to make you a much better AI researcher or data scientist today, especially as datasets get bigger and bigger, our models get bigger and bigger, and even the DevOps around training these models gets more and more complex. The better your software skills are, the better off you are, for sure. And that relates to things like the kind of hiring you're doing. I know that you're hiring engineers and research engineers, and when I ask guests on the show if they're doing any hiring, they're almost always hiring engineers, but open data scientist roles, the kind of standalone data scientist, those are relatively rare. So, yeah, I couldn't agree more. I've also got a question for you about the hiring that you're doing, but before I get there, I think there's an interesting thing that I just want to point out, which is kind of cool to call out explicitly: you wanted to become a graphic designer, you couldn't get the job that you wanted, and so now you've created an AI system that does the job automatically.
01:19:03
That is, you've now created AI systems that already exist today and that people can use for free, and in the coming months and years they're going to get more and more powerful, allowing people to take a natural language input and do graphic design. So, yeah, you've really showed them.
Ajay Jain: 01:19:24
Well, the way I like to think about it [inaudible 01:19:26], the reason I didn't get the job is that I probably was not qualified. I was doing all this coding, and I think what this technology does is it lets people like me, who want to create but just don't have the skills even after many classes, start to create, whether it's a hobby or whether it's for work. And that's what I think is the beautiful thing. Technology and art are closer than most people think, and by leveling up the technology to enable new forms of art, we also enable new forms of work. But yes, it was easier for me to write code to make visual content than for me to actually create it by hand. You don't want to see some of those early proto-drawings.
Jon Krohn: 01:20:14
Very nice. All right, so back to the hiring thing. What do you look for in the people that you hire? What makes a great fit? I mean, what you're doing at Genmo is right at the cutting edge, where the things that you come up with at your startup are driving what's possible in the world in 3D rendering and video rendering. I'm sure there are tons of listeners who would love to be working at a company like that. What does it take?
Ajay Jain: 01:20:42
Yeah, definitely. We are ramping up our hiring a lot right now, so there are a bunch of opportunities. You need to be able to work in a fast-paced startup environment. You need to be excited by some of this vision. I think one of the core things we look for at Genmo is people who are able to see an ambitious future and make forward progress on it, break it down into steps. We set the vision, but we trust our people a lot to solve really hairy problems. I think that's one of the things research trains you to do, but engineers are also very familiar with taking a seemingly insurmountable problem, figuring out how to break it down, and breaking through those walls. We are hiring for product engineers, front-end engineers, and full-stack infrastructure engineers, and on the research and development side, we are also recruiting for our Oceans team. Our Oceans team is responsible for large-scale data infrastructure, in particular curating and improving the quality of the datasets that we use in our models. We are additionally hiring for research scientists. I know a lot of research scientists from my network, but if there are people listening to this podcast who are looking for research scientist roles, we are actively growing our research team as well.
Jon Krohn: 01:22:03
Awesome. All right, Ajay, this has been a tremendous conversation for me to be able to enjoy. I'm sure our listeners have as well. Before I let you go, do you have a book recommendation for us?
Ajay Jain: 01:22:16
Yeah. I worked in Kevin Murphy's team back at Google and I have to give a shout-out to his book-
Jon Krohn: 01:22:21
Oh, really?
Ajay Jain: 01:22:23
Probabilistic Machine Learning. There's a new edition out this year. Great guide, great textbook. They have one of the [inaudible 01:22:33] from our paper, so I have to recommend it. I also liked The Code Breaker, the biography of Jennifer Doudna and the development of CRISPR. It points out a lot of the subtleties in technology development, the people behind the technology, the scientists, as well as the impacts of the work they do, and that's a huge topic to keep in mind as we work on AI.
Jon Krohn: 01:22:58
Nice. Great recommendations. I haven't read this year's edition of Kevin Murphy's book yet, but an earlier edition was certainly helpful for me in my machine learning journey. An excellent machine learning textbook. Cool that you were able to work with him. It happens in this industry; it is smaller than you think. But it's wild to think that those kinds of names, like Kevin Murphy, to me it's just a name, but to you it's an actual person.
Ajay Jain: 01:23:26
He's a great guy.
Jon Krohn: 01:23:28
Awesome. So, yeah, so if people want to be able to follow your thoughts after this episode, where should they follow you?
Ajay Jain: 01:23:35
Yes, so if you want to reach out about the roles, you can email hi@genmo.ai. You can follow me on Twitter @ajayj_, I post stuff. @genmoai is our handle on all social media platforms as well.
Jon Krohn: 01:23:50
Nice. All right, thank you Ajay. This has been such a great episode. I can't wait to see where the Genmo journey takes you next. Maybe in a few years you can pop back on and fill us in on those full-length movies that you're rendering.
Ajay Jain: 01:24:05
Love it. We can generate the visuals for this episode.
Jon Krohn: 01:24:09
Sounds great.
Ajay Jain: 01:24:11
All right. Take care.
Jon Krohn: 01:24:17
Truly incredible what Ajay has accomplished in his young career already. There's a terrifically impactful future ahead for him, for sure. In today's episode, Ajay filled us in on how his generative models can create cinemagraphs by allowing you to automatically animate a selected region of a still image, and how his diffusion approach laid the foundation for well-known text-to-image models like DALL·E 2, as well as for the first text-to-3D models. He also talked about how 3D generations are useful for video editing pipelines today, but increasingly models will be able to go effectively from natural language directly to video pixels. He talked about how in the coming years we'll likely be able to render compelling 30-second video clips, making it even easier than it is today to stitch together a feature-length film from generated video. And he filled us in on how Generative Neural Radiance Fields (NeRF) enable neural networks to encode all of the aspects of a 3D scene so that perspectives of the scene can be rendered from any angle.
01:25:11
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Ajay's social media profiles, as well as my own, at superdatascience.com/711. Thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another horizon-expanding episode for us today. You can support this show by checking out our sponsors' links, by sharing, by reviewing, by subscribing, but most of all, just keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science podcast with you very soon.