
7 minutes

Data Science, Artificial Intelligence

SDS 624: Imagen Video: Incredible Text-to-Video Generation

Podcast Guest: Jon Krohn

Friday Nov 04, 2022

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


Imagine all the internet videos… On this week’s Five-Minute Friday, Jon Krohn investigates Imagen Video, Google’s latest model for making video art out of text prompts.
 

Imagen Video is the latest model in Google’s expanding toolkit. Recently published, this text-to-video model arrives on a scene with already strong text-to-image competitors like DALL-E 2. Unlike DALL-E 2, Imagen Video goes one step further: when you enter a text prompt, the model returns moving images, or ‘time-based media’.

The videos themselves aren’t especially high quality—yet. They look like something out of the mid-90s internet: psychedelic flashes and splashes on the screen. But they are undoubtedly clever interpretations of the text prompts they’re given, and progress toward higher-quality visuals continues to move swiftly.

In this episode, Jon breaks down how Google’s researchers achieved this new feat of algorithm engineering through a combination of a T5 text encoder, a base diffusion model, and interleaved spatial and temporal super-resolution diffusion models. If you’re looking to play with the tool, you’re out of luck: the team behind Imagen Video decided not to release the model or source code due to the potential for creating abusive content. Subscribe, and we’ll keep you informed of the latest developments in AI and ethics, and whether tools like Imagen Video can ever be released for everyone to use safely and responsibly.

Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.

(00:03): This is Five-Minute Friday on the new Imagen Video model.

(00:19): In previous episodes of the SuperDataScience Podcast, such as #570, I’ve discussed DALL-E 2, a model made by the research outfit OpenAI that creates stunningly realistic and creative images based on whatever text input your heart desires.

(00:36): For today’s Five-Minute Friday episode, it’s my pleasure to introduce you to the Imagen Video model, published just a few weeks ago by researchers from Google.

(00:45): First, let’s talk about the clever name: While pronounced “imagine” to allude to the creativity of the model and the users who provide text prompts to it, the Imagen model name is a portmanteau of the words “image” and “generation” (spelt I-M-A-G-E-N), which is rather sensible given that the model generates images. Get it? Image Gen. The original Imagen model was released earlier this year and — like the better-known but perhaps not better-performing DALL-E 2 — it generates still images. The new Imagen Video model takes this generative capacity into another dimension — the dimension of time — by generating short video clips of essentially whatever you prompt it to create.

(01:38): For example, if you prompt Imagen Video to generate a video of “an astronaut riding a horse”, it will do precisely that. If you prompt Imagen Video to generate a video of “a happy elephant wearing a birthday hat walking under the sea”, well, then of course it will precisely create that video for you too! In the show notes, we’ve provided a link to a staggering 4x4 matrix of videos created by Imagen Video that I highly recommend you check out to get a sense of how impressive this model really is.

(02:10): Under the hood, Imagen Video is the combination of three separate components, which I’ll go over here in succession. So this part of Five-Minute Friday is going to be pretty technical. The first component is something called a T5 text encoder. This is a transformer-based architecture that infers the meaning of the natural-language prompt you provide to it as an input. You can check out episode #559 of the SuperDataScience podcast to hear more about transformers, which have become the standard for state-of-the-art results in natural language processing and, increasingly, in machine vision too. Interestingly, this T5 encoder is frozen during training of the Imagen Video model, so the T5 model weights are left unchanged by training. This means that T5’s natural language processing capabilities are used “out of the box” for Imagen Video’s purposes, which is pretty cool and shows how powerful and flexible T5 is, like many of these transformer-based large language models. Ok, so that’s the first component: the T5 text encoder, which is used to understand the natural-language prompt we provide to Imagen Video.
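To make the frozen-encoder idea a little more concrete, here is a minimal sketch of extracting prompt embeddings from a frozen T5 encoder with the Hugging Face transformers library. This is not Imagen Video’s actual code (which hasn’t been released), and the small “t5-small” checkpoint stands in for the much larger T5-XXL encoder the paper describes; the point is simply that the encoder is loaded, frozen, and used as-is.

# Minimal sketch (not Imagen Video's released code): embed a text prompt
# with a frozen T5 encoder via Hugging Face transformers.
# "t5-small" is a lightweight stand-in for the T5-XXL encoder the paper uses.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

# Freeze the encoder: its weights are never updated while the video model trains.
encoder.eval()
for param in encoder.parameters():
    param.requires_grad = False

prompt = "an astronaut riding a horse"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # One embedding per token; the downstream diffusion models condition on these.
    text_embeddings = encoder(**tokens).last_hidden_state  # shape: (1, seq_len, d_model)

print(text_embeddings.shape)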

(03:18): The second component is something called a base diffusion model, which creates the basic frames of the video. This works similarly to the popular “autoencoder” architecture in that it deconstructs an image into an abstract representation. In the case of Imagen Video, this abstract representation looks like TV static. The model then learns how to reconstruct the original image from that abstract representation. Critically, the base diffusion model of Imagen Video operates on multiple video frames simultaneously and then further improves the coherence across all the frames of the video using something called “temporal attention”. Unlike some previous video-generation approaches, these innovations result in frames that make more sense together, ultimately producing a more coherent video clip. So, that’s the second component of Imagen Video. The first component was the T5 text encoder, which understands the meaning of the natural-language prompt we provide as an input; the base diffusion model then takes the information from that T5 step and, through its ability to convert an abstract representation into an image, converts it into a number of frames of video.
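As a rough illustration of the diffusion training objective described above, the following PyTorch sketch noises a tiny batch of video frames and trains a stand-in network to predict that noise. It is a deliberately toy example: the denoiser here is a single 3D convolution rather than Imagen Video’s video U-Net with temporal attention, and the conditioning on the T5 text embeddings is omitted entirely.

# Toy sketch of the diffusion idea behind the base model (a drastically
# simplified stand-in, not Google's architecture): frames are gradually
# corrupted with Gaussian noise ("TV static"), and a network is trained
# to predict that noise so the process can later be reversed.
import torch
import torch.nn as nn

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)          # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention factor

def add_noise(frames, t):
    """Forward process: blend clean frames with Gaussian noise at step t."""
    noise = torch.randn_like(frames)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    return a_bar.sqrt() * frames + (1 - a_bar).sqrt() * noise, noise

# Hypothetical denoiser: Imagen Video uses a video U-Net with temporal
# attention; a single 3D convolution stands in here for illustration only.
denoiser = nn.Conv3d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

# A batch of two tiny "videos": (batch, channels, frames, height, width).
frames = torch.randn(2, 3, 8, 16, 16)
t = torch.randint(0, T_STEPS, (2,))

noisy, true_noise = add_noise(frames, t)
pred_noise = denoiser(noisy)                           # predict the added noise
loss = nn.functional.mse_loss(pred_noise, true_noise)  # standard diffusion objective
loss.backward()
optimizer.step()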

(04:40): The third and final step of Imagen Video takes those frames and makes them high resolution. This is done with something called interleaved spatial and temporal super-resolution diffusion models. These work together to upsample the basic frames that were created in step two by the base diffusion model to a much higher resolution. Since this final stage involves working with high-definition images (much more data), memory and computational considerations are particularly important. Thus, this final stage leverages convolutions, a relatively simple operation that has become a standard in deep learning machine vision models over the past decade, instead of the more complex temporal attention approach of the earlier base diffusion model. All right, so a quick recap one last time of the three separate components of Imagen Video: the T5 text encoder understands the natural language, the base diffusion model takes that natural-language representation and converts it into simple frames, and then finally the interleaved spatial and temporal super-resolution diffusion models take those simple frames and convert them into high-resolution ones.
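To see what the spatial and temporal super-resolution stages do in terms of tensor shapes, here is a small sketch that uses plain interpolation as a stand-in for the learned super-resolution diffusion models: spatial upsampling stretches each frame’s height and width, while temporal upsampling increases the number of frames (i.e., the frame rate). The example resolutions are illustrative, not the model’s exact output sizes.

# Shape-level sketch of spatial vs. temporal super-resolution. In Imagen Video
# these stages are learned diffusion models; plain trilinear interpolation is
# used here only to illustrate which dimensions each stage upsamples.
import torch
import torch.nn.functional as F

# Low-resolution output of the base model: (batch, channels, frames, height, width).
video = torch.randn(1, 3, 16, 24, 48)

# Spatial super-resolution: increase height and width, keep the frame count.
spatial_up = F.interpolate(video, scale_factor=(1, 4, 4),
                           mode="trilinear", align_corners=False)
print(spatial_up.shape)   # torch.Size([1, 3, 16, 96, 192])

# Temporal super-resolution: increase the number of frames (the frame rate).
temporal_up = F.interpolate(spatial_up, scale_factor=(2, 1, 1),
                            mode="trilinear", align_corners=False)
print(temporal_up.shape)  # torch.Size([1, 3, 32, 96, 192])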

(05:55): Now that you know how Imagen Video works, you might be dying to try it out yourself. Regrettably, Google hasn’t released the model or source code publicly due to concerns about explicit, violent, and harmful content that could be generated with the model. Because of the sheer scale of natural language scraped from the Internet and then used to train T5 and Imagen Video, it’s difficult to comprehensively filter out problematic data, including data that reinforce social biases or stereotypes against particular groups.

(06:25): Despite our inability to use Imagen Video ourselves, it is nevertheless a staggering development in the fields of natural language processing and creative artificial intelligence. Hopefully forthcoming approaches can resolve the thorniest social issues presented by these models so that we can all benefit from innovations like this.

(06:43): Thanks to Shaan Khosla, a data scientist on my team at my machine learning company Nebula, for inspiring this Five-Minute Friday episode on Imagen Video today by providing a summary of the Imagen Video paper via his Let’s Talk Text Substack newsletter. He uses the newsletter to provide a weekly, easy-to-read summary of a recent key natural language processing paper, and you can subscribe to it for free if that’s something you’re interested in — we’ve provided a link to Shaan’s Substack in the show notes.

(07:11): Ok, that’s it for this episode. Until next time, keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 
