VideoPoet — A large language model for zero-shot video generation

HouseOfCoder
3 min read · Dec 27, 2023

One after another, companies keep launching or announcing new AI (Artificial Intelligence) tools. The latest addition to this list is VideoPoet, announced by Google Research. VideoPoet is not yet available for public use, but here’s what we know so far:

What is VideoPoet?

VideoPoet is a large language model (LLM) capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio.

What does VideoPoet offer?

  • Text-to-video: VideoPoet generates high-motion, variable-length videos from an input text prompt.
  • Video-to-audio: VideoPoet can produce audio that matches a given video without requiring any accompanying text.
  • Image-to-video generation: Given an input image and a guiding text prompt, VideoPoet can animate the image into a video.
  • Zero-shot stylization: VideoPoet can stylize input videos based on a text prompt, ensuring stylistic coherence.
  • Long(er) video generation: By default, VideoPoet generates 2-second videos, but it can produce longer ones by repeatedly taking the last 1 second of video as input and predicting the next 1 second of output. This process can continue indefinitely to create videos of any duration (see the sketch after this list).
  • Zero-shot controllable camera motions: VideoPoet’s pre-training allows for significant customization of high-quality camera motions by specifying desired camera shots in the text prompt.
  • Controllable video editing: VideoPoet can apply different motions to subjects in an existing video, such as making them laugh or dance.
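
To make the chunked extension idea concrete, here is a minimal sketch of how such autoregressive video lengthening could work. Since VideoPoet has no public API, `model.generate_next_second`, the frame rate, and the list-of-frames representation are all hypothetical stand-ins.

```python
# Hypothetical sketch of chunked autoregressive video extension.
# VideoPoet has no public API; `model.generate_next_second` is an assumed
# stand-in for "predict 1 second of video conditioned on the last 1 second".

def extend_video(model, clip, target_seconds, fps=8):
    """Grow `clip` (a list of frames) until it covers `target_seconds`."""
    while len(clip) < target_seconds * fps:
        context = clip[-fps:]                             # last 1 second of frames
        next_chunk = model.generate_next_second(context)  # predict the next 1 second
        clip.extend(next_chunk)
    return clip
```

Because each step conditions only on the most recent second, the loop can run for as many iterations as desired, which is what allows videos of arbitrary duration.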

Technical Details

  • VideoPoet uses a pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer, which transform images, videos, and audio clips of variable length into sequences of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, making it easy to integrate other modalities, such as text (see the sketch below).
[Figure: A detailed look at the VideoPoet task design, showing the training and inference inputs and outputs of various tasks.]
  • An autoregressive language model learns across the video, image, audio, and text modalities to predict the next video or audio token in the sequence.
  • A mixture of multimodal generative learning objectives is introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio.
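
To make the “unified vocabulary” idea a bit more concrete, here is a minimal sketch of how discrete codes from different tokenizers could be packed into one sequence for a single autoregressive transformer. The vocabulary sizes and offsets are illustrative assumptions, not the actual MAGVIT V2 or SoundStream interfaces.

```python
# Illustrative sketch of a unified token vocabulary across modalities.
# The vocabulary sizes and offsets below are assumptions for demonstration;
# this is not the real MAGVIT V2 / SoundStream interface.

TEXT_VOCAB = 32_000    # e.g. ids 0 .. 31_999 for text tokens
VIDEO_VOCAB = 262_144  # discrete codes from the video tokenizer
AUDIO_VOCAB = 4_096    # discrete codes from the audio tokenizer

# Give each modality a non-overlapping id range so a single transformer
# can treat everything as one token sequence.
VIDEO_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VIDEO_VOCAB

def to_unified_sequence(text_ids, video_codes, audio_codes):
    """Concatenate per-modality codes into one shared-vocabulary sequence."""
    return (
        list(text_ids)
        + [VIDEO_OFFSET + c for c in video_codes]
        + [AUDIO_OFFSET + c for c in audio_codes]
    )

# An autoregressive LM then simply predicts the next id in this sequence,
# whichever modality it belongs to.
```

This is why the multi-task objectives above can share one model: once every modality is just a range of ids, text-to-video, image-to-video, and video-to-audio all reduce to next-token prediction over different input/output arrangements.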

A diagram in the announcement illustrates VideoPoet’s capabilities: input images can be animated to produce motion, and (optionally cropped or masked) video can be edited for inpainting or outpainting. For stylization, the model takes in depth and optical-flow maps, which capture the motion of the video, and paints content on top to produce the text-guided style.
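As a rough illustration of that conditioning setup, the sketch below shows how a stylization call might be wired together. Every name in it (`depth_model`, `flow_model`, `model.generate`) is a hypothetical stand-in rather than a real VideoPoet interface.

```python
# Hypothetical wiring for text-guided stylization. All names here are
# illustrative stand-ins; VideoPoet exposes no public API.

def stylize_video(model, depth_model, flow_model, video, prompt):
    """Repaint `video` in the style described by `prompt`, preserving motion."""
    depth = depth_model(video)  # per-frame depth maps (scene structure)
    flow = flow_model(video)    # optical flow between frames (motion)
    # The model keeps the structure and motion from depth + flow, and
    # paints new content on top as described by the text prompt.
    return model.generate(conditioning=(depth, flow), text=prompt)
```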

Videos generated by VideoPoet from various text prompts 🙀

Rookie the Raccoon — An AI Generated movie by VideoPoet

Resources:

https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html
https://sites.research.google/videopoet

If you like this article, make sure to clap 👏 and follow me on Medium for more articles like this. Suggestions in the comments are always welcome :)

As content creation is not an easy process, I wouldn’t mind if you gift me a Ko-fi to motivate me and boost my confidence :)

Written by HouseOfCoder

Web developer by profession. Crafting code and contemplative thoughts. Join me on a journey of tech, life, and mindfulness.
