Authors:
(1) Dan Kondratyuk, Google Research, equal contribution;
(2) Lijun Yu, Google Research and Carnegie Mellon University, equal contribution;
(3) Xiuye Gu, Google Research, equal contribution;
(4) Jose Lezama, Google Research, equal contribution;
(5) Jonathan Huang, Google Research, equal contribution;
(6) Grant Schindler, Google Research;
(7) Rachel Hornung, Google Research;
(8) Vighnesh Birodkar, Google Research;
(9) Jimmy Yan, Google Research;
(10) Krishna Somandepalli, Google Research;
(11) Hassan Akbari, Google Research;
(12) Yair Alon, Google Research;
(13) Yong Cheng, Google DeepMind;
(14) Josh Dillon, Google Research;
(15) Agrim Gupta, Google Research;
(16) Meera Hahn, Google Research;
(17) Anja Hauth, Google Research;
(18) David Hendon, Google Research;
(19) Alonso Martinez, Google Research;
(20) David Minnen, Google Research;
(21) Mikhail Sirotenko, Google Research;
(22) Kihyuk Sohn, Google Research;
(23) Xuan Yang, Google Research;
(24) Hartwig Adam, Google Research;
(25) Ming-Hsuan Yang, Google Research;
(26) Irfan Essa, Google Research;
(27) Huisheng Wang, Google Research;
(28) David A. Ross, Google Research;
(29) Bryan Seybold, Google Research, equal contribution;
(30) Lu Jiang, Google Research, equal contribution.
3. Model Overview and 3.1. Tokenization
3.2. Language Model Backbone and 3.3. Super-Resolution
4. LLM Pretraining for Generation
5. Experiments
5.2. Pretraining Task Analysis
5.3. Comparison with the State-of-the-Art
5.4. LLM’s Diverse Capabilities in Video Generation and 5.5. Limitations
6. Conclusion, Acknowledgements, and References
We present VideoPoet, a model for synthesizing high-quality videos from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive transformer framework. The pretrained LLM serves as a foundation that is adapted to a range of video generation tasks. We present results demonstrating the model’s state-of-the-art capabilities in zero-shot video generation, specifically highlighting the generation of high-fidelity motions.
Recently, there has been a surge of generative video models capable of a variety of video creation tasks. These include text-to-video (Zhang et al., 2023a; Singer et al., 2022), image-to-video (Yu et al., 2023d), video-to-video stylization (Chen et al., 2023b; Chai et al., 2023; Voleti et al., 2022), and video editing (Ceylan et al., 2023; Wang et al., 2023b; Geyer et al., 2023) among other video applications. Most existing models employ diffusion-based methods for video generation. These video models typically start with a pretrained image model, such as Stable Diffusion (Rombach et al., 2022; Podell et al., 2023), that produces high-fidelity images for individual frames, and then fine-tune the model to improve temporal consistency across video frames.
While Large Language Models (LLMs) are commonly used as foundation models across various modalities including language (Brown et al., 2020), code (Li et al., 2023; OpenAI, 2023), audio (Rubenstein et al., 2023), speech (Agostinelli et al., 2023), and robotics (Driess et al., 2023; Zitkovich et al., 2023), the diffusion model remains the predominant approach for video generation. Although early research has demonstrated the effectiveness of LLMs in text-to-image generation, e.g., DALL-E (Ramesh et al., 2022), Parti (Yu et al., 2022), and Ding et al. (2021), and in text-to-video generation, e.g., CogVideo (Hong et al., 2022), language models have not reached a level of quality on par with video diffusion models in tasks like text-to-video generation, as shown in previous studies (Nash et al., 2022; Villegas et al., 2022). In contrast to training exclusively for text-to-video tasks, the generative model of LLMs in the language domain emphasizes a large pretraining stage to learn a foundation (Bommasani et al., 2021) by examining pretraining tasks that extend beyond text-to-video generation.
A notable advantage of employing LLMs in video generation lies in the ease of integrating existing LLM frameworks. This integration allows for reusing LLM infrastructure and leverages the optimizations our community has developed over many years for LLMs, including optimizations in learning recipes for model scaling (Brown et al., 2020; Chowdhery et al., 2022), training and inference infrastructure (Du et al., 2022), and hardware, among other advancements. This is coupled with the flexibility of LLMs in encoding many diverse tasks in the same model (Raffel et al., 2020), which stands in contrast to most diffusion models, where architectural changes and adapter modules are the dominant approach used to adapt the model to more diverse tasks (Zhang et al., 2023b).
In this paper, we exploit language models for video generation, following the canonical training protocols of LLMs in the language domain. We introduce VideoPoet, a language model for video generation. VideoPoet employs a decoder-only LLM architecture (Anil et al., 2023; OpenAI, 2023) that admits image, video, and audio modalities as discrete tokens, each produced by its respective tokenizer.
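As a rough illustration of how discrete tokens from several tokenizers can share one decoder-only sequence, the sketch below shifts each modality's token ids into a disjoint range of a single vocabulary and concatenates them. The codebook sizes, special tokens, and the `pack` helper are all hypothetical and chosen only for illustration; this section does not specify VideoPoet's actual vocabulary layout.

```python
from typing import Dict, List, Tuple

# Hypothetical codebook sizes, one per modality tokenizer (illustrative only).
VOCAB_SIZES: Dict[str, int] = {"image": 1024, "video": 4096, "audio": 2048}

# Hypothetical special tokens occupying the first ids of the shared vocabulary.
SPECIAL: Dict[str, int] = {"<bos>": 0, "<eos>": 1}


def build_offsets(vocab_sizes: Dict[str, int], num_special: int) -> Tuple[Dict[str, int], int]:
    """Assign each modality a disjoint id range placed after the special tokens."""
    offsets: Dict[str, int] = {}
    start = num_special
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start is now the total shared vocabulary size


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES, len(SPECIAL))


def pack(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Concatenate per-modality token ids into one sequence over the shared vocabulary."""
    sequence = [SPECIAL["<bos>"]]
    for modality, tokens in segments:
        size = VOCAB_SIZES[modality]
        for token in tokens:
            if not 0 <= token < size:
                raise ValueError(f"token {token} is outside the {modality} codebook")
            sequence.append(OFFSETS[modality] + token)
    sequence.append(SPECIAL["<eos>"])
    return sequence


sequence = pack([("image", [5, 17]), ("video", [0, 4095]), ("audio", [7])])
print(sequence)
```

Because the id ranges do not overlap, a single embedding table and a single output softmax can cover every modality, which is one plausible reason a standard LLM stack can be reused without architectural changes.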
The training process of VideoPoet consists of two stages: (1) pretraining and (2) task-adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal pretraining objectives within an autoregressive transformer framework. After pretraining, the model functions as a versatile multitask video generation model for tasks such as text-to-video, image-to-video, video editing, and video-to-video stylization. These capabilities are inherently integrated into a single LLM, rather than relying on a separate generative model controlled by text prompts (Tang et al., 2023). During subsequent task-adaptation, the pretrained model can be further fine-tuned either to enhance its generation quality on the training tasks or to perform new tasks.
Experiments show VideoPoet’s state-of-the-art capabilities in generating videos with large and high-fidelity motions. Through the powerful capabilities of the transformer architecture, VideoPoet can be straightforwardly trained on a multi-task, multimodal generative objective, allowing for generating consistent and realistic motion driven by text or other prompts. Furthermore, VideoPoet can synthesize coherent long videos of up to 10 seconds by autoregressively extending the content, conditioned on the last second of the generated video.
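The long-video procedure described above can be sketched as a simple loop: keep the last second of what has been generated so far, ask the model for a continuation, append it, and repeat until the target length is reached. In the sketch below, `extend_video`, `generate_continuation`, and the integer "frames" are hypothetical stand-ins; the real model operates on video tokens, and the frame rate here is an arbitrary assumption.

```python
from typing import Callable, List

Frame = int  # hypothetical stand-in for a frame (or its tokens)


def extend_video(
    first_clip: List[Frame],
    generate_continuation: Callable[[List[Frame]], List[Frame]],
    fps: int,
    target_seconds: int,
    context_seconds: int = 1,
) -> List[Frame]:
    """Autoregressively extend a clip, conditioning each step on its final second."""
    video = list(first_clip)
    target_frames = fps * target_seconds
    context_frames = fps * context_seconds
    while len(video) < target_frames:
        context = video[-context_frames:]  # only the last second is visible to the model
        new_frames = generate_continuation(context)
        if not new_frames:
            raise ValueError("generator returned no frames")
        video.extend(new_frames)
    return video[:target_frames]  # trim any overshoot from the final step


def toy_generator(context: List[Frame]) -> List[Frame]:
    """Toy model: continues counting from the last context frame for 8 frames."""
    last = context[-1]
    return [last + i for i in range(1, 9)]


# Assumed 8 frames per second, extended from a 1-second clip to 10 seconds.
clip = extend_video(list(range(8)), toy_generator, fps=8, target_seconds=10)
print(len(clip))
```

Conditioning on a fixed-size window keeps the cost of each step constant regardless of how long the video has become, at the price of the model not seeing anything earlier than that window.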
We also demonstrate that VideoPoet is capable of zero-shot video generation. We use the term “zero-shot video generation” as VideoPoet processes new text, image, or video inputs that diverge from the training data distribution. Furthermore, VideoPoet handles new tasks not included in its training. For example, VideoPoet is able to perform new editing tasks by sequentially chaining training tasks together. The main contributions of this work are:
• A method for training a Large Language Model (LLM) specifically for video generation tasks, utilizing tokenized video data that incorporates both text-paired and unpaired video data.
• An approach to video super-resolution that increases spatial resolution within the latent token space using a bidirectional transformer with efficient windowed local attention.
• Evaluations and demonstrations to highlight VideoPoet’s competitive and state-of-the-art performance, especially in generating realistic and interesting videos with motion.
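To make the idea of windowed local attention in the second contribution concrete, the following sketch builds a bidirectional attention mask in which each token attends only to tokens in its own window. It is a deliberately simplified one-dimensional version with non-overlapping windows; the function name and window layout are assumptions for illustration, not the paper's super-resolution design.

```python
from typing import List


def window_attention_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query q may attend to key k.

    Tokens are grouped into consecutive non-overlapping windows. Attention is
    bidirectional inside a window and blocked across windows.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        [(q // window) == (k // window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = window_attention_mask(seq_len=6, window=3)
allowed_pairs = sum(sum(row) for row in mask)
print(allowed_pairs, "of", 6 * 6, "query-key pairs are attended")
```

With a fixed window size, the number of attended pairs grows linearly with sequence length rather than quadratically, which is the efficiency argument for local attention on long token grids.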
This paper is available on arXiv under a CC BY 4.0 DEED license.