From the 爱可可 AI frontier digest

[CV] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

J Z Wu, Y Ge, X Wang, W Lei, Y Gu, W Hsu, Y Shan, X Qie, M Z Shou
[National University of Singapore & ARC Lab, Tencent PCG]


Key points:

  1. Proposes a new problem, one-shot video generation, which removes the burden of training text-to-video (T2V) generators on large-scale text-video datasets;

  2. Pretrained text-to-image (T2I) models exhibit intriguing properties for T2V generation: videos can be generated from text prompts via efficient one-shot tuning of a pretrained T2I diffusion model (a toy sketch of this tuning pattern follows the list);

  3. Tune-A-Video can generate temporally coherent videos with customized attributes, subjects, places, and more.
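Taken together, points 1 and 2 reduce training to a short fine-tuning loop on a single video: freeze almost all of the pretrained T2I weights, update only a small parameter subset, and minimize the standard diffusion noise-prediction loss. The toy sketch below illustrates only that pattern; `ToyDenoiser`, its shapes, and the simplified noising are placeholders, not the paper's model.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-in for the pretrained T2I denoiser; the real method inflates a
# text-to-image U-Net (e.g. Stable Diffusion) to video. All names, shapes,
# and the simplified noising below are illustrative assumptions.
class ToyDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)  # stands in for frozen pretrained weights
        self.to_q = nn.Linear(dim, dim)      # stands in for the small tuned subset

    def forward(self, x, t):
        return self.backbone(x + t) + self.to_q(x)

video_latents = torch.randn(1, 8, 64)        # the single training video: 8 frames
model = ToyDenoiser()

# One-shot tuning: freeze the bulk of the network and optimize only a
# small parameter subset against the usual noise-prediction objective.
for p in model.backbone.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(model.to_q.parameters(), lr=1e-4)

for step in range(500):
    t = torch.rand(1, 1, 1)                  # random timestep (toy schedule)
    noise = torch.randn_like(video_latents)
    noisy = video_latents + t * noise        # simplified forward diffusion
    loss = F.mse_loss(model(noisy, t), noise)
    opt.zero_grad(); loss.backward(); opt.step()
```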

Abstract:

To replicate the success of text-to-image (T2I) generation, recent work on text-to-video (T2V) generation fine-tunes on large-scale text-video datasets. However, this paradigm is computationally expensive. Humans have the remarkable ability to learn new visual concepts from just a single exemplar. We therefore study a new T2V generation problem, One-Shot Video Generation, in which only a single text-video pair is used to train an open-domain T2V generator. Intuitively, we propose to adapt a T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models can generate images that align well with verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video produces temporally coherent videos across various applications, such as changing the subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of the method.

Paper: https://arxiv.org/abs/2212.11565
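The Sparse-Causal Attention named in the abstract can be pictured as cross-frame attention: each frame's queries attend only to keys/values drawn from the first frame (a content anchor) and the immediately preceding frame (a motion cue). The PyTorch sketch below illustrates just that key/value selection pattern; the module name, tensor layout, and head handling are assumptions for illustration, not the released Tune-A-Video code.

```python
import math
import torch
from torch import nn

class SparseCausalAttention(nn.Module):
    """Sketch of sparse-causal cross-frame attention: queries come from the
    current frame, while keys/values come only from the first frame (content
    anchor) and the previous frame (motion cue). Layout and names are
    illustrative assumptions, not the authors' implementation."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dh = heads, dim // heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) latent patch tokens per frame
        b, f, n, d = x.shape
        first = x[:, :1].expand(b, f, n, d)             # frame 0, repeated for every i
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)  # frame i-1 (frame 0 when i=0)
        kv = torch.cat([first, prev], dim=2)            # (b, f, 2n, d)

        def split_heads(t, length):  # (b, f, L, d) -> (b*f, heads, L, dh)
            return t.reshape(b * f, length, self.heads, self.dh).transpose(1, 2)

        q = split_heads(self.to_q(x), n)
        k = split_heads(self.to_k(kv), 2 * n)
        v = split_heads(self.to_v(kv), 2 * n)

        # Standard scaled dot-product attention over the sparse key/value set
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.dh), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, f, n, d)
        return self.to_out(out)

x = torch.randn(2, 8, 16, 64)                # 8 frames of 16 tokens, dim 64
print(SparseCausalAttention(dim=64)(x).shape)  # torch.Size([2, 8, 16, 64])
```

Restricting attention to two frames keeps the cost per frame independent of clip length, while the fixed frame-0 anchor is what provides the content consistency the abstract highlights.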
