- Abstract: ControlNets are widely used to add spatial control to image generation under different conditions, such as depth maps, Canny edges, and human poses. However, several challenges arise when leveraging pretrained image ControlNets for controlled video generation. First, a pretrained ControlNet cannot be directly plugged into a new backbone model because of feature-space mismatch, and the cost of training ControlNets for new backbones is a heavy burden. Second, ControlNet features computed independently for different frames may not handle temporal consistency effectively. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion model by adapting pretrained ControlNets (and improving temporal alignment for videos). Ctrl-Adapter provides diverse capabilities, including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing. In Ctrl-Adapter, we train adapter layers that fuse pretrained ControlNet features into different image/video diffusion models, while keeping the parameters of both the ControlNets and the diffusion models frozen. Ctrl-Adapter consists of temporal as well as spatial modules, so it can effectively handle the temporal consistency of videos. We also propose latent skipping and inverse timestep sampling for robust adaptation and sparse control. Moreover, Ctrl-Adapter enables control from multiple conditions by simply taking the (weighted) average of ControlNet outputs. Across diverse image/video diffusion backbones (SDXL, Hotshot-XL, I2VGen-XL, and SVD), Ctrl-Adapter matches ControlNet for image control and outperforms all baselines for video control (achieving SOTA accuracy on the DAVIS 2017 dataset) with significantly lower computational cost (less than 10 GPU hours).
- Problem addressed: controllable image and video generation by adapting pretrained ControlNets (the Ctrl-Adapter framework).
- Key idea: The paper proposes Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion model by adapting pretrained ControlNets and improving temporal alignment for videos (see the adapter-fusion sketch after this list).
- Other highlights: Ctrl-Adapter provides diverse capabilities, including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing. The framework consists of temporal and spatial modules to handle the temporal consistency of videos. The paper also proposes latent skipping and inverse timestep sampling for robust adaptation and sparse control. Ctrl-Adapter matches ControlNet for image control and outperforms all baselines for video control (achieving SOTA accuracy on the DAVIS 2017 dataset) with significantly lower computational cost, at less than 10 GPU hours (see the multi-condition averaging sketch after this list).
- Related work: prior research on image and video generation using ControlNets, as well as other approaches such as GANs and VAEs. Relevant papers include 'Controllable Person Image Synthesis with Attribute-Decomposed GAN' and 'Video Generation from Text'.
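
The key idea above, trainable adapter layers with spatial and temporal modules that fuse frozen ControlNet features into a frozen diffusion backbone, can be sketched in PyTorch. This is a minimal illustration under assumed shapes, not the authors' exact architecture: the module names (`CtrlAdapterBlock`, `SpatialModule`, `TemporalModule`) and the conv-based mixing are hypothetical stand-ins.

```python
# Minimal sketch of the Ctrl-Adapter idea: a frozen, pretrained ControlNet
# produces per-frame control features; small trainable adapter blocks
# (spatial + temporal) map them into the feature space of a frozen video
# diffusion UNet. Names and shapes are illustrative, not the paper's exact design.
import torch
import torch.nn as nn


class SpatialModule(nn.Module):
    """Per-frame spatial mixing (illustrative: residual conv over H x W)."""
    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, channels, height, width)
        return x + self.conv(self.norm(x))


class TemporalModule(nn.Module):
    """Mixing across frames so adapted features stay temporally consistent
    (illustrative: residual 1D conv along the time axis per spatial location)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        bf, c, h, w = x.shape
        b = bf // num_frames
        # (b*f, c, h, w) -> (b*h*w, c, f) so the conv runs over frames
        t = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 2, 1).reshape(-1, c, num_frames)
        t = t + self.conv(t)
        return t.view(b, h, w, c, num_frames).permute(0, 4, 3, 1, 2).reshape(bf, c, h, w)


class CtrlAdapterBlock(nn.Module):
    """Trainable adapter: projects frozen ControlNet features into the
    (frozen) diffusion UNet's feature space."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.spatial = SpatialModule(in_channels)
        self.temporal = TemporalModule(in_channels)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, control_feat: torch.Tensor, num_frames: int) -> torch.Tensor:
        h = self.spatial(control_feat)
        h = self.temporal(h, num_frames)
        return self.proj(h)


# Usage sketch: only the adapter is trained; ControlNet and UNet stay frozen.
# `controlnet_feat` stands in for a pretrained image ControlNet run per frame;
# `unet_feat` for an intermediate activation of the video diffusion backbone.
num_frames = 8
adapter = CtrlAdapterBlock(in_channels=320, out_channels=640)
controlnet_feat = torch.randn(2 * num_frames, 320, 32, 32)  # (batch*frames, C, H, W)
unet_feat = torch.randn(2 * num_frames, 640, 32, 32)
fused = unet_feat + adapter(controlnet_feat, num_frames)    # inject control signal
print(fused.shape)  # torch.Size([16, 640, 32, 32])
```

Because gradients flow only through the adapter, training stays cheap (the abstract reports under 10 GPU hours), while the pretrained ControlNet and diffusion backbone are reused unchanged.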
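
The multi-condition capability mentioned in the highlights is, per the abstract, just a (weighted) average of ControlNet outputs. Below is a minimal sketch of that combination step, assuming each condition's ControlNet-plus-adapter pipeline has already produced feature tensors of the same shape; the helper `combine_conditions` is a hypothetical name, not from the paper.

```python
# Sketch of multi-condition control: average the per-condition control
# features, optionally with user-chosen weights.
from typing import List, Optional
import torch


def combine_conditions(features: List[torch.Tensor],
                       weights: Optional[List[float]] = None) -> torch.Tensor:
    """Weighted average of per-condition control features.

    features: one adapted ControlNet feature map per condition
              (e.g. depth, Canny edge, human pose), all the same shape.
    weights:  optional per-condition weights; defaults to a uniform average.
    """
    if weights is None:
        weights = [1.0 / len(features)] * len(features)
    assert len(weights) == len(features)
    stacked = torch.stack(features)                      # (num_conditions, ...)
    w = torch.tensor(weights, dtype=stacked.dtype, device=stacked.device)
    w = w / w.sum()                                      # normalize weights
    return (w.view(-1, *([1] * (stacked.dim() - 1))) * stacked).sum(dim=0)


# Example: blend depth and edge control features, favoring depth.
depth_feat = torch.randn(16, 640, 32, 32)
edge_feat = torch.randn(16, 640, 32, 32)
blended = combine_conditions([depth_feat, edge_feat], weights=[0.7, 0.3])
print(blended.shape)  # torch.Size([16, 640, 32, 32])
```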