模型地址:https://modelscope.cn/models/damo/text-to-video-synthesis/summary

模型描述

该文本到视频生成扩散模型由文本特征提取、文本特征到视频隐空间扩散模型、视频隐空间到视频视觉空间这3个子网络组成,整体模型参数约17亿。支持英文输入。扩散模型采用Unet3D结构,通过从纯高斯噪声视频中,迭代去噪的过程,实现视频生成的功能。

代码示例

from huggingface_hub import snapshot_download

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

model_dir = pathlib.Path('weights')
snapshot_download('damo-vilab/modelscope-damo-text-to-video-synthesis',
                   repo_type='model', local_dir=model_dir)

pipe = pipeline('text-to-video-synthesis', model_dir.as_posix())
test_text = {
        'text': 'A panda eating bamboo on a rock.',
    }
output_video_path = pipe(test_text,)[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)

部分示例

A goldendoodle playing in a park by a lake.
A goldendoodle playing in a park by a lake.

 

A panda bear driving a car.
A panda bear driving a car.

 

相关论文

@inproceedings{rombach2022high,
  title={High-resolution image synthesis with latent diffusion models},
  author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10684--10695},
  year={2022}
}

@inproceedings{luo2023videofusion,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}