CustomVideo

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

¹The Chinese University of Hong Kong ²Huawei Noah Ark's Lab ³The University of Hong Kong

Abstract

Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches designed for single subjects suffer from tackling multiple subjects, which is a more challenging and practical scenario. In this work, we aim to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of diffusion model. Moreover, to help the model focus on the specific object area, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 different categories and 68 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method, compared with the previous state-of-the-art approaches.

Method

We propose a simple yet effective co-occurrence and attention control mechanism with mask guidance to preserve the the fidelity of subjects for multi-subject driven text-to-video generation. During the training stage, only the key and value weights in the cross-attention layers are fine-tuned, together with a learnable word token for every subject. In the inference stage, given a text prompt integrated with learned word tokens, we can easily obtain high-quality videos with specific subjects.

Comparison with SOTA

cat*

CustomVideo

DreamBooth[1]

CustomDiffusion[2]

dog*

VideoDreamer[3]

DisenDiff[4]+SVD-XT[5]

DreamVideo[6]

a cat* and a dog* sitting on the beach with a view of seashore

car*

CustomVideo

DreamBooth[1]

CustomDiffusion[2]

barn*

VideoDreamer[3]

DisenDiff[4]+SVD-XT[5]

DreamVideo[6]

a front view of a car* stopping in front of a barn*

Qualitative results of our CustomVideo with comparison to SOTA methods. We can observe that our CustomVideo
can generate videoswith much better fidelity of subjects compared with previous SOTA methods.

Two-subject T2V Customization

a cat* and a dog* walking in the Times Square

an anime character* playing a guitar* in front of Tower Bridge

a toy bear* riding a bike* on the beach

a front view of a woodenpot* with a flower* in it on beach with a view of seashore

a cat* playing with a dog*

a lamp* shining on a sofa*

a monster* and a motorbike* on the beach with a view of seashore

Three-subject T2V Customization

a cat* sitting on the grass, wearing sunglasses, with a lighthouse in the background

with a pyramid* in the background, a dog* sitting on the grass, wearing sunglasses*

References

[1] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation CVPR, 2023.

[2] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion CVPR, 2023.

[3] Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, and Wenwu Zhu. Videodreamer: Customized multi-subject text-to-video generation with disen-mix finetuning. arXiv, 2023.

[4] Yanbing Zhang, Mengping Yang, Qin Zhou, and Zhe Wang. Attention calibration for disentangled text-to-image personalization. CVPR, 2024

[5] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. CVPR, 2024.

[6] Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. CVPR, 2024.

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Overview of Comparison and Generated Videos

Abstract

Method

Comparison with SOTA

Two-subject T2V Customization

an anime character* playing a guitar* in front of Tower Bridge

a toy bear* riding a bike* on the beach

a front view of a woodenpot* with a flower* in it on beach with a view of seashore

a cat* playing with a dog*

a lamp* shining on a sofa*

a monster* and a motorbike* on the beach with a view of seashore

Three-subject T2V Customization

a cat* sitting on the grass, wearing sunglasses*, with a lighthouse* in the background

with a pyramid* in the background, a dog* sitting on the grass, wearing sunglasses*

References

a cat* sitting on the grass, wearing sunglasses, with a lighthouse in the background