Large-scale Text-to-Video (T2V) diffusion models struggle to fully grasp complex compositional interactions between multiple concepts and actions. This issue arises when certain words dominate the final video, overshadowing the other concepts.
To tackle this problem, we introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly. At its core, Vico analyzes how input tokens influence the generated video, and adjusts the model to prevent any single concept from dominating. We apply our method to multiple diffusion-based video models for compositional T2V and video editing. Empirical results demonstrate that our framework significantly enhances the compositional richness and accuracy of the generated videos.
We modify the original text-to-video diffusion model so that the flow of information from text to video is equalized. This involves two steps (see the sketch after the list below):
1. Denoising Step: the diffusion model performs a normal denoising step.
2. Equalization Step: the flow value of each text token is equalized by treating the attention maps as a graph.
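As a rough sketch of how these two steps could alternate at every sampling timestep (this is not the released implementation): `unet`, `scheduler`, `return_attention`, the balance loss, and `compute_token_flows` are illustrative placeholders, with a toy version of `compute_token_flows` sketched further below.

```python
import torch

def sample_with_flow_equalization(unet, scheduler, text_emb, latents, eq_lr=0.1):
    """Hypothetical sketch of the two-step loop (names/signatures are assumptions)."""
    for t in scheduler.timesteps:
        latents = latents.detach().requires_grad_(True)

        # 1. Denoising Step: a normal forward pass of the video diffusion model,
        #    additionally returning its attention maps.
        noise_pred, attn_maps = unet(latents, t, text_emb, return_attention=True)

        # 2. Equalization Step: score each text token by its flow to the video
        #    (attention treated as a graph) and nudge the latent so that no
        #    token's flow lags far behind the others.
        flows = compute_token_flows(attn_maps)            # one value per text token
        loss = (flows.mean() - flows).clamp(min=0).sum()  # penalize under-represented tokens
        grad, = torch.autograd.grad(loss, latents)
        latents = latents - eq_lr * grad

        # Complete the reverse-diffusion update with the adjusted latent.
        with torch.no_grad():
            latents = scheduler.step(noise_pred, t, latents.detach()).prev_sample
    return latents
```

In this sketch the adjustment is applied to the noisy latent at sampling time; other choices, such as adjusting the text embeddings, would fit the same two-step structure, and the loss above is only one possible balance objective.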
Cross-Attention only captures the relationship between the text and each frame independently, which leads to flickering patterns.
Our ST-Flow instead captures the relationship between the text and the full video.
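For intuition only, here is one simplified way such a spatio-temporal flow value could be computed: propagate each text token's cross-attention through the spatio-temporal self-attention of the video (a rollout-style proxy), rather than reading each frame's cross-attention map in isolation. This is a stand-in under stated assumptions, not the exact flow computation in the paper; `attn_maps` and its keys are hypothetical.

```python
import torch

def compute_token_flows(attn_maps):
    """Toy ST-Flow proxy (illustrative only). Assumes:
      attn_maps["cross"]   : (num_video_tokens, num_text_tokens) cross-attention
      attn_maps["st_self"] : list of (num_video_tokens, num_video_tokens)
                             spatio-temporal self-attention matrices."""
    # Start from how much each video token (across ALL frames) attends to each text token.
    flow = attn_maps["cross"]
    # Each spatio-temporal self-attention layer redistributes this mass across
    # space and time, linking frames together instead of treating them independently.
    for self_attn in attn_maps["st_self"]:
        flow = self_attn @ flow
    # Aggregate over all spatio-temporal positions: one flow value per text token.
    return flow.sum(dim=0)
```

Equalizing these per-token values, as in the sampling loop above, encourages every concept in the prompt to keep a comparable influence on the whole video.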
Visualization: Cross-Attention vs. our ST-Flow for the prompt "A playful kitten chasing a butterfly in a wildflower meadow.", shown on the tokens "kitten" and "meadow".
Baseline vs. +Vico comparisons on multi-object prompts: "Clown fish swimming through the coral reef.", "a boat and an airplane.", "a toaster and a teddy bear.", and "a pizza and a tie."
Baseline vs. +Vico comparisons on multi-action prompts: "a boat is sailing and a flag is waving.", "a goat is climbing and a kid is jumping.", "a crocodile is swimming and a bird is flying.", and "a child is laughing and a dog is wagging."
We particularly thank the great works on compositional generation and text-to-video generation that inspired this project.
If you find our work useful, please consider citing:
@misc{yang2024compositional,
  title={Compositional Video Generation as Flow Equalization},
  author={Xingyi Yang and Xinchao Wang},
  year={2024},
  eprint={2407.06182},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}