Compositional Video Generation as Flow Equalization

National University of Singapore

We introduce Vico, a model-agnostic and training-free framework for compositional video generation.

Vico Results

Vico provides a unified solution for compositional video generation by equalizing the information flow of all text tokens.

Abstract

Large-scale Text-to-Video (T2V) diffusion models struggle to fully grasp complex compositional interactions between multiple concepts and actions. This issue arises when some words dominantly influence the final video, overshadowing other concepts.

To tackle this problem, we introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly. At its core, Vico analyzes how input tokens influence the generated video, and adjusts the model to prevent any single concept from dominating. We apply our method to multiple diffusion-based video models for compositional T2V and video editing. Empirical results demonstrate that our framework significantly enhances the compositional richness and accuracy of the generated videos.

What have we done?

We modify the original text-to-video diffusion model so that the flow of information from text to video is equalized. This involves two steps:
1. Denoising Step: the diffusion model performs normal denoising.
2. Equalization Step: the flow value of each token is equalized by treating attention as a graph.
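The two steps above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the Vico implementation: `token_flows` uses averaged softmax attention as a stand-in for the flow value, `toy_denoising_step` replaces the real diffusion model, and the equalization is assumed to be a gradient update on the latent that lowers the variance of the per-token flows. All function names here are hypothetical.

```python
import math

def token_flows(latent, token_emb):
    """Toy per-token influence (stand-in for flow): softmax attention from
    each latent position to the text tokens, averaged over positions."""
    totals = [0.0] * len(token_emb)
    for position in latent:
        logits = [sum(p * e for p, e in zip(position, emb)) for emb in token_emb]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(value - peak) for value in logits]
        norm = sum(weights)
        for k, w in enumerate(weights):
            totals[k] += w / norm
    return [t / len(latent) for t in totals]

def imbalance(latent, token_emb):
    """Variance of the per-token flows; zero when every token has equal flow."""
    flows = token_flows(latent, token_emb)
    mean = sum(flows) / len(flows)
    return sum((f - mean) ** 2 for f in flows) / len(flows)

def numerical_grad(f, latent, eps=1e-5):
    """Central finite differences; fine for a toy, far too slow for real latents."""
    grad = [[0.0] * len(row) for row in latent]
    for i, row in enumerate(latent):
        for j in range(len(row)):
            bumped = [list(r) for r in latent]
            bumped[i][j] = row[j] + eps
            up = f(bumped)
            bumped[i][j] = row[j] - eps
            down = f(bumped)
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def equalization_step(latent, token_emb, lr=10.0):
    """One gradient step on the latent, halving lr until imbalance does not rise."""
    def loss(z):
        return imbalance(z, token_emb)
    before = loss(latent)
    grad = numerical_grad(loss, latent)
    for _ in range(20):
        candidate = [[x - lr * g for x, g in zip(row, grad_row)]
                     for row, grad_row in zip(latent, grad)]
        if loss(candidate) <= before:
            return candidate
        lr *= 0.5
    return latent  # no improving step found; leave the latent unchanged

def toy_denoising_step(latent, target, rate=0.2):
    """Placeholder for the model's denoiser: move the latent toward a target."""
    return [[x + rate * (t - x) for x, t in zip(row, target_row)]
            for row, target_row in zip(latent, target)]

def sample(latent, target, token_emb, steps=5):
    for _ in range(steps):
        latent = toy_denoising_step(latent, target)    # 1. denoising step
        latent = equalization_step(latent, token_emb)  # 2. equalization step
    return latent
```

The step-halving in `equalization_step` is a design choice for this sketch only: it guarantees the imbalance never increases within a single equalization step, which keeps the toy well behaved without tuning a learning rate.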

Why Flow? Why not use Cross-Attention alone?

Cross-Attention only captures the relationship between the text and each frame independently, which leads to flickering patterns.
ST-Flow captures the relationship between the text and the full video.
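To make the contrast concrete, here is a small sketch that treats attention as a graph and compares a token's direct cross-attention weight with the flow it can push through to the whole video. It is an illustration only: the exact definition of ST-Flow is not given on this page, so a classic max-flow (Edmonds-Karp) is used as a stand-in, and the graph, node names, and weights in `TOY_GRAPH` are invented for the example.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity graph."""
    # Copy the graph so the caller's capacities are not modified.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0.0)  # reverse edges
    residual.setdefault(source, {})
    residual.setdefault(sink, {})

    total = 0.0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total

        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Hypothetical two-layer attention graph (all weights invented):
# text tokens -> per-frame patches (cross-attention),
# per-frame patches -> output patches (attention across frames),
# output patches -> a single "video" sink.
TOY_GRAPH = {
    "kitten": {"frame0": 0.75, "frame1": 0.25},
    "meadow": {"frame0": 0.25, "frame1": 0.25},
    "frame0": {"out0": 0.25, "out1": 0.25},
    "frame1": {"out0": 0.5, "out1": 0.5},
    "out0": {"video": 1.0},
    "out1": {"video": 1.0},
}

def direct_cross_attention(graph, token):
    """Sum of a token's first-layer weights, ignoring everything downstream."""
    return sum(graph[token].values())

for token in ("kitten", "meadow"):
    print(token,
          "direct:", direct_cross_attention(TOY_GRAPH, token),
          "flow:", max_flow(TOY_GRAPH, token, "video"))
```

In this toy graph, "kitten" has a direct cross-attention weight of 1.0 but a flow of only 0.75, because `frame0` can pass on at most 0.5 downstream; "meadow" scores 0.5 on both measures. A per-frame attention reading would therefore overstate how much "kitten" reaches the final video, which is the kind of gap a flow over the full graph is meant to expose.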

Figure: Cross-Attention vs. ST-Flow maps for the prompt "A playful kitten chasing a butterfly in a wildflower meadow." Panels: Cross-Attn on kitten, Our ST-Flow on kitten, Cross-Attn on meadow, Our ST-Flow on meadow.

Object Composition

Video comparison (Baseline vs. +Vico) for the prompts:
Clown fish swimming through the coral reef.
a boat and an airplane.
a toaster and a teddy bear.
a pizza and a tie.

Motion Composition

Video comparison (Baseline vs. +Vico) for the prompts:
a boat is sailing and a flag is waving.
a goat is climbing and a kid is jumping.
a crocodile is swimming and a bird is flying.
a child is laughing and a dog is wagging.

BibTeX

@misc{yang2024compositional,
      title={Compositional Video Generation as Flow Equalization},
      author={Xingyi Yang and Xinchao Wang},
      year={2024},
      eprint={2407.06182},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }