Large-scale Text-to-Video (T2V) diffusion models struggle to fully grasp complex compositional interactions between multiple concepts and actions. This issue arises when certain words dominate the final video, overshadowing the other concepts.
To tackle this problem, we introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly. At its core, Vico analyzes how input tokens influence the generated video, and adjusts the model to prevent any single concept from dominating. We apply our method to multiple diffusion-based video models for compositional T2V and video editing. Empirical results demonstrate that our framework significantly enhances the compositional richness and accuracy of the generated videos.
We modify the original text-to-video diffusion model so that the flow of information from text to video is equalized. This involves two steps (see the sketch after the list below):
1. Denoising Step: the diffusion model performs a normal denoising step.
2. Equalization Step: the flow value of each text token is equalized by treating the attention maps as a graph.
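As a rough sketch of how these two steps could alternate at every sampling timestep (this is not the released implementation): `unet`, `scheduler`, `return_attention`, the balance loss, and `compute_token_flows` are illustrative placeholders, with a toy version of `compute_token_flows` sketched further below.

```python
import torch

def sample_with_flow_equalization(unet, scheduler, text_emb, latents, eq_lr=0.1):
    """Hypothetical sketch of the two-step loop (names/signatures are assumptions)."""
    for t in scheduler.timesteps:
        latents = latents.detach().requires_grad_(True)

        # 1. Denoising Step: a normal forward pass of the video diffusion model,
        #    additionally returning its attention maps.
        noise_pred, attn_maps = unet(latents, t, text_emb, return_attention=True)

        # 2. Equalization Step: score each text token by its flow to the video
        #    (attention treated as a graph) and nudge the latent so that no
        #    token's flow lags far behind the others.
        flows = compute_token_flows(attn_maps)            # one value per text token
        loss = (flows.mean() - flows).clamp(min=0).sum()  # penalize under-represented tokens
        grad, = torch.autograd.grad(loss, latents)
        latents = latents - eq_lr * grad

        # Complete the reverse-diffusion update with the adjusted latent.
        with torch.no_grad():
            latents = scheduler.step(noise_pred, t, latents.detach()).prev_sample
    return latents
```

In this sketch the adjustment is applied to the noisy latent at sampling time; other choices, such as adjusting the text embeddings, would fit the same two-step structure, and the loss above is only one possible balance objective.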
Cross-Attention only captures the relationship between the text and each frame independently, which leads to flickering patterns.
Our ST-Flow instead captures the relationship between the text and the full video.
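For intuition only, here is one simplified way such a spatio-temporal flow value could be computed: propagate each text token's cross-attention through the spatio-temporal self-attention of the video (a rollout-style proxy), rather than reading each frame's cross-attention map in isolation. This is a stand-in under stated assumptions, not the exact flow computation in the paper; `attn_maps` and its keys are hypothetical.

```python
import torch

def compute_token_flows(attn_maps):
    """Toy ST-Flow proxy (illustrative only). Assumes:
      attn_maps["cross"]   : (num_video_tokens, num_text_tokens) cross-attention
      attn_maps["st_self"] : list of (num_video_tokens, num_video_tokens)
                             spatio-temporal self-attention matrices."""
    # Start from how much each video token (across ALL frames) attends to each text token.
    flow = attn_maps["cross"]
    # Each spatio-temporal self-attention layer redistributes this mass across
    # space and time, linking frames together instead of treating them independently.
    for self_attn in attn_maps["st_self"]:
        flow = self_attn @ flow
    # Aggregate over all spatio-temporal positions: one flow value per text token.
    return flow.sum(dim=0)
```

Equalizing these per-token values, as in the sampling loop above, encourages every concept in the prompt to keep a comparable influence on the whole video.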
Visualization: Cross-Attention vs. our ST-Flow for the prompt "A playful kitten chasing a butterfly in a wildflower meadow.", shown on the tokens "kitten" and "meadow".
Baseline vs. +Vico comparisons on multi-object prompts: "Clown fish swimming through the coral reef.", "a boat and an airplane.", "a toaster and a teddy bear.", and "a pizza and a tie."
Baseline vs. +Vico comparisons on multi-action prompts: "a boat is sailing and a flag is waving.", "a goat is climbing and a kid is jumping.", "a crocodile is swimming and a bird is flying.", and "a child is laughing and a dog is wagging."
We particularly thank the great works on compositional generation and text-to-video generation that inspired this project.
If you find our work useful, please consider citing:
@misc{yang2024compositional,
  title={Compositional Video Generation as Flow Equalization},
  author={Xingyi Yang and Xinchao Wang},
  year={2024},
  eprint={2407.06182},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}