
Papers Read on AI

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Wed May 08 2024
Diffusion models, Content generation, StoryDiffusion, Consistency, Controllability, Text-to-image generation, Video generation, Transition video generation, Self-attention, Semantic motion predictor

Description

Diffusion models have shown strong potential in content generation but struggle to maintain consistency across the images and videos generated for storytelling. A new way of calculating self-attention, Consistent Self-Attention, is proposed to enhance consistency across generated images without additional training. The framework StoryDiffusion combines Consistent Self-Attention with a Semantic Motion Predictor to generate stable, subject-consistent images or videos for storytelling. The approach aims to provide controllability over the generated content while keeping characters' identity and attire consistent.

Insights

Diffusion models have shown potential in content generation

Diffusion models have gained popularity and shown strong performance in domains such as image, video, and 3D generation, as well as low-level vision tasks.

Consistent Self-Attention enhances consistency across generated images

A new way of calculating self-attention, Consistent Self-Attention, is proposed to enhance consistency across generated images, while staying faithful to the text prompts, without additional training.
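As a rough illustration of the idea, the sketch below implements a toy, single-head version of batch-shared self-attention in NumPy: each image's attention keys and values are augmented with tokens sampled from the other images in the batch, so features of the shared subject can be attended to across all images. The function name, sampling scheme, and shapes are simplifications for illustration, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistent_self_attention(Q, K, V, sample_rate=0.5, rng=None):
    """Toy sketch of batch-shared (consistent) self-attention.

    Q, K, V: (batch, tokens, dim) projections for one attention head.
    For each image, tokens sampled from the *other* images in the batch
    are appended to its keys and values, so attention can reuse the
    shared subject's features (a hypothetical simplification).
    """
    rng = rng or np.random.default_rng(0)
    B, N, D = Q.shape
    n_sample = int(N * sample_rate)
    out = np.empty_like(Q)
    for i in range(B):
        others = [j for j in range(B) if j != i]
        if others:
            # sample the same token positions from every other image
            idx = rng.choice(N, size=n_sample, replace=False)
            K_i = np.concatenate([K[i]] + [K[j, idx] for j in others])
            V_i = np.concatenate([V[i]] + [V[j, idx] for j in others])
        else:
            # batch of one: falls back to ordinary self-attention
            K_i, V_i = K[i], V[i]
        attn = softmax(Q[i] @ K_i.T / np.sqrt(D))
        out[i] = attn @ V_i
    return out
```

Because the extra tokens only extend the key/value set, the mechanism is training-free and can be plugged into an existing diffusion model's attention layers.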

StoryDiffusion combines Consistent Self-Attention and a Semantic Motion Predictor

The framework StoryDiffusion combines Consistent Self-Attention with a Semantic Motion Predictor to generate stable, subject-consistent images or videos for storytelling.

Controllability and consistency are key goals of the proposed approach

The proposed approach aims to provide controllability over generated content while ensuring consistency in characters' identity and attire.

Text-to-image generation is an important subfield of diffusion model applications

Text-to-image generation is an important subfield of diffusion model applications, with methods focusing on enhancing controllability.

Video generation from text descriptions has attracted attention

Video generation from text descriptions has attracted attention, with advances in computational efficiency and quality.

Transition video generation methods aim to predict intermediate content between start and end frames

Transition video generation methods aim to predict intermediate content between start and end frames using temporal networks.

Consistent Self-Attention maintains character consistency within a batch of images

In StoryDiffusion, Consistent Self-Attention is used to maintain character consistency within a batch of generated images.

The Semantic Motion Predictor encodes spatial information for accurate motion prediction

The proposed Semantic Motion Predictor encodes spatial information into a semantic space for accurate motion prediction in video generation tasks.
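To make the predictor's role concrete, the sketch below shows only its input/output contract: map the embeddings of a given start and end frame to a sequence of intermediate embeddings in a semantic space, which a video diffusion model would then decode into frames. The plain linear interpolation stands in for the learned transformer the paper trains; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def predict_transition_embeddings(start_emb, end_emb, n_frames):
    """Hypothetical stand-in for the Semantic Motion Predictor.

    start_emb, end_emb: (dim,) semantic embeddings of the two endpoint
    frames. Returns (n_frames, dim): one predicted embedding per
    intermediate frame. A trained transformer would replace this
    linear interpolation and capture non-linear, large motion.
    """
    ts = np.linspace(0.0, 1.0, n_frames)
    # each row is one predicted intermediate-frame embedding
    return np.stack([(1 - t) * start_emb + t * end_emb for t in ts])
```

Predicting in a compact semantic space rather than pixel space is what lets the method stay stable when the motion between the two endpoint frames is large.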

The Semantic Motion Predictor improves generation of smooth transition videos with large motion

In StoryDiffusion, the Semantic Motion Predictor improves the generation of smooth transition videos involving large motion.

Chapters

  1. Diffusion Models and Content Generation
  2. StoryDiffusion and Consistency
  3. Advantages of StoryDiffusion

Diffusion Models and Content Generation

00:08 - 08:09

  • Diffusion models have gained popularity and shown strong performance in domains such as image, video, and 3D generation, as well as low-level vision tasks.
  • Text-to-image generation is an important subfield of diffusion model applications with methods focusing on enhancing controllability.

StoryDiffusion and Consistency

15:20 - 23:33

  • StoryDiffusion generates subject-consistent images and transition videos in a training-free manner
  • Consistent Self-Attention is used to maintain character consistency within a batch of images
  • The proposed Semantic Motion Predictor encodes spatial information for accurate motion prediction in video generation tasks

Advantages of StoryDiffusion

23:06 - 31:02

  • The Semantic Motion Predictor improves the generation of smooth transition videos involving large motion
  • The method achieves subject-consistent images and videos and is training-free and pluggable
  • Quantitative comparisons show the method's robustness in maintaining character identity and conforming to prompt descriptions
  • StoryDiffusion outperforms state-of-the-art methods in generating smooth and physically plausible transition videos
  • A user study confirms StoryDiffusion's clear advantage in image and video generation