Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection

Anonymous Authors

TL;DR: A unified image editing method built on a 2D text-to-image (T2I) diffusion model, using shared self-attention features and consecutive image sampling processes.

Abstract

While text-to-image models have achieved impressive capabilities in image generation and editing, applying them across different modalities typically requires training separate models. We propose a novel editing method that enables unified editing across modalities using only a basic 2D image text-to-image (T2I) diffusion model. We design a sampling method that edits consecutive images while maintaining semantic consistency by sharing self-attention features between the reference and consecutive image sampling processes. Inspired by existing methods for single-image editing with self-attention injection and video editing with shared attention, we propose a unified editing framework that combines the strengths of both approaches. Unlike previous work, our method enables editing across diverse modalities, including 3D scenes, videos, and panorama images.
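The abstract describes the mechanism only at a high level; as a rough illustration, the sketch below shows one way shared self-attention injection could look in PyTorch. The module name SharedSelfAttention, the cached ref_kv tensors, and all shapes are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class SharedSelfAttention(torch.nn.Module):
    """Minimal sketch (not the authors' code) of shared self-attention
    injection: consecutive frames attend to keys/values cached from the
    reference image's sampling path, encouraging semantic consistency."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        self.to_out = torch.nn.Linear(dim, dim)

    def forward(self, x, ref_kv=None):
        # x: (batch, tokens, dim); ref_kv: optional (key, value) pair of
        # shape (1, ref_tokens, dim) cached during reference sampling.
        b, n, d = x.shape
        q = self.to_q(x)
        k, v = self.to_k(x), self.to_v(x)
        if ref_kv is not None:
            # Inject the reference frame's keys/values so every frame
            # attends to the same appearance features.
            ref_k, ref_v = ref_kv
            k = torch.cat([k, ref_k.expand(b, -1, -1)], dim=1)
            v = torch.cat([v, ref_v.expand(b, -1, -1)], dim=1)

        def split(t):
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

In this sketch, the reference pass would run first and cache (ref_k, ref_v) at each denoising step and layer; the consecutive-image passes then receive that cache via ref_kv at the matching steps.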

Panorama Editing

[Figure: panorama editing results, comparing the original panorama with UnifyEdit (Ours).]

3D Scene Editing

[Figure: 3D scene editing results. Top: the original NeRF edited with the prompts "flowers, ukiyo-e style", "lego flowers", and "flowers, watercolor style". Bottom: comparisons of the original NeRF, CLIP-NeRF, NeRF-Art, Instruct-NeRF2NeRF, and UnifyEdit (ours) for the prompts "a sketch of flowers" and "ice castle on the snow".]

Video Editing

"Cow walking on the water, Van Gogh style"
Original
Gen1
P2V
CSD
UnifyEdit(Ours)
"Golden statue cow walking on the rock"
Original
Gen1
P2V
CSD
UnifyEdit(Ours)
"Blue car running in snowy winter"
Original
Gen1
P2V
CSD
UnifyEdit(Ours)