Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

[Figure]

Our method directly optimizes the cross-attention maps in Transformer-based diffusion models, significantly enhancing the model's performance in coarse-grained attribute binding and further improving fine-grained attribute and style binding. For instance, our approach enables precise control over the color of an apple’s flesh and stem as well as the style of two distinct concepts.

Abstract

We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, which significantly surpasses other state-of-the-art methods across all evaluated tasks.
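To make the idea concrete, here is a minimal sketch of what one Self-Coherence Guidance update could look like in PyTorch: a coherence loss measures how much of each attribute or style token's attention mass falls inside its concept mask from the previous denoising step, and the cross-attention map is nudged along the negative gradient of that loss. The tensor layout, function name, exact loss, and update rule are our own illustrative assumptions, not the paper's precise formulation.

```python
import torch

def apply_self_coherence_guidance(attn, masks, token_ids, step_size=0.1):
    """One guidance update on a cross-attention map (illustrative sketch).

    attn:      [B, HW, T] cross-attention probabilities at the current step
               (assumed layout of a hooked attention tensor).
    masks:     list of [B, HW] binary concept masks derived from the
               previous denoising step's attention maps.
    token_ids: list of prompt-token indices (attribute/style words), one per mask.
    """
    attn = attn.detach().requires_grad_(True)
    loss = attn.new_zeros(())
    for mask, tok in zip(masks, token_ids):
        a = attn[..., tok]                               # [B, HW] map of this token
        a = a / (a.sum(dim=-1, keepdim=True) + 1e-8)     # normalize over spatial positions
        inside = (a * mask).sum(dim=-1)                  # attention mass inside the mask
        loss = loss + (1.0 - inside).mean()              # penalize mass leaking outside
    grad, = torch.autograd.grad(loss, attn)
    refined = (attn - step_size * grad).clamp_min(0)     # push attention toward the mask
    return refined / (refined.sum(dim=-1, keepdim=True) + 1e-8)  # renormalize over tokens
```

Because the masks come from the model's own earlier attention, an update of this kind needs no extra training, which is consistent with the training-free setting described above.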

[Figure]

(a) Overview of our method. Given a prompt, we extract the corresponding concept masks and use them to directly guide the attribute or style maps. (b) For fine-grained attribute binding, we extract masks by using LLMs to plan the region proportions. (c) For coarse-grained attribute binding and style binding, we directly apply clustering to extract the corresponding masks.
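As an illustration of route (c), the sketch below clusters a concept token's averaged cross-attention map and keeps the most strongly attended cluster as the concept mask. Plain k-means stands in here for whatever clustering the method actually uses, and the helper name and array layout are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def concept_mask_from_attention(attn_map, n_clusters=2):
    """Cluster a concept token's attention map and keep the cluster that
    attends most strongly to the token (illustrative sketch).

    attn_map: [H, W] cross-attention map of the concept's noun token,
              averaged over heads, taken from an earlier denoising step.
    """
    h, w = attn_map.shape
    feats = attn_map.reshape(-1, 1)                        # one scalar feature per location
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    # Keep the cluster whose members have the highest mean attention.
    means = [attn_map.reshape(-1)[labels == k].mean() for k in range(n_clusters)]
    return (labels == int(np.argmax(means))).reshape(h, w).astype(np.float32)
```

Route (b) would instead rasterize the LLM-planned region proportions into masks of the same spatial shape before they are used for guidance.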

[Figure]

Qualitative comparison of our method with other SOTA methods. Our approach consistently generates high-quality images with superior alignment across coarse-grained attribute binding, fine-grained attribute binding, and style binding tasks.

[Figure]

During the generation process, the attention entropy across different layers of the U-Net and DiT architectures reflects semantic richness, with lower attention entropy indicating greater semantic information. The visualized token corresponds to the word "balloon".
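A minimal sketch of how such per-layer attention entropy could be measured from a hooked cross-attention tensor is given below; the tensor layout and helper name are assumptions made for illustration.

```python
import torch

def token_attention_entropy(attn, token_idx, eps=1e-8):
    """Entropy of one prompt token's spatial attention distribution
    in a single layer (illustrative sketch).

    attn:      [heads, HW, T] cross-attention probabilities from one layer
               (assumed layout of a hooked attention tensor).
    token_idx: index of the prompt token, e.g. the word "balloon".
    """
    p = attn[..., token_idx]                        # [heads, HW]
    p = p / (p.sum(dim=-1, keepdim=True) + eps)     # normalize over spatial positions
    entropy = -(p * (p + eps).log()).sum(dim=-1)    # Shannon entropy per head
    return entropy.mean()                           # average over heads
```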