Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
Abstract
As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance the image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-Compbench.
- Publication:
-
arXiv e-prints
- Pub Date:
- February 2024
- DOI:
- arXiv:
- arXiv:2402.16305
- Bibcode:
- 2024arXiv240216305L
- Keywords:
-
- Computer Science - Machine Learning;
- Computer Science - Artificial Intelligence