High Fidelity Image Synthesis With Deep VAEs In Latent Space
Abstract
We present fast, realistic image generation on high-resolution, multimodal datasets using hierarchical variational autoencoders (VAEs) trained on a deterministic autoencoder's latent space. In this two-stage setup, the autoencoder compresses the image into its semantic features, which are then modeled with a deep VAE. With this method, the VAE avoids modeling the fine-grained details that constitute the majority of the image's code length, allowing it to focus on learning its structural components. We demonstrate the effectiveness of our two-stage approach, achieving a FID of 9.34 on the ImageNet-256 dataset which is comparable to BigGAN. We make our implementation available online.
- Publication:
-
arXiv e-prints
- Pub Date:
- March 2023
- DOI:
- arXiv:
- arXiv:2303.13714
- Bibcode:
- 2023arXiv230313714L
- Keywords:
-
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Machine Learning;
- Electrical Engineering and Systems Science - Image and Video Processing
- E-Print:
- 19 pages, 16 figures