Zero-Shot Text-to-Image Generation

doi:10.48550/arXiv.2102.12092

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
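The "single stream" idea in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names `build_stream` and `next_token_pairs` are hypothetical, and offsetting image token ids by the text vocabulary size is just one simple way to keep the two vocabularies from colliding. The sketch shows only how text and image tokens might be concatenated and turned into next-token prediction examples, not the paper's actual model.

```python
from typing import List, Sequence, Tuple


def build_stream(
    text_tokens: Sequence[int],
    image_tokens: Sequence[int],
    text_vocab_size: int,
) -> List[int]:
    """Concatenate text and image tokens into one sequence.

    Assumption: image token ids are shifted by text_vocab_size so that
    both modalities share a single id space without overlap.
    """
    shifted_image = [text_vocab_size + t for t in image_tokens]
    return list(text_tokens) + shifted_image


def next_token_pairs(stream: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for autoregressive training.

    Each target token is predicted from all tokens that precede it.
    """
    return [(list(stream[:i]), stream[i]) for i in range(1, len(stream))]


if __name__ == "__main__":
    stream = build_stream([3, 1], [0, 2], text_vocab_size=10)
    for context, target in next_token_pairs(stream):
        print(context, "->", target)
```

Because the image tokens follow the text tokens in the same sequence, every image token is predicted with the full caption in its context, which is what allows a model trained this way to generate an image from text.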

Publication: arXiv e-prints

Pub Date: February 2021

DOI: 10.48550/arXiv.2102.12092

arXiv: arXiv:2102.12092

Bibcode: 2021arXiv210212092R

Keywords: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning
