Advanced image synthesis methods can generate photo-realistic images of faces, birds, bedrooms, and more. However, these methods do not explicitly model or preserve essential structural constraints such as junctions, parallel lines, and planar surfaces. In this paper, we study the problem of structured indoor image generation for design applications. We use a small-scale dataset that contains images of various indoor scenes together with their corresponding ground-truth wireframe annotations. Because existing image synthesis models trained on this dataset fail to preserve structural integrity, we propose a novel model based on a structure-appearance joint embedding learned from both images and wireframes. In our model, structural constraints are explicitly enforced by learning the joint embedding in a shared encoder network that must support the generation of both images and wireframes. We demonstrate the effectiveness of this joint embedding learning scheme on the task of translating indoor scene wireframes to images. Although wireframes carry less semantic information than the inputs of traditional image translation tasks, our model generates high-fidelity indoor scene renderings that closely match the input wireframes. Experiments on a wireframe-scene dataset show that our proposed translation model significantly outperforms existing state-of-the-art methods in both the visual quality and the structural integrity of the generated images.
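The abstract does not give architectural details, so the following is only a minimal sketch of the general idea of a shared encoder whose embedding must serve two decoders, one for images and one for wireframes. Everything in it is an assumption made for illustration: the class name `JointEmbeddingSketch`, the linear layers, the mean-squared-error losses, and the `structure_weight` parameter are hypothetical and are not taken from the paper.

```python
import random


def linear(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def init_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


class JointEmbeddingSketch:
    """Hypothetical toy model: one shared encoder, two decoder heads.

    Linear maps stand in for the real networks (an assumption).
    """

    def __init__(self, wire_dim, image_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        self.encoder = init_matrix(embed_dim, wire_dim, rng)
        self.image_decoder = init_matrix(image_dim, embed_dim, rng)
        self.wire_decoder = init_matrix(wire_dim, embed_dim, rng)

    def forward(self, wireframe):
        # Both outputs are decoded from the same embedding z.
        z = linear(self.encoder, wireframe)
        return linear(self.image_decoder, z), linear(self.wire_decoder, z)

    def joint_loss(self, wireframe, image, structure_weight=1.0):
        # Appearance term plus a weighted structure (wireframe) term.
        pred_image, pred_wire = self.forward(wireframe)
        return mse(pred_image, image) + structure_weight * mse(pred_wire, wireframe)


# Usage with flattened toy vectors in place of real images and wireframes.
model = JointEmbeddingSketch(wire_dim=4, image_dim=6, embed_dim=3)
wireframe = [1.0, 0.0, 1.0, 0.0]
image = [0.5] * 6
loss = model.joint_loss(wireframe, image)
```

Because the wireframe term is computed from the same embedding as the image term, lowering the combined loss requires the embedding to retain structural information. This is the intuition behind the joint embedding, shown here in a deliberately simplified form.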