A crucial ability of human intelligence is to build models of individual 3D objects from partial observations of a scene. Recent work has enabled unsupervised 3D representation learning at the scene level, yet learning to decompose a 3D scene into objects and to build individual object models from images of a multi-object scene remains elusive. In this paper, we propose a probabilistic generative model that learns modular, compositional 3D object models from observations of a multi-object scene. The proposed model can (i) infer 3D object representations by learning to search for and group object areas, and (ii) render, from an arbitrary viewpoint, not only individual objects but also the full scene by compositing the objects. The entire learning process is unsupervised and end-to-end. We also demonstrate that the learned representation permits object-wise manipulation and novel scene generation, and that it generalizes to various settings.