This paper explores the problem of hyperspectral image (HSI) super-resolution that merges a low resolution HSI (LR-HSI) and a high resolution multispectral image (HR-MSI). The cross-modality distribution of the spatial and spectral information makes the problem challenging. Inspired by the classic wavelet decomposition-based image fusion, we propose a novel lightweight deep neural network-based framework, namely progressive zero-centric residual network (PZRes-Net), to address this problem efficiently and effectively. Specifically, PZRes-Net learns a high resolution and zero-centric residual image, which contains high-frequency spatial details of the scene across all spectral bands, from both inputs in a progressive fashion along the spectral dimension. And the resulting residual image is then superimposed onto the up-sampled LR-HSI in a mean-value invariant manner, leading to a coarse HR-HSI, which is further refined by exploring the coherence across all spectral bands simultaneously. To learn the residual image efficiently and effectively, we employ spectral-spatial separable convolution with dense connections. In addition, we propose zero-mean normalization implemented on the feature maps of each layer to realize the zero-mean characteristic of the residual image. Extensive experiments over both real and synthetic benchmark datasets demonstrate that our PZRes-Net outperforms state-of-the-art methods to a significant extent in terms of both 4 quantitative metrics and visual quality, e.g., our PZRes-Net improves the PSNR more than 3dB, while saving 2.3$\times$ parameters and consuming 15$\times$ less FLOPs.