Humans possess an intricate and powerful visual system in order to perceive and understand the environing world. Human perception can effortlessly detect and correctly group features in visual data and can even interpret random-dot videos induced by imaging natural dynamic scenes with highly noisy sensors such as ultrasound imaging. Remarkably, this happens even if perception completely fails when the same information is presented frame by frame rather than in a video sequence. We study this property of surprising dynamic perception with the first goal of proposing a new detection and spatio-temporal grouping algorithm for such signals when, per frame, the information on objects is both random and sparse and embedded in random noise. The algorithm is based on the succession of temporal integration and spatial statistical tests of unlikeliness, the a contrario framework. The algorithm not only manages to handle such signals but the striking similarity in its performance to the perception by human observers, as witnessed by a series of psychophysical experiments on image and video data, leads us to see in it a simple computational Gestalt model of human perception with only two parameters: the time integration and the visual angle for candidate shapes to be detected.