SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Diffusion-based video generation has seen unprecedented success over the past year. Recently proposed models from the community can generate cinematic, high-resolution videos with smooth motion from arbitrary input prompts. However, as a supertask of image generation, video generation requires more computation, so these models are hosted mostly on cloud servers, which limits broader adoption among content creators. In this work, we propose a comprehensive acceleration framework that brings the power of large-scale video diffusion models to edge users. On the network architecture side, we initialize from a compact image backbone and search for the design and arrangement of temporal layers that maximizes hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the number of denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 Pro Max within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate generation by orders of magnitude while delivering on-par quality.
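The headline figures in the abstract imply a simple latency budget, which the short sketch below works out. It is only illustrative arithmetic, not the authors' measurement: the function names are hypothetical, and the per-step figure assumes that the entire 5-second budget is spent on the 4 denoising steps, so it is an upper bound that ignores text encoding and decoding.

```python
def real_time_factor(video_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return video_seconds / generation_seconds


def max_seconds_per_step(generation_seconds: float, num_steps: int) -> float:
    """Upper bound on per-step latency, assuming all time goes to denoising."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return generation_seconds / num_steps


# Figures taken from the abstract: a 5-second video, generated within
# 5 seconds, using 4 denoising steps.
rtf = real_time_factor(video_seconds=5.0, generation_seconds=5.0)
step_budget = max_seconds_per_step(generation_seconds=5.0, num_steps=4)

print(f"real-time factor: at least {rtf:.2f}x")
print(f"per-step budget: at most {step_budget:.2f} s")
```

Because the abstract says "within 5 seconds", the real-time factor computed here is a lower bound and the per-step budget is an upper bound.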

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2412.10494

arXiv:

arXiv:2412.10494

Bibcode:

2024arXiv241210494W

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Artificial Intelligence;
Computer Science - Machine Learning;
Computer Science - Performance

E-Print:

https://snap-research.github.io/snapgen-v/
