A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis
Abstract
The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model achieves performance comparable to or better than the state of the art in each field. Inspired by Vision Transformers (ViTs), UNIMASK-M decomposes a human pose into body parts to leverage the spatio-temporal relationships present in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a single reconstruction problem with different masking patterns given as input. By explicitly informing the model which joints are masked, UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset, and it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly for long transition periods. More information can be found on the project website: https://evm7.github.io/UNIMASKM-page/
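To make the two core ideas concrete, the minimal NumPy sketch below illustrates how a skeleton might be patchified into body-part tokens (analogous to ViT image patches) and how different synthesis tasks reduce to one reconstruction problem under different masking patterns. The joint grouping, context length, and helper names (`BODY_PARTS`, `patchify`, `task_mask`) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical body-part grouping for a 22-joint skeleton; the paper's
# exact partition may differ.
BODY_PARTS = {
    "torso":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15, 16],
    "right_leg": [17, 18, 19, 20, 21],
}

def patchify(motion: np.ndarray) -> np.ndarray:
    """Split a motion clip of shape (T, J, 3) into body-part tokens.

    Returns an array of shape (T, P, max_joints, 3): each of the P body
    parts becomes one token per frame, zero-padded to a common joint count.
    """
    T = motion.shape[0]
    max_j = max(len(idx) for idx in BODY_PARTS.values())
    tokens = np.zeros((T, len(BODY_PARTS), max_j, 3), dtype=motion.dtype)
    for p, idx in enumerate(BODY_PARTS.values()):
        tokens[:, p, :len(idx)] = motion[:, idx]
    return tokens

def task_mask(T: int, P: int, task: str, context: int = 10) -> np.ndarray:
    """Boolean visibility mask over (T, P) tokens; True = visible."""
    mask = np.zeros((T, P), dtype=bool)
    if task == "prediction":
        # Motion forecasting: only a past context window is visible.
        mask[:context] = True
    elif task == "inbetweening":
        # Inbetweening: past context plus a future key-pose are visible.
        mask[:context] = True
        mask[-1] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

motion = np.random.randn(50, 22, 3).astype(np.float32)
tokens = patchify(motion)                              # (50, 5, 5, 3)
mask = task_mask(*tokens.shape[:2], "inbetweening")    # (50, 5)
print(tokens.shape, int(mask.sum()), "visible tokens")
```

Under this view, a single masked-autoencoder backbone can serve both tasks: the mask (and an explicit indication of which tokens are masked) is all that changes between prediction and inbetweening.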
- Publication:
- arXiv e-prints
- Pub Date:
- August 2023
- DOI:
- 10.48550/arXiv.2308.07301
- arXiv:
- arXiv:2308.07301
- Bibcode:
- 2023arXiv230807301V
- Keywords:
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Graphics;
- Computer Science - Robotics
- E-Print:
- Accepted to AAAI2024. Webpage: https://evm7.github.io/UNIMASKM-page/