An evaluation of large pre-trained models for gesture recognition using synthetic videos
Abstract
In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable "training-free" classification. Specifically, we utilize various state-of-the-art video encoders to extract features for use in k-nearest neighbors classification, where the training data points are derived from synthetic videos only. We compare these results with another training-free approach: zero-shot classification using text descriptions of each gesture. In our experiments with the RoCoG-v2 dataset, we find that using synthetic training videos yields significantly lower classification accuracy on real test videos compared to using a relatively small number of real training videos. We also observe that video backbones that were fine-tuned on classification tasks serve as superior feature extractors, and that the choice of fine-tuning data has a substantial impact on k-nearest neighbors performance. Lastly, we find that zero-shot text-based classification performs poorly on the gesture recognition task, as gestures are not easily described through natural language.
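The "training-free" k-nearest neighbors step described in the abstract can be illustrated with a minimal sketch. It assumes features have already been extracted by a frozen video encoder; the toy three-dimensional vectors, the placeholder labels `gesture_a` and `gesture_b`, the choice of cosine similarity, and `k=3` are all illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def knn_predict(query, train_features, train_labels, k=3):
    """Label a query feature by majority vote among its k most similar
    training features. No model weights are updated at any point."""
    scored = [
        (cosine_similarity(query, feat), label)
        for feat, label in zip(train_features, train_labels)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    votes = Counter(label for _, label in top)
    best_count = max(votes.values())
    # Tie-break: among tied labels, pick the one with the closest neighbor.
    for _, label in top:
        if votes[label] == best_count:
            return label


# Hypothetical stand-ins for encoder outputs on synthetic training videos.
synthetic_features = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
]
synthetic_labels = ["gesture_a", "gesture_a", "gesture_b", "gesture_b"]

# Hypothetical stand-in for the encoder output on one real test video.
real_query = [0.8, 0.3, 0.0]

predicted = knn_predict(real_query, synthetic_features, synthetic_labels, k=3)
print(predicted)
```

In this setup the only "training" is storing the synthetic features, so classification accuracy depends entirely on how well the encoder's representation space aligns synthetic and real videos of the same gesture.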
- Publication:
- Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications II
- Pub Date:
- June 2024
- DOI:
- 10.1117/12.3013530
- arXiv:
- arXiv:2410.02152
- Bibcode:
- 2024SPIE13035E..0FR
- Keywords:
- Computer Science - Computer Vision and Pattern Recognition
- E-Print:
- Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications II (SPIE Defense + Commercial Sensing, 2024)