Open-Vocabulary Federated Learning with Multimodal Prototyping
Abstract
Existing federated learning (FL) studies usually assume the training label space and test label space are identical. However, in real-world applications, this assumption is too ideal to be true. A new user could come up with queries that involve data from unseen classes, and such open-vocabulary queries would directly defect such FL systems. Therefore, in this work, we explicitly focus on the under-explored open-vocabulary challenge in FL. That is, for a new user, the global server shall understand her/his query that involves arbitrary unknown classes. To address this problem, we leverage the pre-trained vision-language models (VLMs). In particular, we present a novel adaptation framework tailored for VLMs in the context of FL, named as Federated Multimodal Prototyping (Fed-MP). Fed-MP adaptively aggregates the local model weights based on light-weight client residuals, and makes predictions based on a novel multimodal prototyping mechanism. Fed-MP exploits the knowledge learned from the seen classes, and robustifies the adapted VLM to unseen categories. Our empirical evaluation on various datasets validates the effectiveness of Fed-MP.
- Publication:
-
arXiv e-prints
- Pub Date:
- April 2024
- DOI:
- 10.48550/arXiv.2404.01232
- arXiv:
- arXiv:2404.01232
- Bibcode:
- 2024arXiv240401232Z
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Computer Vision and Pattern Recognition
- E-Print:
- Accepted at NAACL 2024