Open-Vocabulary Federated Learning with Multimodal Prototyping

doi:10.48550/arXiv.2404.01232

Existing federated learning (FL) studies usually assume that the training label space and the test label space are identical. In real-world applications, however, this assumption rarely holds. A new user may issue queries that involve data from unseen classes, and such open-vocabulary queries would directly defeat these FL systems. In this work, we therefore focus explicitly on the under-explored open-vocabulary challenge in FL: for a new user, the global server should understand queries that involve arbitrary unknown classes. To address this problem, we leverage pre-trained vision-language models (VLMs). In particular, we present a novel adaptation framework tailored to VLMs in the context of FL, named Federated Multimodal Prototyping (Fed-MP). Fed-MP adaptively aggregates the local model weights based on lightweight client residuals and makes predictions using a novel multimodal prototyping mechanism. Fed-MP exploits the knowledge learned from the seen classes and makes the adapted VLM more robust to unseen categories. Our empirical evaluation on various datasets validates the effectiveness of Fed-MP.

Publication:

arXiv e-prints

Pub Date:

April 2024

DOI:

10.48550/arXiv.2404.01232

arXiv:

arXiv:2404.01232

Bibcode:

2024arXiv240401232Z

Keywords:

Computer Science - Computation and Language;
Computer Science - Computer Vision and Pattern Recognition

E-Print:

Accepted at NAACL 2024
