Efficient Deployment of Large Language Models on Resource-constrained Devices
Abstract
Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter- Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experimental results conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4$\times$-6.9$\times$ and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.
- Publication:
-
arXiv e-prints
- Pub Date:
- January 2025
- DOI:
- arXiv:
- arXiv:2501.02438
- Bibcode:
- 2025arXiv250102438Y
- Keywords:
-
- Computer Science - Machine Learning;
- Computer Science - Artificial Intelligence;
- Computer Science - Computation and Language;
- Computer Science - Distributed;
- Parallel;
- and Cluster Computing