Prune Your Model Before Distill It


Knowledge distillation transfers knowledge from a cumbersome teacher to a small student. Recent results suggest that a student-friendly teacher is better suited to distillation because it provides more transferable knowledge. In this work, we propose a novel framework, "prune, then distill," that first prunes the model to make it more transferable and then distills it to the student. We provide several exploratory examples in which the pruned teacher teaches better than the original unpruned network. We further show theoretically that the pruned teacher acts as a regularizer in distillation, which reduces the generalization error. Based on this result, we propose a novel neural network compression scheme in which the student network is formed from the pruned teacher and the "prune, then distill" strategy is then applied. The code is available at https://github.com/ososos888/prune-then-distill
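The two steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation (see the linked repository for that). It assumes unstructured magnitude pruning for the "prune" step and the common temperature-softened KL-plus-cross-entropy loss for the "distill" step; the function names, the toy weights, and the `temperature` and `alpha` values are all hypothetical choices made for illustration.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude weights (assumed pruning rule, not from the paper).

    `sparsity` is the fraction of weights to remove; ties at the threshold
    may remove slightly more than that fraction.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def linear(weights, x):
    """Logits of a single linear layer: one dot product per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of KL(teacher || student) on softened outputs and
    cross-entropy on the true label (assumed loss form and hyperparameters)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Step 1: prune a toy teacher layer. Step 2: distill from the pruned teacher.
teacher_w = [[0.9, -0.05, 0.3, 0.02],
             [-0.4, 0.7, 0.01, -0.6],
             [0.08, -0.2, 0.5, 0.03]]
pruned_teacher_w = magnitude_prune(teacher_w, sparsity=0.5)

x = [1.0, 0.5, -1.0, 2.0]
student_logits = [0.2, -0.1, 0.0]  # hypothetical student output for x
loss = distillation_loss(student_logits, linear(pruned_teacher_w, x), label=0)
print(loss)
```

In a real training loop the student's parameters would be updated to minimize this loss over a dataset; the sketch only shows where the pruned teacher enters the computation.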

Publication:

arXiv e-prints

Pub Date:

September 2021

DOI:

10.48550/arXiv.2109.14960

arXiv:

arXiv:2109.14960

Bibcode:

2021arXiv210914960P

Keywords:

Computer Science - Machine Learning

E-Print:

Accepted at ECCV 2022
