Transferable Perturbations of Deep Feature Distributions

doi:10.48550/arXiv.2004.12519

Almost all current adversarial attacks on CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted black-box transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples.

Publication: arXiv e-prints

Pub Date: April 2020

DOI: 10.48550/arXiv.2004.12519

arXiv: arXiv:2004.12519

Bibcode: 2020arXiv200412519I

Keywords: Computer Science - Machine Learning; Statistics - Machine Learning

E-Print: Published as a conference paper at ICLR 2020
