TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes

doi:10.48550/arXiv.2312.06499

Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes such as gender and ethnicity, which leads to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying the final latent representations to remove sensitive information without retraining the entire model. However, these methods typically rely on linear classifiers, which leaves models vulnerable to non-linear adversaries capable of recovering the sensitive information. We introduce Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from the final latent representations, ensuring fairness even against non-linear classifiers. Our experiments show that TaCo outperforms state-of-the-art methods, achieving greater reductions in the accuracy with which non-linear classifiers predict sensitive attributes while preserving overall task performance. Code is available at https://github.com/fanny-jourdan/TaCo.

Publication: arXiv e-prints

Pub Date: December 2023

DOI: 10.48550/arXiv.2312.06499

arXiv: arXiv:2312.06499

Bibcode: 2023arXiv231206499J

Keywords: Computer Science - Computation and Language; Statistics - Machine Learning
