Sentence-level Privacy for Document Embeddings

doi:10.48550/arXiv.2205.04605

User language data can contain highly sensitive personal content. As such, it is imperative to offer users a strong and interpretable privacy guarantee when learning from their data. In this work, we propose SentDP: pure local differential privacy at the sentence level for a single user document. We propose a novel technique, DeepCandidate, that combines concepts from robust statistics and language modeling to produce high-dimensional, general-purpose $\epsilon$-SentDP document embeddings. This guarantees that any single sentence in a document can be substituted with any other sentence while keeping the embedding $\epsilon$-indistinguishable. Our experiments indicate that these private document embeddings are useful for downstream tasks like sentiment analysis and topic classification and even outperform baseline methods with weaker guarantees like word-level Metric DP.
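
The indistinguishability guarantee described above can be written in the standard form of pure differential privacy. This is a sketch under assumed notation that does not appear in the abstract: $M$ denotes the randomized embedding mechanism, $x$ and $x'$ are any two documents that differ by substituting a single sentence, and $S$ is any measurable set of output embeddings.

$$
\Pr[M(x) \in S] \le e^{\epsilon} \, \Pr[M(x') \in S]
$$

Since the roles of $x$ and $x'$ can be swapped, the bound holds in both directions, and a smaller $\epsilon$ means the two output distributions are harder to tell apart.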

Publication: arXiv e-prints

Pub Date: May 2022

DOI: 10.48550/arXiv.2205.04605

arXiv: arXiv:2205.04605

Bibcode: 2022arXiv220504605M

Keywords: Computer Science - Machine Learning; Computer Science - Computation and Language

E-Print: Presented at ACL 2022 main conference
