DMin: Scalable Training Data Influence Estimation for Diffusion Models

doi:10.48550/arXiv.2412.08637

Identifying the training data samples that most influence a generated image is a critical task in understanding diffusion models, yet existing influence estimation methods are constrained to small-scale or LoRA-tuned models due to computational limitations. As diffusion models scale up, these methods become impractical. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. By leveraging efficient gradient compression and retrieval techniques, DMin reduces storage requirements from 339.39 TB to only 726 MB and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate that DMin is both effective in identifying influential training samples and efficient in terms of computational and storage requirements.
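
The abstract names two ingredients, gradient compression and top-k retrieval, without giving details. The following is a minimal sketch of the general idea only, not the authors' implementation. It assumes compression by a seeded random Gaussian projection (a swapped-in, generic technique) and assumes influence is scored as the inner product between the compressed gradient of the generated image and each stored compressed training gradient. All function names are hypothetical. The storage ratio helper assumes decimal units (1 TB = 10^6 MB), which the abstract does not specify.

```python
import heapq
import random


def compression_ratio(before_tb, after_mb):
    """Storage ratio before/after compression, ASSUMING decimal units (1 TB = 1e6 MB)."""
    return before_tb * 1e6 / after_mb


def make_projection(dim, k, seed=0):
    """Build a k x dim random Gaussian projection matrix (assumed compression scheme)."""
    rng = random.Random(seed)
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(k)]


def compress(grad, proj):
    """Project a gradient vector of length dim down to length k."""
    return [sum(p * g for p, g in zip(row, grad)) for row in proj]


def top_k_influential(query, stored, k):
    """Return the k (score, index) pairs with the largest inner product against query."""
    scores = [
        (sum(q * s for q, s in zip(query, vec)), idx)
        for idx, vec in enumerate(stored)
    ]
    return heapq.nlargest(k, scores)


if __name__ == "__main__":
    # Toy gradients standing in for per-sample training gradients (hypothetical data).
    rng = random.Random(1)
    dim, k = 64, 8
    proj = make_projection(dim, k)
    train_grads = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(100)]
    stored = [compress(g, proj) for g in train_grads]  # only these are kept on disk
    query = compress(train_grads[0], proj)             # gradient of the "generated image"
    print(top_k_influential(query, stored, 3))
    print(compression_ratio(339.39, 726))
```

Under the decimal-unit assumption, the reported reduction from 339.39 TB to 726 MB corresponds to a ratio on the order of 10^5. The linear scan in `top_k_influential` is only for illustration; the abstract does not say which retrieval structure DMin uses to reach sub-second queries.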

Publication: arXiv e-prints

Pub Date: December 2024

DOI: 10.48550/arXiv.2412.08637

arXiv: arXiv:2412.08637

Bibcode: 2024arXiv241208637L

Keywords: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Machine Learning

E-Print: 14 pages, 6 figures, 8 tables. Under Review
