A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses for Administrative Tax Data
Federal administrative tax data are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data are validation servers, which allow individuals to query statistics without accessing the confidential data. This paper studies the feasibility of using differentially private (DP) methods to implement such a server. We provide an extensive study on existing DP methods for releasing tabular statistics, means, quantiles, and regression estimates. We also include new methodological adaptations to existing DP regression algorithms for using new data types and returning standard error estimates. We evaluate the selected methods based on the accuracy of the output for statistical analyses, using real administrative tax data obtained from the Internal Revenue Service Statistics of Income (SOI) Division. Our findings show that a validation server would be feasible for simple statistics but would struggle to produce accurate regression estimates and confidence intervals. We outline challenges and offer recommendations for future work on validation servers. This is the first comprehensive statistical study of DP methodology on a real, complex dataset, that has significant implications for the direction of a growing research field.