A flexible and general semi-supervised approach to multiple hypothesis testing
Abstract
Standard multiple testing procedures are designed to report a list of discoveries, or suspected false null hypotheses, given the hypotheses' p-values or test scores. Recently there has been growing interest in enhancing such procedures by combining additional information with the primary p-value or score. Specifically, such so-called "side information" can be leveraged to improve the separation between true and false nulls along additional "dimensions", thereby increasing the overall sensitivity. In line with this idea, we develop RESET (REScoring via Estimating and Training), which uses a unique data-splitting protocol that subsequently allows any semi-supervised learning approach to factor in the available side information while maintaining finite-sample error rate control. Our practical implementation, RESET Ensemble, selects from an ensemble of classification algorithms so that it is compatible with a range of multiple testing scenarios without the need for the user to select the appropriate one. We apply RESET to both p-value and competition-based multiple testing problems and show that RESET is (1) power-wise competitive, (2) fast compared to most tools, and (3) uniquely able to achieve finite-sample FDR or FDP control, depending on the user's preference.
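For readers unfamiliar with the competition-based setting mentioned above, the following is a minimal sketch of the standard target-decoy competition rule with the "+1" correction, which is the usual baseline for finite-sample FDR control in that setting. It is not RESET itself and does not use side information. The function name `tdc_discoveries`, its arguments, and the assumption that scores have no ties are illustrative choices, not taken from the paper.

```python
from typing import List, Sequence


def tdc_discoveries(
    scores: Sequence[float],
    is_target: Sequence[bool],
    alpha: float,
) -> List[int]:
    """Indices of target wins reported at nominal FDR level alpha.

    scores[i]    -- winning score of hypothesis i after competition
                    (higher means stronger evidence against the null)
    is_target[i] -- True if the target won the competition, False if the decoy won

    Hypothetical helper for illustration; assumes scores are distinct.
    """
    # Rank hypotheses from the strongest to the weakest winning score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    targets = 0
    decoys = 0
    best_k = 0  # longest prefix whose estimated FDR is at most alpha
    for k, i in enumerate(order, start=1):
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        # Decoy wins estimate the number of false target wins; the +1 is the
        # correction that makes the guarantee hold in finite samples.
        if targets > 0 and (decoys + 1) / targets <= alpha:
            best_k = k

    # Report only the target wins inside the selected prefix.
    return [i for i in order[:best_k] if is_target[i]]


if __name__ == "__main__":
    example_scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0]
    example_labels = [True, True, True, False, True, True, True, True]
    print(tdc_discoveries(example_scores, example_labels, alpha=0.5))
```

A rescoring method in the spirit of the abstract would replace `scores` with a learned score that also uses side information; doing so while preserving the error-rate guarantee is what requires the data-splitting protocol the abstract describes.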
- Publication: arXiv e-prints
- Pub Date: November 2024
- arXiv: arXiv:2411.15771
- Bibcode: 2024arXiv241115771F
- Keywords: Statistics - Methodology