Inference with Randomized Regression Trees
Abstract
Regression trees are a popular machine learning algorithm that fits piecewise constant models by recursively partitioning the predictor space. In this paper, we focus on performing statistical inference in the data-dependent model obtained from the fitted tree. We introduce Randomized Regression Trees (RRT), a novel selective inference method that adds independent Gaussian noise to the gain function underlying the splitting rules of classic regression trees. The RRT method offers several advantages. First, it uses the added randomization to obtain an exact pivot from the full dataset while accounting for the data-dependent structure of the fitted tree. Second, with a small amount of randomization, the RRT method achieves predictive accuracy similar to that of a model trained on the entire dataset, while providing significantly more powerful inference than data splitting methods, which rely only on a held-out portion of the data for inference. Third, unlike data splitting approaches, it yields intervals that adapt to the signal strength in the data. Our empirical analyses highlight these advantages of the RRT method and its ability to convert a purely predictive algorithm into a method capable of reliable and powerful inference in the tree model.
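The randomized splitting rule described in the abstract can be illustrated with a minimal sketch. It assumes that the gain is the reduction in sum of squared errors for a single predictor and that one independent Gaussian draw is added to the gain of each candidate threshold. The abstract does not specify the exact gain function, the noise scale, or the pivot construction, so the names `sse_gain`, `randomized_split`, and `noise_sd` are hypothetical and not taken from the paper.

```python
import numpy as np


def sse_gain(y, left_mask):
    """Reduction in sum of squared errors from splitting y into two groups."""
    y_left, y_right = y[left_mask], y[~left_mask]
    if y_left.size == 0 or y_right.size == 0:
        return -np.inf  # degenerate split, never selected

    def sse(v):
        return float(np.sum((v - v.mean()) ** 2))

    return sse(y) - sse(y_left) - sse(y_right)


def randomized_split(x, y, noise_sd, rng):
    """Choose a threshold on one predictor by maximizing gain plus Gaussian noise.

    Assumption: one independent N(0, noise_sd^2) draw per candidate threshold.
    """
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2.0  # midpoints between sorted unique values
    gains = np.array([sse_gain(y, x <= t) for t in thresholds])
    noisy_gains = gains + rng.normal(0.0, noise_sd, size=gains.shape)
    return float(thresholds[int(np.argmax(noisy_gains))])


x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 10.0, 10.0, 10.0])
split = randomized_split(x, y, noise_sd=1.0, rng=np.random.default_rng(0))
```

With `noise_sd=0.0` the rule reduces to the classic greedy split. A larger `noise_sd` makes the selected split less dependent on the data, which is the kind of trade-off between predictive accuracy and inferential power that the abstract refers to.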
- Publication: arXiv e-prints
- Pub Date: December 2024
- arXiv: arXiv:2412.20535
- Bibcode: 2024arXiv241220535B
- Keywords: Statistics - Methodology
- E-Print: 49 pages, 6 figures