StructureBoost: Efficient Gradient Boosting for Structured Categorical Variables
Abstract
Gradient boosting methods based on Structured Categorical Decision Trees (SCDT) have been demonstrated to outperform numerical and one-hot-encodings on problems where the categorical variable has a known underlying structure. However, the enumeration procedure in the SCDT is infeasible except for categorical variables with low or moderate cardinality. We propose and implement two methods to overcome the computational obstacles and efficiently perform Gradient Boosting on complex structured categorical variables. The resulting package, called StructureBoost, is shown to outperform established packages such as CatBoost and LightGBM on problems with categorical predictors that contain sophisticated structure. Moreover, we demonstrate that StructureBoost can make accurate predictions on unseen categorical values due to its knowledge of the underlying structure.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2020
- DOI:
- 10.48550/arXiv.2007.04446
- arXiv:
- arXiv:2007.04446
- Bibcode:
- 2020arXiv200704446L
- Keywords:
-
- Statistics - Machine Learning;
- Computer Science - Machine Learning;
- Statistics - Applications;
- Statistics - Computation