Unmasking Trees for Tabular Data

doi:10.48550/arXiv.2407.05593

McCarter, Calvin

Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We present UnmaskingTrees, a simple method for tabular imputation (and generation) that uses gradient-boosted decision trees to incrementally unmask individual features. This approach offers state-of-the-art performance on imputation and on generation given training data with missingness, and it has competitive performance on vanilla generation. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution and so accommodates features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. Finally, we consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
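One possible reading of "incrementally unmask individual features" can be sketched as a data-preparation step. This is a minimal illustration under stated assumptions, not the UnmaskingTrees package's API: the function name `unmasking_pairs`, the `MASK` placeholder, and the choice of a uniformly random reveal order are all hypothetical. For each row, features are revealed one at a time; before each reveal, the partially masked row becomes a model input and the value about to be revealed becomes the prediction target.

```python
import random

MASK = None  # hypothetical placeholder standing in for a hidden feature value


def unmasking_pairs(rows, seed=0):
    """Build (masked_row, feature_index, target_value) training triples.

    Assumption: features are revealed in a random order per row. A real
    implementation would then fit one predictor per feature on these triples.
    """
    rng = random.Random(seed)
    pairs = []
    for row in rows:
        order = list(range(len(row)))
        rng.shuffle(order)                 # random unmasking order for this row
        visible = [MASK] * len(row)        # start with every feature hidden
        for j in order:
            # Input: what is visible so far. Target: the next feature's value.
            pairs.append((list(visible), j, row[j]))
            visible[j] = row[j]            # reveal feature j for later steps
    return pairs


rows = [[5.1, 3.5, "a"], [4.9, 3.0, "b"]]
pairs = unmasking_pairs(rows)
```

Each row with d features yields d triples, and the target feature is always hidden in its own input. Using `None` as the mask is a simplification: data with genuinely missing entries would need a mask marker that cannot be confused with a missing value.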

Publication: arXiv e-prints

Pub Date: July 2024

DOI: 10.48550/arXiv.2407.05593

arXiv: arXiv:2407.05593

Bibcode: 2024arXiv240705593M

Keywords: Computer Science - Machine Learning; Statistics - Machine Learning

E-Print: v0.3.0 of UnmaskingTrees software
