A brief prehistory of double descent
Abstract
In their thought-provoking paper [1], Belkin et al. illustrate and discuss the shape of risk curves in the context of modern high-complexity learners. Given a fixed training sample size $n$, such curves show the risk of a learner as a function of some (approximate) measure of its complexity $N$. With $N$ the number of features, these curves are also referred to as feature curves. A salient observation in [1] is that these curves can display what they call double descent: with increasing $N$, the risk initially decreases, attains a minimum, and then increases until $N$ equals $n$, where the training data is fitted perfectly. Increasing $N$ even further, the risk decreases a second and final time, creating a peak at $N=n$. This twofold descent may come as a surprise, but contrary to what [1] reports, it has not been overlooked historically. Our letter draws attention to some original, earlier findings of interest to contemporary machine learning.
- Publication:
- Proceedings of the National Academy of Sciences
- Pub Date:
- May 2020
- DOI:
- 10.1073/pnas.2001875117
- arXiv:
- arXiv:2004.04328
- Bibcode:
- 2020PNAS..11710625L
- Keywords:
- Computer Science - Machine Learning
- E-Print:
- doi:10.1073/pnas.2001875117