Understanding Representations by Exploring Galaxies in Chemical Space
Abstract
We present a Monte Carlo approach for studying chemical feature distributions of molecules without training a machine learning model or performing exhaustive enumeration. For any representation, the algorithm generates molecules with a predefined similarity to a given molecule. It serves as a diagnostic tool to understand which molecules are grouped together in feature space and to identify shortcomings of representations and of embeddings from unsupervised learning. In this work, we first study clusters surrounding chosen molecules and demonstrate that common representations do not yield a constant density of molecules in feature space, with possible implications for learning behavior. Next, we observe a connection between representations and properties: the property value of a central molecule correlates linearly with the average radial slope of that property in chemical space. Molecules with extremal property values have the largest property derivatives in chemical space, which provides a route to improve the data efficiency of a representation by tailoring it towards a given property. Finally, we demonstrate applications for sampling molecules with specified metric-dependent distributions to generate molecules biased toward graph spaces of interest.
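The sampling idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It makes several loud assumptions: fixed-length bit tuples stand in for molecules, Hamming distance stands in for the representation metric, a single bit flip stands in for a molecular mutation, and a quadratic penalty outside a target distance shell stands in for the biasing potential. All function names and parameters (`shell_penalty`, `sample_shell`, `beta`, `d_min`, `d_max`) are hypothetical.

```python
import math
import random

def hamming(a, b):
    """Number of positions where two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def shell_penalty(d, d_min, d_max):
    """Assumed bias: zero inside the target distance shell, quadratic outside."""
    if d < d_min:
        return (d_min - d) ** 2
    if d > d_max:
        return (d - d_max) ** 2
    return 0.0

def sample_shell(center, d_min, d_max, n_steps=5000, beta=2.0, seed=0):
    """Metropolis walk over bit strings biased toward d_min <= distance <= d_max."""
    rng = random.Random(seed)
    current = list(center)
    energy = shell_penalty(0, d_min, d_max)
    visited = set()
    for _ in range(n_steps):
        i = rng.randrange(len(current))
        current[i] ^= 1  # propose: flip one bit (a symmetric move)
        new_energy = shell_penalty(hamming(current, center), d_min, d_max)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            energy = new_energy  # accept the move
            if energy == 0.0:
                visited.add(tuple(current))  # record only in-shell points
        else:
            current[i] ^= 1  # reject: undo the flip
    return visited

# Toy "central molecule": a 16-bit fingerprint of all zeros.
center = (0,) * 16
samples = sample_shell(center, d_min=3, d_max=5)
```

In this toy setting, `samples` holds the distinct bit strings the walk visited whose distance from `center` lies in the requested shell. A real application would replace the bit-flip move with chemically valid graph mutations and the Hamming distance with the distance induced by the representation under study.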
- Publication: arXiv e-prints
- Pub Date: September 2023
- DOI: 10.48550/arXiv.2309.09194
- arXiv: arXiv:2309.09194
- Bibcode: 2023arXiv230909194W
- Keywords: Physics - Chemical Physics