A Gaussian Process Model for Ordinal Data with Applications to Chemoinformatics
Abstract
With the proliferation of screening tools for chemical testing, it is now possible to create vast databases of chemicals easily. However, rigorous statistical methodologies employed to analyse these databases are in their infancy, and further development to facilitate chemical discovery is imperative. In this paper, we present conditional Gaussian process models to predict ordinal outcomes from chemical experiments, where the inputs are chemical compounds. We implement the Tanimoto distance, a metric on the chemical space, within the covariance of the Gaussian processes to capture correlated effects in the chemical space. A novel aspect of our model is that the kernel contains a scaling parameter, a feature not previously examined in the literature, that controls the strength of the correlation between elements of the chemical space. Using molecular fingerprints, a numerical representation of a compound's location within the chemical space, we show that accounting for correlation amongst chemical compounds improves predictive performance over the uncorrelated model, where effects are assumed to be independent. Moreover, we present a genetic algorithm for the facilitation of chemical discovery and identification of important features to the compound's efficacy. A simulation study is conducted to demonstrate the suitability of the proposed methods. Our proposed methods are demonstrated on a hazard classification problem of organic solvents.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2024
- DOI:
- 10.48550/arXiv.2405.09989
- arXiv:
- arXiv:2405.09989
- Bibcode:
- 2024arXiv240509989G
- Keywords:
-
- Statistics - Applications;
- Statistics - Methodology;
- Statistics - Machine Learning