G-MATT: Single-step retrosynthesis prediction using molecular grammar tree transformer
Abstract
Various template-based and template-free approaches have been proposed for single-step retrosynthesis prediction in recent years. While these approaches demonstrate strong performance from a data-driven metrics standpoint, many model architectures do not incorporate underlying chemistry principles. Here, we propose a novel chemistry-aware retrosynthesis prediction framework that combines powerful data-driven models with prior domain knowledge. We present a tree-to-sequence transformer architecture that utilizes hierarchical SMILES grammar-based trees, incorporating crucial chemistry information that is often overlooked by SMILES text-based representations, such as local structures and functional groups. The proposed framework, grammar-based molecular attention tree transformer (G-MATT), achieves significant performance improvements compared to baseline retrosynthesis models. G-MATT achieves a promising top-1 accuracy of 51% (top-10 accuracy of 79.1%), an invalid rate of 1.5%, and a bioactive similarity rate of 74.8% on the USPTO-50K dataset. Additional analyses of G-MATT attention maps demonstrate the ability to retain chemistry knowledge without relying on excessively complex model architectures.
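To make the idea of a hierarchical SMILES grammar-based tree concrete, the following is a minimal sketch of a recursive-descent parser for a small toy subset of SMILES. It is not the grammar or code used in G-MATT: the production names (`chain`, `branched_atom`, `branch`, `ring_bond`), the `Node` and `ToySmilesParser` classes, and the supported atom set are all assumptions chosen for illustration. The point is only that a branch such as `(=O)` becomes an explicit subtree, so local structure appears as tree structure rather than as a flat character sequence.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str                                   # nonterminal name or terminal token
    children: list = field(default_factory=list)

TWO_CHAR_ATOMS = ("Cl", "Br")                     # toy subset, not full SMILES
ONE_CHAR_ATOMS = set("BCNOSPFIcnos")
BONDS = set("-=#")

class ToySmilesParser:
    """Hypothetical toy grammar:
       chain         := branched_atom (bond? branched_atom)*
       branched_atom := atom ring_bond* branch*
       branch        := '(' bond? chain ')'
    """
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def starts_atom(self):
        return (self.text.startswith(TWO_CHAR_ATOMS, self.pos)
                or self.peek() in ONE_CHAR_ATOMS)

    def terminal(self, kind):
        ch = self.text[self.pos]
        self.pos += 1
        return Node(kind, [Node(ch)])

    def expect(self, ch):
        if self.peek() != ch:
            raise ValueError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def parse(self):
        tree = self.chain()
        if self.pos != len(self.text):
            raise ValueError(f"unexpected {self.peek()!r} at position {self.pos}")
        return tree

    def chain(self):
        node = Node("chain", [self.branched_atom()])
        while self.peek() in BONDS or self.starts_atom():
            if self.peek() in BONDS:
                node.children.append(self.terminal("bond"))
            node.children.append(self.branched_atom())
        return node

    def branched_atom(self):
        node = Node("branched_atom", [self.atom()])
        while self.peek().isdigit():              # ring-closure labels such as '1'
            node.children.append(self.terminal("ring_bond"))
        while self.peek() == "(":                 # each branch becomes a subtree
            node.children.append(self.branch())
        return node

    def branch(self):
        self.expect("(")
        node = Node("branch")
        if self.peek() in BONDS:
            node.children.append(self.terminal("bond"))
        node.children.append(self.chain())
        self.expect(")")
        return node

    def atom(self):
        for sym in TWO_CHAR_ATOMS:
            if self.text.startswith(sym, self.pos):
                self.pos += len(sym)
                return Node("atom", [Node(sym)])
        if self.peek() in ONE_CHAR_ATOMS:
            return self.terminal("atom")
        raise ValueError(f"expected an atom at position {self.pos}")

def leaves(node):
    """Terminal tokens in left-to-right order (parentheses are not stored)."""
    if not node.children:
        return [node.symbol]
    return [tok for child in node.children for tok in leaves(child)]

def count(node, symbol):
    """Number of nodes in the tree labelled with the given symbol."""
    return (node.symbol == symbol) + sum(count(c, symbol) for c in node.children)

def to_sexpr(node):
    """Render the tree as a nested S-expression for inspection."""
    if not node.children:
        return node.symbol
    return "(" + node.symbol + " " + " ".join(to_sexpr(c) for c in node.children) + ")"

acetic_acid = ToySmilesParser("CC(=O)O").parse()
print(to_sexpr(acetic_acid))
```

In this sketch the carbonyl group of acetic acid is attached as a `branch` child of the second carbon, which is the kind of local grouping a flat token sequence leaves implicit. A tree-to-sequence model would consume such a tree on the encoder side; how G-MATT encodes its trees is described in the paper itself, not here.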
- Publication: AIChE Journal
- Pub Date: January 2024
- DOI: 10.1002/aic.18244
- arXiv: arXiv:2305.03153
- Bibcode: 2024AIChE..70E8244Z
- Keywords: Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Formal Languages and Automata Theory; Computer Science - Symbolic Computation; Quantitative Biology - Quantitative Methods