Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention

doi:10.48550/arXiv.2501.00823

Transformers have achieved remarkable success across diverse domains, but their monolithic architecture presents challenges in interpretability, adaptability, and scalability. This paper introduces a novel modular Transformer architecture that explicitly decouples knowledge and reasoning through a generalized cross-attention mechanism to a shared knowledge base, specifically designed for effective knowledge retrieval. Critically, we provide a rigorous mathematical derivation demonstrating that the Feed-Forward Network (FFN) in a standard Transformer is a specialized case (a closure) of this generalized cross-attention, revealing its role in implicit knowledge retrieval and validating our design. This theoretical framework provides a new lens for understanding FFNs and lays the foundation for future research exploring enhanced interpretability, adaptability, and scalability, enabling richer interplay with external knowledge bases and other systems.
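The claim that an FFN is a special case of cross-attention can be illustrated with a small numerical check. The sketch below is not the paper's derivation, and the abstract does not reproduce its generalized cross-attention formulation. It assumes the common key-value reading of an FFN: the rows of the first weight matrix act as keys, the columns of the second act as values, and ReLU takes the place of softmax as the scoring nonlinearity. All names (`ffn`, `kv_attention`, `W1`, `W2`, `d_model`, `d_ff`) are illustrative and do not come from the source.

```python
# Illustrative sketch only: names, shapes, and the ReLU-scored "attention"
# are assumptions for this example, not the paper's notation or method.
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def ffn(x, W1, b1, W2, b2):
    """Standard FFN: W2 @ relu(W1 @ x + b1) + b2."""
    hidden = [max(0.0, a + b) for a, b in zip(matvec(W1, x), b1)]
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]


def kv_attention(query, keys, key_bias, values, out_bias):
    """Cross-attention-style read from a fixed memory of (key, value) pairs.

    Each score is relu(key . query + bias); the output is the
    score-weighted sum of the value vectors plus an output bias.
    """
    scores = [
        max(0.0, sum(k * q for k, q in zip(key, query)) + kb)
        for key, kb in zip(keys, key_bias)
    ]
    out = list(out_bias)
    for score, value in zip(scores, values):
        for i, v in enumerate(value):
            out[i] += score * v
    return out


random.seed(0)
d_model, d_ff = 4, 8
W1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [random.uniform(-1, 1) for _ in range(d_ff)]
W2 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b2 = [random.uniform(-1, 1) for _ in range(d_model)]
x = [random.uniform(-1, 1) for _ in range(d_model)]

keys = W1                                   # one key per hidden unit
values = [list(col) for col in zip(*W2)]    # one value per hidden unit

y_ffn = ffn(x, W1, b1, W2, b2)
y_att = kv_attention(x, keys, b1, values, b2)
max_diff = max(abs(a - b) for a, b in zip(y_ffn, y_att))
print(max_diff)  # agrees up to floating-point rounding
```

Under these assumptions the two functions compute the same output, which is the sense in which the FFN weights can be read as an implicit, fixed knowledge store that the hidden state queries. Replacing `keys` and `values` with entries from a shared external store is the kind of generalization the abstract describes, though the exact mechanism is defined in the paper itself.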

Publication: arXiv e-prints

Pub Date: January 2025

DOI: 10.48550/arXiv.2501.00823

arXiv: arXiv:2501.00823

Bibcode: 2025arXiv250100823G

Keywords: Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Computation and Language
