Disambiguating Symbolic Expressions in Informal Documents
Abstract
We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files, that is, determining their precise semantics and abstract syntax tree, as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, all of which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. We evaluate our model using a plurality of dedicated techniques, taking the syntax and semantics of symbolic expressions into account.
- Publication: arXiv e-prints
- Pub Date: January 2021
- DOI: 10.48550/arXiv.2101.11716
- arXiv: arXiv:2101.11716
- Bibcode: 2021arXiv210111716M
- Keywords: Computer Science - Machine Learning
- E-Print: ICLR 2021 conference paper