Disambiguating Symbolic Expressions in Informal Documents

doi:10.48550/arXiv.2101.11716

We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files (that is, determining their precise semantics and abstract syntax tree) as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. The baseline models we evaluated on this dataset failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. We evaluate our model using several dedicated techniques that take the syntax and semantics of symbolic expressions into account.

Publication: arXiv e-prints

Pub Date: January 2021

DOI: 10.48550/arXiv.2101.11716

arXiv: arXiv:2101.11716

Bibcode: 2021arXiv210111716M

Keywords: Computer Science - Machine Learning

E-Print: ICLR 2021 conference paper
