Scalable Micro-planned Generation of Discourse from Structured Data
Abstract
We present a framework for generating natural language description from structured data such as tables; the problem comes under the category of data-to-text natural language generation (NLG). Modern data-to-text NLG systems typically employ end-to-end statistical and neural architectures that learn from a limited amount of task-specific labeled data, and therefore, exhibit limited scalability, domain-adaptability, and interpretability. Unlike these systems, ours is a modular, pipeline-based approach, and does not require task-specific parallel data. It rather relies on monolingual corpora and basic off-the-shelf NLP tools. This makes our system more scalable and easily adaptable to newer domains. Our system employs a 3-staged pipeline that: (i) converts entries in the structured data to canonical form, (ii) generates simple sentences for each atomic entry in the canonicalized representation, and (iii) combines the sentences to produce a coherent, fluent and adequate paragraph description through sentence compounding and co-reference replacement modules. Experiments on a benchmark mixed-domain dataset curated for paragraph description from tables reveals the superiority of our system over existing data-to-text approaches. We also demonstrate the robustness of our system in accepting other popular datasets covering diverse data types such as Knowledge Graphs and Key-Value maps.
- Publication:
-
arXiv e-prints
- Pub Date:
- October 2018
- DOI:
- 10.48550/arXiv.1810.02889
- arXiv:
- arXiv:1810.02889
- Bibcode:
- 2018arXiv181002889L
- Keywords:
-
- Computer Science - Computation and Language
- E-Print:
- Accepted for Computational Linguistics journal on 17 Sep 2019