A Maximum Entropy Approach to Identifying Sentence Boundaries
Abstract
We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than that of similar systems, but we emphasize the simplicity of retraining for new domains.
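The classification setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature set (the text before and after the candidate within its token, plus two simple context flags), the function names, and the toy corpus are all assumptions made for illustration, and training uses plain stochastic gradient ascent on a binary maximum entropy (logistic) model rather than whatever estimation procedure the paper uses. Tokens are assumed to be separated by single spaces.

```python
import math
import re
from collections import defaultdict

CANDIDATE = re.compile(r"[.?!]")  # the three candidate boundary characters


def candidates(text):
    """Return the character offset of every '.', '?' and '!' in text."""
    return [m.start() for m in CANDIDATE.finditer(text)]


def features(text, i):
    """Binary context features for the candidate at offset i (hypothetical set)."""
    start = text.rfind(" ", 0, i) + 1        # start of the token holding the candidate
    end = text.find(" ", i)                  # end of that token
    if end == -1:
        end = len(text)
    next_word = text[end + 1:].split(" ", 1)[0]
    return [
        "bias",
        "char=" + text[i],
        "prefix=" + text[start:i].lower(),   # token text before the candidate
        "suffix=" + text[i + 1:end],         # token text after the candidate
        "next_cap=" + str(next_word[:1].isupper()),
        "at_end=" + str(end == len(text)),
    ]


def labelled_examples(sentences):
    """Turn a list of gold sentences into (features, label) pairs; 1 = boundary."""
    text = " ".join(sentences)
    gold, pos = set(), -1
    for s in sentences:
        pos += len(s)      # offset of the last character of this sentence
        gold.add(pos)
        pos += 1           # skip the joining space
    return [(features(text, i), int(i in gold)) for i in candidates(text)]


def train(examples, epochs=200, lr=0.1):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, y in examples:
            p = 1.0 / (1.0 + math.exp(-sum(w[f] for f in feats)))
            for f in feats:
                w[f] += lr * (y - p)
    return w


def is_boundary(w, feats):
    """Classify a candidate: True if the model scores it as a sentence boundary."""
    return sum(w.get(f, 0.0) for f in feats) > 0.0


# Toy annotated corpus (invented for this sketch).
SENTENCES = [
    "Mr. Smith arrived.",
    "He met Dr. Jones at 5 p.m. on Friday.",
    "Was it late?",
    "No.",
]
weights = train(labelled_examples(SENTENCES))

new_text = "Dr. Brown left."
for i in candidates(new_text):
    print(i, new_text[i], is_boundary(weights, features(new_text, i)))
```

Because nothing in the sketch depends on rules, lexica, or part-of-speech tags, retraining for a new genre would only require passing a different annotated sentence list to `labelled_examples`. With a corpus this small, predictions for unseen words are not meaningful; the sketch shows only the shape of the approach.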
- Publication: arXiv e-prints
- Pub Date: April 1997
- DOI: 10.48550/arXiv.cmp-lg/9704002
- arXiv: arXiv:cmp-lg/9704002
- Bibcode: 1997cmp.lg....4002R
- Keywords: Computer Science - Computation and Language
- E-Print: 4 pages, uses aclap.sty and covingtn.sty