A Maximum Entropy Approach to Identifying Sentence Boundaries

doi:10.48550/arXiv.cmp-lg/9704002

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than that of similar systems, but we emphasize the simplicity of retraining for new domains.
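The classification task described in the abstract can be illustrated with a minimal sketch. The abstract does not list the features or the parameter estimation procedure, so everything below is an assumption made for illustration: the function names (`candidates`, `features`, `train`, `prob`), the feature templates (the text before and after the punctuation mark within its token, and whether the next token is capitalized), and the use of plain stochastic gradient ascent on a binary logistic model, which is one way to fit a two-class maximum entropy model and not necessarily what the paper uses.

```python
import math
import re
from collections import defaultdict

CANDIDATE = re.compile(r"[.?!]")


def candidates(text):
    """Return the character offset of every '.', '?' and '!' in text."""
    return [m.start() for m in CANDIDATE.finditer(text)]


def features(text, i):
    """Hypothetical binary features drawn from the token containing offset i."""
    start = i
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = i + 1
    while end < len(text) and not text[end].isspace():
        end += 1
    prefix = text[start:i]        # token text before the punctuation mark
    suffix = text[i + 1:end]      # token text after the punctuation mark
    rest = text[end:].split(None, 1)
    nxt = rest[0] if rest else ""  # next whitespace-delimited token, if any
    return [
        "bias",
        "punct=" + text[i],
        "prefix=" + prefix.lower(),
        "suffix=" + suffix,
        "next_cap" if nxt[:1].isupper() else "next_not_cap",
    ]


def prob(weights, feats):
    """P(boundary | features) under a binary log-linear model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - prob(weights, feats)
            for f in feats:
                weights[f] += lr * error
    return weights


# Toy annotated corpus (invented for this sketch): offsets of true boundaries.
text = "Dr. Smith left. Mr. Jones stayed. Why?"
gold = {14, 32, 37}
examples = [(features(text, i), 1 if i in gold else 0) for i in candidates(text)]
model = train(examples)

# Classify every candidate in the training text with the learned weights.
predicted = {i for i in candidates(text) if prob(model, features(text, i)) > 0.5}
```

Retraining for a new genre or language in this sketch only requires a new annotated `text` and `gold` set, which mirrors the retraining simplicity the abstract emphasizes. A realistic system would need far more training data and a held-out evaluation.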

Publication: arXiv e-prints

Pub Date: April 1997

DOI: 10.48550/arXiv.cmp-lg/9704002

arXiv: arXiv:cmp-lg/9704002

Bibcode: 1997cmp.lg....4002R

Keywords: Computer Science - Computation and Language

E-Print: 4 pages, uses aclap.sty and covingtn.sty
