ACE-2005-PT: Corpus for Event Extraction in Portuguese

doi:10.48550/arXiv.2408.16928

Event extraction is an NLP task that commonly involves identifying the central word (trigger) of an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, such as PropBank, focus primarily on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the creation of ACE-2005-PT, we rely on automatic translators. This, however, poses the challenge of automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To address this challenge, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by an expert linguist. Comparing this subset against our pipeline's output yielded exact and relaxed match scores of 70.55% and 87.55%, respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
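The abstract names fuzzy matching as one alignment technique and reports exact and relaxed match scores. The following is only a minimal sketch of what such a step could look like, not the authors' pipeline: the function names, the 0.8 threshold, the sliding-window search and the definition of a relaxed match as any token overlap are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_align(annotation, sentence, threshold=0.8):
    """Return the (start, end) token span of `sentence` most similar to
    `annotation`, or None if no window reaches `threshold` (hypothetical helper)."""
    tokens = sentence.split()
    target = annotation.lower()
    n = len(target.split())
    best_score, best_span = 0.0, None
    # Try windows one token shorter, equal to, and one token longer than the annotation.
    for size in range(max(1, n - 1), n + 2):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size]).lower()
            score = SequenceMatcher(None, target, candidate).ratio()
            if score > best_score:
                best_score, best_span = score, (start, start + size)
    return best_span if best_score >= threshold else None


def exact_match(pred, gold):
    """Predicted span is identical to the manually aligned span."""
    return pred == gold


def relaxed_match(pred, gold):
    """Assumed definition: the two spans share at least one token."""
    return (pred is not None and gold is not None
            and pred[0] < gold[1] and gold[0] < pred[1])


if __name__ == "__main__":
    # An annotation translated in isolation may differ slightly (here, a missing accent)
    # from its form inside the translated sentence.
    span = fuzzy_align("a fabrica", "O presidente visitou a fábrica ontem")
    print(span, exact_match(span, (3, 5)), relaxed_match(span, (4, 5)))
```

Spans are half-open token index pairs, so a gold span covering only "fábrica" would be `(4, 5)`; under the assumed definitions it counts as a relaxed match for the predicted span but not as an exact one.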

Publication: arXiv e-prints

Pub Date: August 2024

DOI: 10.48550/arXiv.2408.16928

arXiv: arXiv:2408.16928

Bibcode: 2024arXiv240816928F

Keywords: Computer Science - Computation and Language; Computer Science - Artificial Intelligence

E-Print: SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024)
