DEAL: Detecting Entities in the Astrophysics Literature

Shared Task

Much astrophysics research relies on data from missions and facilities such as ground-based observatories in remote locations and space telescopes, as well as from digital archives holding large amounts of observed and simulated data. These missions and facilities are frequently named after historical figures or given ingenious acronyms which, unfortunately, are easily confused when searching for them in the literature via simple string matching. For instance, "Planck" can refer to the person, the mission, the constant, or several institutions. Automatically recognizing entities such as missions or facilities (i.e., Named Entity Recognition, or NER) would help tackle this word-sense disambiguation problem.
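To make the labeling concrete, here is a minimal, purely illustrative sketch of IOB-style token tags disambiguating two readings of "Planck"; the label names ("Mission", "Person") are examples of ours, not the task's official tag set.

    # Illustrative IOB tagging: the same surface form gets different labels
    # depending on context. Label names here are examples only.
    mission_use = [("The", "O"), ("Planck", "B-Mission"), ("satellite", "O"),
                   ("mapped", "O"), ("the", "O"), ("CMB", "O"), (".", "O")]
    person_use = [("Max", "B-Person"), ("Planck", "I-Person"),
                  ("introduced", "O"), ("energy", "O"), ("quanta", "O"), (".", "O")]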

Task

The task consists of building a system capable of identifying named entities in a dataset composed of full-text fragments and acknowledgements from the astrophysics literature. Any strategy is valid as long as it is automatic and requires no human intervention.

Dataset

We provide a dataset of acknowledgements and full-text fragments from the NASA ADS with manually tagged astronomical facilities and other entities of interest, e.g., archives and celestial objects. See here for the full list of entity types, with definitions and examples. We also provide baseline metrics obtained with an early version of the astroBERT model.
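As a rough sketch of what a record might look like, the snippet below reads one example, assuming a JSON Lines file with per-token labels; the file name and the field names ("unique_id", "tokens", "ner_tags") are assumptions on our part, so check the Hugging Face repository for the authoritative schema.

    import json

    # Read a single training record (file and field names are assumptions).
    with open("WIESP2022-NER-TRAINING.jsonl") as f:
        record = json.loads(f.readline())

    print(record["unique_id"])
    # Pair each token with its tag to inspect the annotation.
    print(list(zip(record["tokens"], record["ner_tags"]))[:10])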

Evaluation & Baseline

Submissions will be scored using both the seqeval F1-score (in the style of the CoNLL-2000 shared task) at the entity level and scikit-learn's Matthews correlation coefficient (MCC) at the token level. We also encourage authors to propose their own evaluation metrics. The baseline is computed using an astroBERT model fine-tuned for this NER task (see the detailed baseline scores). Submissions will be scored on Codalab.
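For reference, a minimal sketch of computing both scores locally with the seqeval and scikit-learn packages follows; the tag names are illustrative, and the official scoring scripts in the task repository remain authoritative.

    from seqeval.metrics import f1_score
    from sklearn.metrics import matthews_corrcoef

    # Entity-level F1 (seqeval): inputs are lists of IOB tag sequences.
    y_true = [["B-Mission", "O", "B-Person", "I-Person"]]
    y_pred = [["B-Mission", "O", "B-Person", "O"]]
    print("seqeval F1:", f1_score(y_true, y_pred))

    # Token-level MCC (scikit-learn): inputs are flat per-token label lists.
    flat_true = [tag for seq in y_true for tag in seq]
    flat_pred = [tag for seq in y_pred for tag in seq]
    print("MCC:", matthews_corrcoef(flat_true, flat_pred))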

Challenge

Can a different model/architecture/approach be more successful at recognizing astronomical named entities?
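One natural starting point, sketched below, is to fine-tune any Transformer encoder for token classification with the Hugging Face transformers library; the checkpoint, the label count, and the pre-tokenized train_ds variable are placeholders of ours, not part of the official baseline.

    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              Trainer, TrainingArguments)

    # Placeholders: swap in any encoder checkpoint; num_labels must match the
    # official tag set (2 * entity types + 1 under an IOB scheme).
    checkpoint = "bert-base-cased"
    num_labels = 63

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=num_labels)

    args = TrainingArguments(output_dir="deal-ner", num_train_epochs=3,
                             per_device_train_batch_size=16)
    # train_ds: a tokenized dataset with labels aligned to subword tokens
    # (preparation omitted; see the transformers token-classification docs).
    trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                      tokenizer=tokenizer)
    trainer.train()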

Participants will have the opportunity to present their findings during the workshop and to write a short paper. The authors of the best-performing or most interesting approaches may be invited to collaborate further with the NASA Astrophysics Data System.

Instructions for Participants

Participants should create accounts on Hugging Face to access the data. Instructions on how to format your predictions and compute your scores on the training set are available in the Hugging Face repository.
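If the data is hosted as a datasets-loadable repository, pulling it could look like the sketch below; the dataset ID and field name are our assumptions, and access may require logging in with your Hugging Face account first.

    from datasets import load_dataset

    # Dataset ID is an assumption; authenticate beforehand if the repo is
    # gated, e.g. via `huggingface-cli login`.
    ds = load_dataset("adsabs/WIESP2022-NER")
    print(ds)
    print(ds["train"][0]["tokens"][:10])  # "tokens" field name is an assumption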

Participants should also create accounts on Codalab and follow the instructions on how to submit their predictions for scoring.

Registration

Please fill in this form to report your intention to participate in the shared task:

https://forms.office.com/r/KKpeKJBLy3

Timeline

  • Training+Validation Data Release: June 1, 2022
  • Validation Phase: June 1 - July 31, 2022
  • Test Data Release: August 1, 2022
  • Final Scoring Period: August 1 - August 26, 2022
  • System Report Submission: September 12, 2022 (Final and Firm)
  • Notification: October 7, 2022
  • Camera-ready Submission Deadline: October 24, 2022
  • Event Date: November 20, 2022 (online)

Contact

You can contact us at WIESP_AACL2022 [at] softconf.com.