Reliability of deep learning models for anatomical landmark detection: The role of inter-rater variability

Automated detection of anatomical landmarks plays a crucial role in many diagnostic and surgical applications. Advances in deep learning (DL) methods have significantly improved performance on anatomical landmark detection tasks. While current research focuses on accurately localizing these landmarks in medical scans, the importance of inter-rater annotation variability in building DL models is often overlooked. Understanding how inter-rater variability affects the performance and reliability of the resulting DL algorithms, both of which are crucial for clinical deployment, can inform better construction of training data and improve DL model outcomes. In this paper, we conducted a thorough study of different annotation-fusion strategies for preserving inter-rater variability in DL models for anatomical landmark detection, aiming to boost the performance and reliability of the resulting algorithms. Additionally, we explored the characteristics and reliability of four metrics, including a novel Weighted Coordinate Variance metric to quantify landmark detection uncertainty/inter-rater variability. Our research highlights the crucial connection between inter-rater variability, DL model performance, and uncertainty, revealing how different approaches to multi-rater landmark annotation fusion can influence these factors.

Publication:

arXiv e-prints

Pub Date:

November 2024

DOI:

10.48550/arXiv.2411.17850

arXiv:

arXiv:2411.17850

Bibcode:

2024arXiv241117850S

Keywords:

Electrical Engineering and Systems Science - Image and Video Processing;
Computer Science - Computer Vision and Pattern Recognition

E-Print:

Accepted to SPIE Medical Imaging 2025
