UniEntrezDB: Large-scale Gene Ontology Annotation Dataset and Evaluation Benchmarks with Unified Entrez Gene Identifiers
Abstract
Gene studies are crucial for fields such as protein structure prediction, drug discovery, and cancer genomics, yet they face challenges in fully utilizing the vast and diverse information available. Gene studies require clean, factual datasets to ensure reliable results. Ontology graphs, neatly organized domain terminology graphs, provide ideal sources for domain facts. However, available gene ontology annotations are currently distributed across various databases without unified identifiers for genes and gene products. To address these challenges, we introduce Unified Entrez Gene Identifier Dataset and Benchmarks (UniEntrezDB), the first systematic effort to unify large-scale public Gene Ontology Annotations (GOA) from various databases using unique gene identifiers. UniEntrezDB includes a pre-training dataset and four downstream tasks designed to comprehensively evaluate gene embedding performance from gene, protein, and cell levels, ultimately enhancing the reliability and applicability of LLMs in gene research and other professional settings.
- Publication:
-
arXiv e-prints
- Pub Date:
- December 2024
- arXiv:
- arXiv:2412.12688
- Bibcode:
- 2024arXiv241212688M
- Keywords:
-
- Computer Science - Databases