Supervised Machine Learning Approach for Classifying Earth Science Publications
Abstract
The data collections archived and distributed by the NASA GES DISC data center are widely used in Earth Science studies. As these collections are created, many research works are published on their algorithms, validation, and applications. Because NASA data centers collect these publications for public use, it is helpful to categorize them by how they relate to their associated datasets. Specifically, a publication linked to a GES DISC dataset may use the dataset for applied research, describe the algorithm used to create it, validate it, or provide a general overview of the data collection. Currently, this categorization is done by manual labeling, and because the labeling task is simple, it is a good candidate for automation. To approach this problem, machine learning classifiers were developed to predict a publication's category. Manually labeled publications were used as the training data for two supervised machine learning algorithms, Random Forest and Multinomial Naïve Bayes. After balancing the dataset and applying the Multinomial Naïve Bayes algorithm, the classification accuracy achieved was substantially higher than the baseline accuracy, thus significantly improving the efficiency of publication labeling.
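The workflow described above (balance the labeled data, then train a Multinomial Naïve Bayes classifier) can be illustrated with a minimal sketch. The abstract does not state how the dataset was balanced, which text features were used, or which library was employed, so everything below is an assumption: random oversampling stands in for the balancing step, whitespace-split word counts stand in for the features, the classifier is a small standard-library implementation with Laplace smoothing, and the six training sentences and their labels are invented purely for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed feature extraction: lowercase bag of words.
    return text.lower().split()


def balance_by_oversampling(samples, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size.

    Oversampling is an assumption; the abstract does not name the balancing method.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


class MultinomialNB:
    """Multinomial Naive Bayes over word counts with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, samples):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in samples:
            self.class_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled examples; the category names mirror the four
# categories in the abstract, but the texts are invented.
labeled = [
    ("we use precipitation data to study drought trends", "application"),
    ("regional flood analysis using satellite precipitation data", "application"),
    ("aerosol data applied to air quality study", "application"),
    ("retrieval algorithm description for the precipitation product", "algorithm"),
    ("validation of the product against ground station measurements", "validation"),
    ("overview of the data collection and its archive", "overview"),
]

balanced = balance_by_oversampling(labeled)
model = MultinomialNB().fit(balanced)
print(model.predict("validation against ground measurements"))
```

In this toy setup, balancing matters because the unbalanced data would give the "application" class a prior three times larger than any other class. A real experiment would also hold out a test set and compare accuracy against a baseline, as the abstract describes.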
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2021
- Bibcode: 2021AGUFMIN45C0470D