Domain-agnostic Outlier Ranking Algorithms (DORA): A Configurable Pipeline for Outlier Detection in Scientific Datasets
Abstract
Automatic detection of out-of-distribution (OOD) samples or features is universally needed when working with scientific datasets. This capability can be used for cleaning datasets, e.g., flagging ground-truth labels with GPS or human-entry errors or identifying wrongly categorized objects in a catalog. It could also be used for discovery, e.g., to flag novel samples in order to guide instrument acquisition or scientific analysis. We are developing Domain-agnostic Outlier Ranking Algorithms (DORA) to provide a configurable pipeline that makes it easy for scientists to apply and evaluate outlier detection methods in a variety of domains. DORA allows users to configure outlier detection experiments by specifying the location of their dataset(s), the input data type, the feature extraction methods, and the algorithms to apply. DORA will support image, raster, time series, and feature vector input data types, and outlier detection methods that include Isolation Forest, Discovery via Eigenbasis Modeling of Uninteresting Data (DEMUD), principal component analysis (PCA), the Reed-Xiaoli (RX) detector, Local RX, negative sampling, and a probabilistic autoencoder. Given a set of data samples, each algorithm assigns an outlier score to each sample. DORA provides results organization and visualization modules to help users process the results, including sorting samples by outlier score, evaluating outlier recall for a set of known/labeled outliers, clustering groups of similar outliers together, and producing causal graphs. To demonstrate the cross-divisional applicability of DORA, we are conducting outlier detection experiments for four use cases: 1) cleaning noisy ground-truth datasets (Earth science), 2) detecting volcanic thermal anomalies (Earth science), 3) flagging non-astrophysical objects in the Dark Energy Survey (astrophysics), and 4) novelty-guided onboard targeting of instruments for Mars rovers (planetary science).
DORA will be made openly available as a pip-installable Python package for use by the broader scientific community.
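The score, rank, and evaluate steps described in the abstract can be illustrated with a minimal Python sketch. This is not DORA's code or API: the function names (`diagonal_rx_scores`, `rank_by_score`, `recall_at_k`) and the toy data are hypothetical, and the scoring function is a deliberate simplification of the RX detector that keeps only per-feature variances instead of the full covariance matrix.

```python
import statistics


def diagonal_rx_scores(samples):
    """Score each sample by its variance-scaled squared distance from the mean.

    Simplification (assumption): the RX detector uses the full covariance
    matrix (Mahalanobis distance); this stand-in uses per-feature variances.
    """
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    # Population variance per feature; fall back to 1.0 for constant features.
    variances = [statistics.pvariance(col) or 1.0 for col in columns]
    return [
        sum((x - m) ** 2 / v for x, m, v in zip(row, means, variances))
        for row in samples
    ]


def rank_by_score(scores):
    """Return sample indices sorted from most to least outlying."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranking, known_outliers, k):
    """Fraction of known outliers that appear among the top-k ranked samples."""
    known = set(known_outliers)
    if not known:
        raise ValueError("known_outliers must not be empty")
    return len(known.intersection(ranking[:k])) / len(known)


# Hypothetical toy feature vectors; indices 4 and 7 are planted outliers.
samples = [
    (1.0, 2.0), (1.1, 1.9), (0.9, 2.1), (1.0, 2.2),
    (8.0, 2.0), (1.2, 2.0), (0.8, 1.8), (1.0, -6.0),
]
scores = diagonal_rx_scores(samples)
ranking = rank_by_score(scores)
print("ranking:", ranking)
print("recall@2:", recall_at_k(ranking, known_outliers=[4, 7], k=2))
```

On this toy data both planted outliers land in the top two positions, so recall at k=2 is 1.0. Any of the methods listed in the abstract could replace the scoring function, since ranking and recall depend only on the per-sample scores.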
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2021
- Bibcode: 2021AGUFMIN11B..02K