Semantic Operators: A Declarative Model for Rich, AI-based Analytics Over Text Data

doi:10.48550/arXiv.2407.11418

Semantic Operators: A Declarative Model for Rich, AI-based Analytics Over Text Data

The semantic capabilities of language models (LMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems lack high-level abstractions to perform bulk semantic queries across large corpora. We introduce semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for bulk semantic queries (e.g., filtering, sorting, joining or aggregating records using natural language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans similar to relational operators. We implement our operators in LOTUS, an open source query engine with a DataFrame API. Furthermore, we develop several novel optimizations that take advantage of the declarative nature of semantic operators to accelerate semantic filtering, clustering and join operators by up to $400\times$ while offering statistical accuracy guarantees. We demonstrate LOTUS' effectiveness on real AI applications including fact-checking, extreme multi-label classification, and search. We show that the semantic operator model is expressive, capturing state-of-the-art AI pipelines in a few operator calls, and making it easy to express new pipelines that achieve up to $180\%$ higher quality. Overall, LOTUS queries match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to 28$\times$ faster. LOTUS is publicly available at https://github.com/stanford-futuredata/lotus.

Publication:

arXiv e-prints

Pub Date:

July 2024

DOI:

10.48550/arXiv.2407.11418

arXiv:

arXiv:2407.11418

Bibcode:

2024arXiv240711418P

Keywords:

Computer Science - Databases;
Computer Science - Artificial Intelligence;
Computer Science - Computation and Language

ADS

Semantic Operators: A Declarative Model for Rich, AI-based Analytics Over Text Data

Abstract