Building a serverless Data Lakehouse from spare parts
Abstract
The recently proposed Data Lakehouse architecture is built on open file formats, performance, and first-class support for data transformation, BI and data science. While the vision stresses the importance of lowering the barrier for data work, existing implementations often struggle to live up to user expectations. At Bauplan, we decided to build a new serverless platform to fulfill the Lakehouse vision. Since building from scratch is a challenge unfit for a startup, we started by re-using (sometimes unconventionally) existing projects, and then invested in improving the areas that would give us the highest marginal gains for the developer experience. In this work, we review user experience, high-level architecture and tooling decisions, and conclude by sharing plans for future development.
- Publication: arXiv e-prints
- Pub Date: August 2023
- arXiv: arXiv:2308.05368
- Bibcode: 2023arXiv230805368T
- Keywords: Computer Science - Databases; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Software Engineering
- E-Print: Paper accepted for the Second International Workshop on Composable Data Management Systems (@ VLDB 2023)