Thought it was gonna be a "build your own SQLite" or something
Our approach has been to take pieces of DF (including the SQL frontend and expression engine) but embed them in our own dataflow and operators. This allows us to support low latency, distribution, watermark processing, and consistent checkpointing.
But the great thing about DF is that it’s designed as a toolkit for SQL-oriented data processing, so it’s relatively easy to pick and use just the pieces you need.
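For anyone unfamiliar with the "watermark processing" mentioned above, here is a minimal Python sketch of the general idea: count events per tumbling event-time window, and only emit a window once the watermark (the largest event time seen so far, minus an allowed lateness) has passed the window's end. This is a toy illustration, not how Arroyo or DataFusion actually implement it; the class name `TumblingWindowCounter` and the parameters `window_size` and `max_lateness` are made up for the example.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Toy event-time window counter driven by a watermark (hypothetical names)."""

    def __init__(self, window_size, max_lateness):
        self.window_size = window_size      # window length, in event-time units
        self.max_lateness = max_lateness    # how far out of order events may arrive
        self.max_event_time = None          # largest event time seen so far
        self.open_windows = defaultdict(int)  # window start -> event count

    @property
    def watermark(self):
        # Assumption: watermark = max event time seen minus the allowed lateness.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_lateness

    def process(self, event_time):
        """Ingest one event; return the (window_start, count) pairs it closes."""
        window_start = event_time - event_time % self.window_size
        window_end = window_start + self.window_size

        # Drop events whose window the watermark has already passed.
        wm = self.watermark
        if wm is not None and window_end <= wm:
            return []

        self.open_windows[window_start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

        # Emit every open window that ends at or before the new watermark.
        wm = self.watermark
        closed = sorted(s for s in self.open_windows if s + self.window_size <= wm)
        return [(s, self.open_windows.pop(s)) for s in closed]


if __name__ == "__main__":
    counter = TumblingWindowCounter(window_size=10, max_lateness=5)
    for t in [1, 3, 12, 8, 16, 2, 27]:
        print(t, counter.process(t))
```

The out-of-order event at time 8 still lands in its window because the watermark has not yet reached that window's end, while the event at time 2 arrives after the window has been emitted and is dropped. A real engine also has to propagate watermarks across distributed operators and include this state in checkpoints, which the sketch ignores.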
crossing fingers for solutions like `https://github.com/feldera/feldera` to be wrapped in a nice database, `https://materialize.com/` to solve their memory issues, or `https://clickhouse.com/docs/en/materialized-view` to solve reliable streaming consumption.
Stream processing frameworks often have domain-specific languages with a lot of limitations on how you can express aggregations and transformations.
Disclaimer: I work at Materialize
Recently there have been major improvements in Materialize's memory usage, as well as support for using disk to swap out some data.
I find it pretty easy to hook up to Postgres/MySQL/Kafka instances: https://materialize.com/blog/materialize-emulator/
(As an aside, Feldera doesn't want to be embedded into your app, neither does Materialize, and ClickHouse might just pull a great streaming library out of nowhere; they seem to be good at just doing stuff like that.)
Our focus has been on simplifying data activation pipelines with tools like Multiwoven, which aims to bridge the gap between static and dynamic data needs by supporting connectors for both traditional databases and real-time platforms like Kafka. However, the need for more embedded, developer-friendly streaming solutions is clear, and it’s exciting to see the progress in projects like Arroyo, Materialize, and ClickHouse.
For us, the balance lies in usability and flexibility—how can we empower teams to embed robust data capabilities (whether streaming or batch) into their workflows without overloading on infrastructure complexity? As this ecosystem evolves, we’re optimistic about collaborating and contributing to solutions that make streaming SQL as accessible as traditional SQL.
Looking forward to seeing how this space develops—and kudos to the teams pushing boundaries! https://github.com/Multiwoven/multiwoven/
https://github.com/apache/datafusion/issues/13448
Disclaimer: I am on the PMC of Apache DataFusion, so I am totally a fanboy.