Fresh Hacker News | Building Databases over a Weekend

▲ Building Databases over a Weekend (denormalized.io)

108 points by ambrood 225 days ago | 7 comments

> In this post we take you on a walkthrough on how you can use DataFusion

Thought it was gonna be a "build your own SQLite" or something

▲Gepsens 224 days ago

I remember that two years ago someone proposed adding stream processing to DataFusion, and PRs followed. But IMO stream processing is an entirely different beast, though some people could use DF's SQL engine for it. There are Rust projects like Arroyo.

▲necubi 224 days ago

Creator of Arroyo here—we agree that stream processing is a different beast and needs different infrastructure from a batch engine like DataFusion.

Our approach has been to take pieces of DF (including the SQL frontend and expression engine) and embed them in our own dataflow and operators. This allows us to support low latency, distribution, watermark processing, and consistent checkpointing.
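For readers unfamiliar with the term, here is a minimal sketch of what watermark tracking can look like, assuming a bounded-out-of-orderness policy (watermark = largest event time seen minus a fixed allowed lateness). The class name and its methods are hypothetical and only for illustration; this is not Arroyo's or DataFusion's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundedLatenessWatermark:
    """Hypothetical helper: watermark = max event time seen - max_lateness."""

    max_lateness: int                     # allowed out-of-orderness, in seconds
    max_event_time: Optional[int] = None  # largest event timestamp seen so far

    def observe(self, event_time: int) -> None:
        # The watermark only moves forward, so keep the largest timestamp.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def current(self) -> Optional[int]:
        # Until the first event arrives there is no watermark.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_lateness

    def is_late(self, event_time: int) -> bool:
        # An event is late if it is older than the current watermark.
        wm = self.current()
        return wm is not None and event_time < wm


wm = BoundedLatenessWatermark(max_lateness=5)
for t in [10, 12, 11, 20]:
    wm.observe(t)
print(wm.current())    # 15
print(wm.is_late(13))  # True
```

A window operator would typically emit a window once the watermark passes the window's end, which is what lets a streaming engine produce results without waiting forever for stragglers.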

But the great thing about DF is that it’s designed as a toolkit for SQL-oriented data processing, so it’s relatively easy to pick and use just the pieces you need.

▲knuckleheads 224 days ago

I’ve been messing around with SQL and stream processing off and on for the last few months via https://github.com/zmaril/bpfquery and then https://github.com/zmaril/zquery, so I very much feel this comment. I didn’t want to build out my own stream processing architecture in bpfquery; it was getting pretty tough pretty fast, so I switched over to a DataFusion backend in zquery in the hope that it could do stream processing well. It handles static data really well, much better than the home-grown half engine I made in bpfquery, but streaming SQL isn’t easily possible at the moment. Everybody is building their own implementation and trying to upstream what they need, with no coherent whole from DataFusion. I was looking into making an attempt with Arroyo sometime, but I think the authors want that code to be used as a standalone binary and not as a library in something else, based on my last impression of it a while back. So maybe in a few years it’ll be as easy to make a streaming database as it is now to make a normal one, but that’s not the case currently.

▲hantusk 224 days ago

I agree. So many disparate solutions. The streaming SQL primitives are good enough by themselves (e.g. `tumble`, `hop` or `session` windows), but the infrastructural components are always rough in real-life use cases.
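To make the three window types concrete, here is a rough sketch of how events might be assigned to them. The function names mirror the SQL primitives, but the signatures and semantics are assumptions for illustration (integer timestamps in seconds, half-open windows `[start, end)`, a new session once consecutive events are `gap` or more apart); real engines differ in the details.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def tumble(ts: int, size: int) -> Window:
    """Fixed-size, non-overlapping windows: each event lands in exactly one."""
    start = ts - ts % size
    return (start, start + size)


def hop(ts: int, size: int, slide: int) -> List[Window]:
    """Overlapping windows of length `size` that start every `slide` seconds."""
    windows = []
    start = ts - ts % slide          # latest window start at or before ts
    while start > ts - size:         # keep windows that still contain ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session(timestamps: Iterable[int], gap: int) -> List[Window]:
    """Group events into sessions separated by at least `gap` of inactivity."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts     # extend the current session
        else:
            sessions.append([ts, ts])  # start a new session
    # A session closes `gap` seconds after its last event.
    return [(first, last + gap) for first, last in sessions]


print(tumble(125, 60))                       # (120, 180)
print(hop(125, 60, 30))                      # [(90, 150), (120, 180)]
print(session([1, 2, 3, 20, 21, 50], 10))    # [(1, 13), (20, 31), (50, 60)]
```

The assignment logic is the easy part. Buffering state per window, deciding when a window is complete, and recovering after a crash are the infrastructural pieces that tend to be rough.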

Crossing fingers for solutions like `https://github.com/feldera/feldera` to be wrapped in a nice database, `https://materialize.com/` to solve their memory issues, or `https://clickhouse.com/docs/en/materialized-view` to solve reliable streaming consumption.

Streaming processing frameworks often have domain-specific languages with a lot of limitations on how aggregations and transformations can be expressed.

▲def- 224 days ago

> [...] `https://materialize.com/` to solve their memory issues [...]

Disclaimer: I work at Materialize

Recently there have been major improvements in Materialize's memory usage as well as using disk to swap out some data.

I find it pretty easy to hook up to Postgres/MySQL/Kafka instances: https://materialize.com/blog/materialize-emulator/

▲knuckleheads 224 days ago

Yeah, I have a feeling something like Polars for streaming would be super popular and useful, but it just hasn't happened yet. It's much easier to just run, say, Kafka and a long-running Python script and write out the transformations by hand than it is to use anything on the market right now. None of the current stream processors want to be embedded as far as I can tell; that's not where the money is. They all want to be paid to run it in the cloud for you and follow the VC playbook. Which, fair! I do think there's a lot of space out there that isn't being occupied, though, and I hope somebody tries to fill it soon.

(As an aside, Feldera doesn't want to be embedded into your app, nor does Materialize, and ClickHouse might just pull a great streaming library out of nowhere; they seem to be good at doing stuff like that.)
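As a rough illustration of the "long-running script with hand-written transformations" approach, here is a sketch of a per-key count over one-minute tumbling windows. Everything in it is an assumption for the example: `fake_consumer` is a hypothetical stand-in for a real Kafka consumer loop, events are `(timestamp_seconds, key)` pairs, and they are assumed to arrive in timestamp order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Optional, Tuple

Event = Tuple[int, str]  # (event time in seconds, key)


def fake_consumer() -> Iterator[Event]:
    """Hypothetical stand-in for polling a Kafka topic; yields ordered events."""
    yield from [(1, "a"), (2, "b"), (3, "a"), (61, "a"), (62, "b"), (125, "a")]


def count_per_window(
    events: Iterable[Event], size: int
) -> Iterator[Tuple[int, str, int]]:
    """Yield (window_start, key, count) each time a tumbling window closes.

    Assumes in-order events, so a window closes as soon as an event from a
    later window shows up. There is no late-data handling and no checkpointing.
    """
    counts: Dict[str, int] = defaultdict(int)
    current_start: Optional[int] = None
    for ts, key in events:
        start = ts - ts % size
        if current_start is not None and start != current_start:
            # An event from a new window arrived: flush the old window.
            for k in sorted(counts):
                yield (current_start, k, counts[k])
            counts.clear()
        current_start = start
        counts[key] += 1
    # Flush the final window when the input ends (a real consumer never ends).
    if current_start is not None:
        for k in sorted(counts):
            yield (current_start, k, counts[k])


for row in count_per_window(fake_consumer(), size=60):
    print(row)
```

This is about as simple as it gets, which is the appeal. It also shows what is missing: restart the process and the in-memory counts are gone, and a single out-of-order event lands in the wrong window.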

▲maximus93 224 days ago

Great discussion here! At AI Squared, we have also been exploring the evolving landscape of stream processing and SQL engines. While batch engines like DataFusion excel at handling static data, we recognize the challenges around integrating streaming capabilities and infrastructure seamlessly.

Our focus has been on simplifying data activation pipelines with tools like Multiwoven, which aims to bridge the gap between static and dynamic data needs by supporting connectors for both traditional databases and real-time platforms like Kafka. However, the need for more embedded, developer-friendly streaming solutions is clear, and it’s exciting to see the progress in projects like Arroyo, Materialize, and ClickHouse.

For us, the balance lies in usability and flexibility—how can we empower teams to embed robust data capabilities (whether streaming or batch) into their workflows without overloading on infrastructure complexity? As this ecosystem evolves, we’re optimistic about collaborating and contributing to solutions that make streaming SQL as accessible as traditional SQL.

Looking forward to seeing how this space develops—and kudos to the teams pushing boundaries! https://github.com/Multiwoven/multiwoven/

▲dangoodmanUT 224 days ago

this post feels like it's skipping over a lot of code that could be included

▲JoeOfTexas 224 days ago

Step 1. Choose a color Step 2. Finish the database

▲ambrood 224 days ago

Thanks for the feedback! The first version had a lot more detailed code, but we decided to link to our GitHub rather than copy all the code. We wanted to illustrate the core touch points involved in extending DF.

▲ztratar 224 days ago

I almost thought the opposite, but I'm no DB guy.

▲alamb 224 days ago

BTW here is a fun exercise that takes this idea to the extreme. Who can build a custom file format that gets the best ClickHouse performance (on DataFusion):

https://github.com/apache/datafusion/issues/13448

Disclaimer: I am on the PMC of Apache DataFusion, so I am totally a fanboy.
