433 points by dbreunig 1 day ago | 29 comments
wenc 23 hours ago
I'm a big fan of DuckDB and I do geospatial analysis, mostly around partitioning geographies (into Uber H3 hexagons), calculating Haversine distances, calculating areas of geometries, figuring out which geometry a point falls in, etc. Many of these features have existed in some form or other in geopandas or postgis, so DuckDB's spatial extensions bring nothing new.

But what DuckDB as an engine does is it lets me work directly on parquet/geoparquet files at scale (vectorized and parallelized) on my local desktop. It beats geopandas in that respect. It's a quality of life improvement to say the least.

DuckDB also has an extension architecture that admits more exotic geospatial features like Hilbert curves and Uber H3 support.

https://duckdb.org/docs/stable/extensions/spatial/functions....

https://duckdb.org/community_extensions/extensions/h3.html
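
To make that concrete, here's roughly what the H3 partitioning part looks like for me (a sketch: the file path and lat/lng column names are made up, and it's worth double-checking the docs for the exact function names and the axis order the sphere functions expect):

  INSTALL spatial; LOAD spatial;
  INSTALL h3 FROM community; LOAD h3;

  -- bucket points into H3 hexagons and count them, reading parquet straight off disk
  SELECT h3_latlng_to_cell(pickup_lat, pickup_lng, 8) AS hex, count(*) AS n
  FROM '/data/trips/*.parquet'
  GROUP BY hex
  ORDER BY n DESC;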

sroerick 4 hours ago
I totally agree with this. DuckDB for me was a huge QoL improvement just working with random datasets. I found it much easier to explore datasets using DuckDB rather than Pandas, Postgres or Databricks.

The spatial features were just barely out when I was last doing a lot of heavy geospatial work, but even then they were very nice.

As an aside, I had a junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

pandas is very nice and was my bread and butter for a long time, but I frequently ran into memory issues and problems at scale that I would never hit with polars or duckdb. I'm not sure if this holds true today as I know there have been updates, but it was certainly a problem then. Geopandas ran into the same issues.

Just using GDAL and other libraries out of the box is frankly not a great experience. If you have a QGIS (another wonderful tool) workflow, it's frustrating to be dropping into Jupyter notebooks to do translations, but that seemed to be the best option.

In general, it just feels like geospatial analysis is about 10 years behind regular data analysis. Shapefiles are common because of ESRI dominance, but frankly not a great format. PostGIS is great, geopandas is great, but there's a lot more things in the data ecosystem than just Postgres and pandas. PowerBI barely had geospatial support a couple years ago. I think PowerBI Shapemaps exclusively used TopoJSON?

All of this is to say, DuckDB geospatial is very cool and helpful.

ngrilly 23 minutes ago
> An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

What was shockingly useful in PowerBI compared to DuckDB?

wodenokoto 14 hours ago
Why do you use haversine over geodesic or reprojection?

I’ve been doing the reprojection thing, projecting coordinates to a “local” CRS, for previous projects mainly because that’s what geopandas recommends and is built around, but I am reaching a stage where I’d like to calculate distances for objects all over the globe, and I’m genuinely interested to learn what’s a good choice here.

code_biologist 12 hours ago
Just an app dev, not a geospatial expert, but reprojection always seemed like something a library should handle under the hood unless one has specific needs. I'm used to the ergonomics / moron-proofing of something like PostGIS' `ST_Distance(geography point1, geography point2)`, which gives you the right answer in meters. You can easily switch to spherical or Cartesian distances if you need distance calculations to go faster. `ST_Area(geography geog)` gives you the size of your shape in square meters wherever it is on the planet.
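
For instance (PostGIS, with invented table and column names, assuming the geometries are stored as lon/lat in EPSG:4326):

  -- geodesic distance in meters on the spheroid (what the geography type gives you)
  SELECT ST_Distance(s.geom::geography, c.geom::geography) AS meters
  FROM stores s JOIN customers c ON c.store_id = s.id;

  -- faster spherical approximation, still in meters
  SELECT ST_DistanceSphere(s.geom, c.geom)
  FROM stores s JOIN customers c ON c.store_id = s.id;

  -- area in square meters, anywhere on the planet
  SELECT ST_Area(geom::geography) FROM parcels;
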
wenc 3 hours ago
Reprojection is accurate locally but inaccurate at scale.

Geodesics are the most accurate (Vincenty etc) but are computationally heavy.

Haversine is a nice middle ground.

colkassad 9 hours ago
Look into Vincenty [1] or Karney (for a more robust solution) [2]. Vincenty should be good enough for most use cases.

[1] https://en.wikipedia.org/wiki/Vincenty's_formulae

[2] https://github.com/pbrod/karney

Demiurge 21 hours ago
> Prior to this, getting up and running from a cold-start might’ve required installing or even compiling several OSS packages, carefully noting path locations, standing up a specialized database… Enough work that a data generalist might not have bothered, or their IT department might not have supported it.

I've been able to "CREATE EXTENSION postgis;" for more than a decade. There have been spatial extensions for PG, MySQL, Oracle, MS SQL Server, and SQLite for a long time. DuckDB doesn't make any material difference in how easy it is to install.

wenc 20 hours ago
That requires data to already be in Postgres, otherwise you have to ETL data into it first.

DuckDB on the other hand works with data as-is (Parquet, TSV, sqlite, postgres... whether on disk, S3, etc.) without requiring an ETL step (though if the data isn't already in a columnar format, things are gonna be slow... but it will still work).

I work with Parquet data directly with no ETL step. I can literally drop into Jupyter or a Python REPL and run duckdb.query("from '*.parquet'")

Correct me if I'm wrong, but I don't think that's possible with Postgis. (even pg_parquet requires copying? [1])

[1] https://www.crunchydata.com/blog/pg_parquet-an-extension-to-...

Demiurge 20 hours ago
Yeah, if you want to work with GeoParquet, and you want to keep your data in that format. I can see how that's easier, to use your example. That's not what a lot of geospatial data is in. You might have shapefiles, geopackages, geojsons, who knows? There is a lot of software, from QGIS to ESRI, for working with different formats to solve different problems. I don't think GeoParquet, even though it might be the fastest geospatial vector data format right now, is that common, and the article did not claim that either. So, for an average user trying to answer some GIS question, some ETL is pretty much a given. And given that, installing PostGIS and installing DuckDB both require some ETL, and learning some query and analytics language. DuckDB might be an improvement, but it's certainly not as much of a leap as the quote is making it out to be.
sroerick 4 hours ago
From the DuckDB geospatial docs:

SELECT * FROM './path/to/some/shapefile/dataset.shp';

COPY table TO 'some/file/path/filename.geojson' WITH (FORMAT gdal, DRIVER 'GeoJSON', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');

This seems extremely helpful to me.

Demiurge 4 hours ago
And ogr2ogr is also very helpful:

ogr2ogr -f "PostgreSQL" -t_srs "EPSG:2274" PG:"host=host user=user dbname=database password=trustno1 schemas=schema" shapefile.shp -nln new_name_for_table

I'll give DuckDB a shot and compare to postgis. Eventually, I think the issue for me would be that I use all the vector outputs via apps connecting to the same database.

jeffbee 19 hours ago
Yeah, just an example of a QoL issue with DuckDB: even though it can glob files in other cases, the way it passes parameters to GDAL means that globs are taken literally instead of expanded. So I can't query a directory with thirty million geojson files. This is not a problem in geopandas because ipython, being a full interactive development environment, allows me to produce the glob any way I choose.

I think this is a fundamental problem with the SQL pattern. You can try to make things just work, but when they fail, then what?

maxxen 19 hours ago
I think this is just because it hasn't been implemented in spatial yet. DuckDB is currently going through a pretty big refactor of the way we glob/scan/union multiple files with all the recent focus on data lake formats, but my plan is to get to it in spatial after the next release, when that part of the code has stabilized a bit.
ffsm8 14 hours ago
> fundamental problem with the SQL pattern.

SQL is a DSL and yes, all Domain Specific Languages will only enable what the engine parsing the DSL supports.

But all SQL databases I'm aware of let you write custom extensions, which are exactly that: they extend the base functionality of the database with new paradigms. E.g. postgis enabling geospatial in postgres, or the extensions that enable fuzzy matching/searching.

And as SQL is pretty much a Turing-complete DSL, there is very little you can't do with it, even if the syntax might not agree with everyone.

edoceo 20 hours ago
Not wrong. Load to PG, then query. DuckDB's UVP is like bringing 8 common tools/features under one tent.
paradox460 3 hours ago
The original article feels a tremendous amount like another piece of DuckDB marketing, from the breathless admiration to the baseless claims like the title
jokoon 9 hours ago
I tested spatialite, it works okay, but the setup is a bit tedious when inserting data.
larsiusprime 23 hours ago
“import geopandas” also exists and has for some time. Snark aside, WHAT is special about duckDB? I wish the author had actually shown some practical examples so I could understand their claims better.
maxxen 22 hours ago
I replied to another comment, but I think a big part is that DuckDB's spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.e. there are no transitive dependencies except libc.

Yes, DuckDB does a whole lot more: vectorized larger-than-memory execution, columnar compressed storage and an ecosystem of other extensions that make it more than the sum of its parts. But while I've been working hard on making the spatial extension more performant and more broadly useful (I designed a new geometry engine this year, and spatial join optimization just got merged on the dev branch), the fact that you can e.g. convert to and from a myriad of different geospatial formats by utilizing GDAL, transform through SQL, or pull down the latest Overture dump without having the whole workflow break just because you updated QGIS has probably been the main killer feature for a lot of the early adopters.
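
As a rough illustration of that conversion path (the file names, EPSG codes and geometry column name here are placeholders):

  LOAD spatial;

  -- read a shapefile through the bundled GDAL, reproject, write Parquet
  COPY (
      SELECT * REPLACE (ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom)
      FROM ST_Read('boundaries.shp')
  ) TO 'boundaries.parquet' (FORMAT parquet);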

(Disclaimer, I work on duckdb-spatial @ duckdblabs)

carstonh 6 hours ago
Here's an example building on @maxxen's work - because DuckDB (+spatial extension) can compile to Wasm, I built an in-browser Shapefile to CSV converter tool: https://www.honeycombmaps.com/tools/shapefile-to-csv-convert...
timschmidt 22 hours ago
I'm not the OP, but thank you for such a detailed answer. The integration and reduced barriers to entry you mention mirror my own experiences with tooling in another area, and your explanation made parallels clear.
dahauns 6 hours ago
As someone unfamiliar with DuckDB but at least somewhat with geospatial tools (it's been a few years, though): Dang - see, now that is seriously cool. The whole ETL shebang always was the biggest hassle, even with serious commercial tools, and the idea of a stable, all-in-one, ready-to-go layer is incredibly appealing.

It's just something the writer of the article should probably have at least mentioned when going full hyperbole with the title (hey, after reading this, it might actually be justified! :) ).

Rereading the article that focuses on the one thing that isn't a standout (the installation itself), though, I can't help but chuckle and think "worst sales pitch ever". ;)

Amadiro 10 hours ago
Is there any strong reason to use GeoParquet instead of straight up parquet if all I'm interested in is storing and operating on lat/lons?

I'm curious if it compresses them better or something like that. I see lots of people online saying it compresses well (but mostly compared to .shp or similar) but normal parquet (.gz.parquet or .snappy.parquet) already does that really well. So it's not clear to me if I should spend time investigating it...

I mostly process normal parquet with spark and sometimes clickhouse right now.

carstonh 6 hours ago
Based on my reading of the GeoParquet spec, the main difference is that geometries are stored as WKB using Parquet's byte array type. Byte arrays can be delta-encoded. There is also some additional metadata stored like CRS and a bounding box.

When using EPSG:4326 lat/lons, I don't think GeoParquet would give you any benefits over just having separate columns (this is what I typically do, and it's plenty fast).

If you are using range requests to fetch only parts of parquet files at a time, you could potentially sort your data using a hilbert curve, which could limit the number of row groups that need to be fetched to execute a query.
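
A cheap version of that trick works with plain parquet and separate lat/lng columns too (paths and column names are made up, and sorting on rounded coordinates is just a crude stand-in for a proper Hilbert sort): clustering nearby rows into the same row groups lets the min/max statistics prune most of the file.

  -- rewrite so spatially close points land in the same row groups
  COPY (
      SELECT * FROM read_parquet('points/*.parquet')
      ORDER BY round(lat, 1), round(lng, 1)
  ) TO 'points_sorted.parquet' (FORMAT parquet);

  -- a bounding-box filter now only touches a few row groups
  SELECT count(*) FROM 'points_sorted.parquet'
  WHERE lat BETWEEN 40.6 AND 40.9 AND lng BETWEEN -74.1 AND -73.7;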

larsiusprime 21 hours ago
This is an excellent reply and what I wish the article had been, thanks!
vemom 21 hours ago
Oh yeah the article does stop abruptly after "2 lines to install"
jjtheblunt 22 hours ago
DuckDB has parquet support and can operate, in SQL syntax, on enormous 'tables' spread across huge collections of parquet files as if they were one virtual file. I believe the underlying implication is opportunities to leverage vector instructions on parquet. It's very "handy".
getnormality 22 hours ago
Everything is special about DuckDB. Pandas is way, way behind the state of the art in tabular data analysis.
serjester 4 hours ago
Ask anyone that's just starting out with geopandas about their experience, and I'd be shocked if anyone calls it intuitive and straightforward. To geopandas' credit, I think they just inherited many of pandas' faults (why does a user need to understand indexes to do basic operations, no multi-core support, very poor type hinting, etc.).
dbreunig 22 hours ago
Author here: what's special is that you can go from 0 to spatial data incredibly quickly, in the data generalist tool you're already using. It makes the audience of people working with geospatial data much bigger.

(Geopandas is great, too.)

dopidopHN 21 hours ago
I’m very familiar with Postgres and spinning one with postgis seems easy enough. Do I get more with duckdb?

Most of the time I store locations and compute distances to them. Would that be faster to implement with duckdb?

wenc 20 hours ago
Probably no difference for your use-case (ST_Distance). If you already have data in Postgres, you should continue using Postgis.

In my use case, I use DuckDB because of speed at scale. I have 600GBs of lat-longs in Parquet files on disk.

If I wanted to use Postgis, I would have to ingest all this data into Postgres first.

With DuckDB, I can literally drop into a Jupyter notebook, and do this in under 10 seconds, and the results come back in a flash: (no need to ingest any data ahead of time)

  import duckdb
  duckdb.query("INSTALL spatial; LOAD spatial;")
  duckdb.query("select ST_DISTANCE(ST_POINT(lng1, lat1), ST_POINT(lng2, lat2)) dist from '/mydir/*.parquet'")
vasco 15 hours ago
I haven't yet understood this pattern (and I tried using duckdb). Unless you're only ever going to query those files once or twice in your life, importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.

Also as a side note, is everyone just using DuckDB in memory? Because as soon as you want some multiple session stuff I'd assume you'd use DuckDB on top of a local database, so again I don't see the point but I'm sure I'm missing something.

wenc 14 hours ago
> importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.

Usually new data is generated regularly and would require creating a separate ETL process to ingest into Postgres. With DuckDB, no ETL is needed. New Parquet files are just read off the disk.

> Also as a side note, is everyone just using DuckDB in memory?

DuckDB is generally used as a single-user tool, and yes, the in-memory use case is the most common. Not sure about use cases where a single user requires multiple sessions? But DuckDB does have read concurrency, session isolation, etc. I believe write serialization is supported across multiple sessions.

With Parquet files, it's append-only so the "write" use-cases tend to be more limited. Generally another process generates those Parquet files. DuckDB just works with them.
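
For example, a view over a glob picks up newly landed files automatically, which is most of what an ingest job would otherwise be doing (path and column are made up):

  CREATE VIEW trips AS SELECT * FROM read_parquet('/data/trips/*.parquet');
  -- the glob is re-expanded at query time, so files added since the last run just show up
  SELECT count(*) FROM trips WHERE trip_date = current_date;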

indeyets 12 hours ago
> Usually new data is generated regularly

This part was not obvious. In a lot of cases geodata is mostly stable and reads/searches dominate over appends. And that’s why we keep this in DB (usually postgis, yes).

So DuckDB is optimised for a very different use case, and that's not always obvious when it's mentioned.

wenc 5 hours ago
This is the more trivial use case (static data, heavy reads), and DuckDB is absolutely optimized for it.

DuckDB also provides a vectorized, parallelized engine. When I run a query all of my 32 cores light up on htop.

simlevesque 10 hours ago
But DuckDB works just as well with static data.
lugarlugarlugar 14 hours ago
I think the main point is not having to store a duplicate of the 600GB of input data.
touisteur 10 hours ago
And now I'm curious whether there's a way to actually index external files (make these queries over 600GB faster) and have this index (or many indices) be persistent. I might have missed that when I looked at the docs...
wenc 2 hours ago
If the data is in Parquet, it's already indexed in a sense (via row-group statistics). No further indexing necessary.

If they are stored in DuckDB’s native format (which I don’t use), it supports some state of the art indices.

https://duckdb.org/docs/stable/sql/indexes.html

I find Parquet plenty fast though.

joshvm 22 hours ago
I haven't used duckDB but the real comparison is presumably postgis? Which is also absent from the discussion, but I think what the author alludes to.

I have no major qualm with pandas and geopandas. However I use it when it's the only practical solution, not because I enjoy using it as a library. It sounds like pandas (or similar) vs a database?

stevage 20 hours ago
Yeah, PostGIS is readily available, and postgres is much more widely used than DuckDB. Either I don't understand OP's argument for why this is so important or I just don't buy it.

If you're using JavaScript you install Turf. The concept that you can readily install spatial libraries is hardly earth shattering.

tom_m 19 hours ago
Convenience will always be a personal preference.
tsss 22 hours ago
For one it doesn't have the god awful pandas API.
hokusad 8 hours ago
Yeah, that's one of my favourite aspects to it
tmpz22 22 hours ago
I've been researching DuckDB - while it has many technical merits I think the main argument will be ease of use. It has a lot of the operational advantages of sqlite paired with strong extensibility and good succinct documentation.

Folks who have been doing DevOps work are exasperated with crummy SaaS vendors or antiquated OSS options that have a high setup cost. DuckDB is just a mature project that offers an alternative, hence an easy fan favorite among hobbyists (I imagine at scale the opportunity costs change and it becomes less attractive).

wenc 20 hours ago
How is the adoption among DevOps folks?

I'm still getting feedback that many devs are not too comfortable with reading and writing SQL. They learned simple SELECT statements in school, but get confused by JOINs and GROUP BYs.

edoceo 20 hours ago
Random voice here: they should get better at SQL. Not 9 joins and GROUP BY and HAVING and other magic. But two joins and GROUP BY for sure. If one already gets 3NF, then joins and the rest are a quick (2-week) learn.
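
The bar really is about this much (toy tables, obviously):

  SELECT r.name AS region, count(*) AS orders, sum(o.total) AS revenue
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  JOIN regions r ON r.id = c.region_id
  GROUP BY r.name
  ORDER BY revenue DESC;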

I'd pick that before traveling the DuckDB path.

Yeroc 19 hours ago
Since when is strong SQL knowledge not a core skill for developers? I suppose it's the rise of frontend vs backend specialization that is the cause?
wenc 14 hours ago
I don't know if it's ever been a core skill. You can ask any dev. Many tell me SQL is not something they use every day, so even if they learned it, they usually forget it in no time.

SQL is burned into my muscle memory because I work on ML models with large datasets. It took me a lot of trial and error to get decent at SQL. I imagine most devs just don't have the reps because their work rarely exposes them to SQL.

vasco 15 hours ago
There's no point in learning much deeper SQL anymore; AI assistants have largely solved SQL querying. Just ask for what you want with natural language.
wenc 14 hours ago
I don't quite agree.

SQL queries are one area where correctness matters a lot, because downstream applications rely on them to be correct. If you get a query wrong, especially in an ETL process, you can generate a lot of garbage data for a very long time (I'm speaking from lived experience). It might take a long time to correct (via backfill) and sometimes the original data might no longer be available.

I use LLMs to generate ideas for writing complex SQL, but I always need to understand the query 100% first before I deploy. Mistakes are costly, and sometimes irreversible. I need to trust but verify, which requires deep SQL knowledge.

To write correct SQL, you not only have to know the syntax, but also the schema (easily provided to an LLM) and the expected data values (you can provide a sampling to an LLM, which helps, but domain knowledge helps even more). There are lots of surprising things in data that only reveal themselves via visual inspection. (Though I think with MCP, an LLM will one day be able to inspect the data itself by sampling the database.)

vasco 10 hours ago
Testing SQL for correctness is beside the point of how the SQL was generated. It's important, of course, to review and test anything. My point is that SQL writing with the current state-of-the-art assistants is already good enough that it isn't worth spending much time hand-crafting complex queries.
usr1106 14 hours ago
While AI assistants can give ideas on how to address something, I'd strongly contradict the statement "there is no point in learning". You should understand the AI suggestion because it could also be very bad or work incorrectly. (If it hallucinated, you'll notice the syntax error; that does indeed not require learning...)
tmpz22 7 hours ago
It’s not a good database for many storage workloads because it only allows a single process to write to it. For example, you wouldn’t want 20 app servers writing their logs or metrics or traces to a single DuckDB instance. You would instead have them write rotating log files that are somehow aggregated and then queried through DuckDB via something like Grafana.

So the operational savings are more for using it as a lightweight data science tool that excels in specific workloads like geospatial. Wherever you need to do those computations more locally, it will excel.

I don’t really think SQL is a mandated devops skill, though a basic understanding of databases certainly is. Between online content and LLMs, as long as you can iteratively improve a query part by part through EXPLAIN ANALYZE etc., it'll be sufficient.

wenc 3 hours ago
DuckDB can be thought of as an analytics engine rather than storage engine. It’s OLAP rather than OLTP.
jparishy 22 hours ago
I work on geospatial apps and the software I think I am most excited about is https://felt.com/. I want to see them expand their tooling such that maps and data source authentication/authorization was controllable by the developer, to enable tenant isolation with proprietary data access. They could really disrupt how geospatial tech gets integrated into consumer apps.

This article doesn't acknowledge how niche this stuff is, and it's a lot of training to get people up to speed on coordinate systems, projections, transformations, etc. I would replace a lot of my custom-built mapping tools with Felt if it were possible, so I could focus on our core geospatial processes and not the code to display and play with it in the browser, which is almost as big if not bigger in terms of LOC to maintain.

As mentioned by another commenter, this DuckDB DX as described is basically the same as PostGIS too.

dbreunig 22 hours ago
Author here: the beauty of DuckDB spatial is that the projections and CRS options are hidden until you need them. For 90% of geospatial data usage people don't and shouldn't need to know about projections or CRS.

Yes, there are so many great tools to handle the complexity for the capital-G Geospatial work.

I love Felt too! Sam and team have built a great platform. But lots of times a map isn't needed; an analyst just needs it as a column.

PostGIS is also excellent! But having to start up a database server to work with data doesn't lend itself to casual usage.

The beauty of DuckDB is that it's there in a moment and in reach for data generalists.

korkoros 21 hours ago
My experience has been that data generalists should stay away from geospatial analysis precisely because they lack a full appreciation of the importance of spatial references. I've seen people fail at this task in so many ways. From "I don't need a library to reproject, I'll just use a haversine function" to "I'll just do a spatial join of these address points in WGS84 to these parcels in NAD27" to "these North Korean missiles aren't a threat because according to this map using a Mercator projection, we are out of range."

DuckDB is great, but the fact that it makes it easier for data generalists to make mistakes with geospatial data is a mark against it, not in its favor.
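
(For the record, the fix for the parcel example is one explicit transform before the join; a PostGIS sketch with invented table and column names:)

  SELECT pt.id, pcl.parcel_id
  FROM address_points pt
  JOIN parcels pcl
    ON ST_Contains(pcl.geom, ST_Transform(pt.geom, ST_SRID(pcl.geom)));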

jparishy 22 hours ago
I think we're mostly making the same point about complexity, ya.

To me, I think it's mostly a frontend problem stopping the spread of mapping in consumer apps. Backend geo is easy tbh. There is so much good, free tooling. Mapping frontend is hell and there is no good off the shelf solution I've seen. Some too low level, some too high level. I think we need a GIS-lite that is embeddable to hide the complexity and let app developers focus on their value add, and not paying the tax of having frontend developers fix endless issues with maps they don't understand.

edit: to clarify, I think there's a relationship between getting mapping valued by leadership such that the geo work can be even be done by analysts, and having more mapping tools exist in frontend apps such that those leaders see them and understand why geo matters. it needs to be more than just markers on the map, with broad exposure. hence my focus on frontend web. sorry if that felt disjointed

dbreunig 21 hours ago
Not disjointed at all. That last topic is the big challenge to solve.
jandrewrogers 20 hours ago
> it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc

This can mostly be avoided entirely with a proper spheroidal reference system, computational geometry implementation, and indexing. Most uses of geospatial analytics are not cartographic in nature. The map is at best a presentation layer, it is not the data model, and some don’t use a map at all. Forcing people to learn obscure and esoteric cartographic systems to ask simple intuitive questions about geospatial relationships is a big part of the problem. There is no reason this needs to be part of the learning curve.

I’ve run experiments on unsophisticated users a few times with respect to this. If you give them a fully spheroidal WGS84 implementation for geospatial analytics, it mostly “just works” for them anywhere on the globe and without regard for geospatial extent. Yes, the software implementation is much less trivial but it is qualitatively superior UX because “the world” kind of behaves how people intuit it should without having to know anything about projections, transforms, etc. And to be honest, even if you do know about projections and transforms, the results are still often less than optimal.

The only issue that comes up is that a lot of cartographic visualization toolkits are somewhat broken if you have global data models or a lot of complex geometry. Lots of rendering artifacts. Something else to work on I guess.

maxxen 19 hours ago
I'm inclined to agree, but unfortunately a huge amount of the existing data and processes in this space don't assume a spheroidal earth and come provided with a coordinate reference system. Ultimately there are also some domains where you have data that you explicitly don't want to interpret using spheroidal semantics, e.g. when working with a city plan - in which case the map _is_ the data model, and you definitely want the angles of a triangle to sum up to 180.
jparishy 19 hours ago
Similarly, I see it as inevitable that you will deal with a problem related to these issues, and then have a massive blocker if you aren't already familiar with the details, because debugging hard problems requires more than a cursory understanding. In the alternative, you don't know the problems exist and the code is broken. I don't think it's as simple as abstracting away the entire concept, which is what I would say is too high level, like I mentioned above. I don't know the right answer here honestly; I think it will be disruptive when someone figures it out.
groggo 19 hours ago
> the software implementation is much less trivial

Aren't most geospatial tools just doing simple geometry? And therefore need to work on some sort of projection?

If you can do the math on the spheroidal model, ok you get better results and its easier to intuit like you said, but it's much more complicated math. Can you actually do that today with tools like QGIS and GDAL?

jofer 6 hours ago
For what it's worth, you _can't_ use spherical approaches for most data. They're only used for points, in practice. Your spatial data is inherently stored/generated in ways that don't allow spherical approaches as soon as you start working with polygons, let alone things like rasters.

Yes, spherical representations of polygon data exist, but the data you import has already been "split" and undoing that is often impossible, or at best non-trivial. And then rasters are fundamentally impossible to represent that way.

Analysis uses projections for that reason. Spherical approaches aren't fundamentally "better" for most use cases. They're only strictly better if everything you're working with is a point.

There's more to geospatial than point datasets.

groggo 54 minutes ago
A good point. Certainly for raster analysis it doesn't make sense.

But any type of vector data could be modeled on a sphere, right? Points, shapes, lines. And I say "better" because even the best-suited projection will have some small amount of distortion.

Either way, most things use planar geometry, so projections are necessary, and you need to have some understanding of how all that works.

jandrewrogers 16 hours ago
Many do use simple geometry. This causes endless headaches for people who are not cartographers, they don’t expect that. The good geospatial tools usually support spheroidal models but it is not the default, you have to know to explicitly make sure it uses that (many people assume that is the default).

An additional issue is that the spheroidal implementations have undergone very little optimization, perhaps because they are not the defaults. So when people figure out how to turn them on, performance is suddenly terrible. Now you have people that believe spheroidal implementations are terribly slow, when in reality they just used a pathologically slow implementation. Really good performance-engineered spheroidal implementations are much faster than people assume based on the performance of open source implementations.

Demiurge 18 hours ago
This is not really a problem, unless you're trying to simulate 3D space orbits or physics. The crossover from geo INFORMATION systems to geo simulation systems is a bit rough, but the projections and calculations on projected cartesian space are enough for many typical questions, like distance, area, routing. However, even topology support starts getting specialized, and the use cases are more niche. I think it's asking a bit too much from a database/storage layer to do efficient calculations outside of those supported by GEOS. At this point, you might want to import the relevant data into higher-level applications.
jandrewrogers 16 hours ago
Speaking for myself, I was not referring to any kind of simulation systems. This is a standard requirement of many operational geospatial data models, and there are a lot of these in industry. Anything that works from a projection is a non-starter; this causes demonstrable issues for geospatial analysis at scale or if any kind of precision is required. Efficient calculation just means efficient code; there is nothing preventing this from existing in open source beyond people writing it. Yes, you may be able to get away with it if your data model is small, both geographically and in data size, but that does not describe every company.

It is entirely possible to do this in databases. That is how it is actually done. The limitations of GEOS are not the limitations of software, it is not a particularly sophisticated implementation (even PostGIS doesn’t use it for the important parts last I checked). To some extent you are affirming that there is a lack of ambition in this part of the market in open source.

urschrei 9 hours ago
I wouldn't say it's correct to say that GEOS isn't particularly sophisticated. A lot of (certainly not all) GEOS algorithms are now ported from JTS, the primary author of which is Martin Davis (aka Dr JTS), who works at Crunchy Data, who provide the PostGIS extension. So the chain (again, mostly) goes JTS -> GEOS -> {PostGIS, Shapely} -> … . Martin's work is at the cutting edge of open-source GIS-focused computational geometry, and has been for a long time (of course, industry has its own tools, but that's not what we're talking about).

I can sort of see your point about the merits of global, spheroidal geometry, certainly from a user's perspective. But there's no getting around the fact that the geometry calculations are both slower (I'm tempted to say "inherently"…) and far more complex to implement (just look at how painful it is to write a performant, accurate r- or r*-tree for spherical coordinates) along every dimension. That's not going to change any time soon, so the projection workflow probably isn't going anywhere.

Demiurge 7 hours ago
You're right in that pretty much anything can be done via an API exposed through a database function. However, as they say... if it can be done, does it mean it should? Now, I agree that having more sophisticated multi-dim calculations would be cool, but I've just rarely run into needing or even wanting to do this, over many projects, some involving accurate simulations. In practice, the database has always been for storing and querying data, which can be extremely accurate. I am probably the first person to abuse SQL and I've written some 3D ECEF rotation code in SQL, but it was a hack for a deadline, not because it was the right thing to do. All the projects I've worked on had external models or components that did the "precise work" using complex code that I would never dare to make dependent on any database.

I'm actually curious, speaking for yourself, what kind of analysis you're doing where something like NAD83 or UTM does not give you enough precision? Is this actually "real world" geospatial data? If I have a soil model, I have a very localized analysis, and if I have a global climate model, we're talking kilometers for grid cells. In all these cases, the collected data has built-in geolocation error MUCH greater than most decent projections...

So, what analysis are you doing where you need centimeter precision at global scale of thousands of kilometers? Sounds really interesting. The only time I've seen this, is doing space flight simulations where the error really accumulates into the future.

kashifr 9 hours ago
Have you tried https://geobase.app/ they recently also had a post about duckdb integration: https://geobase.app/blog/duckdb-1-1-3
jparishy 31 minutes ago
I had not, looks pretty cool but solves the inverse of the problem as I see it. I want a backend agnostic frontend toolset that is a GIS that I can customize to my needs. I don't want to implement the tools myself, that's too low level. I don't want the service to manage, control, or own the data, that's too high level. There's a sweet spot I don't think is being hit yet.
stevage 20 hours ago
I was just about to get into Felt then they took away the free tier and made it very expensive.
mtmail 19 hours ago
https://atlas.co/ still has a free tier. Less features I think, depends on your use case of course.
perrygeo 4 hours ago
Why? The article is light on details. Yes, having spatial analysis combined with SQL is awesome and very natural. There's nothing special about 2D geometries that makes them significantly different from floats and strings from an RDBMS perspective - geometry is just another column type, albeit with some special operators and indexes. We've been doing it with PostGIS, Spatialite, etc. for two decades at this point.

What DuckDB brings to the table is cloud-native formats. This is standard geospatial functionality attached to an object store instead of a disk. As such, it doesn't require running a database process - data is always "at rest" and available over HTTP. I'm not downplaying the accomplishment, it's really convenient. But know that this is a repackaging of existing tech to work efficiently within a cloud IO environment. If anything, the major innovation of DuckDB is in data management not geospatial per se.

mpalmer 4 hours ago
The argument seemed pretty clear to me; more accessible tooling means more users and contributors. But as an apparent SME I could see how you'd feel clickbaited by the title.
perrygeo 3 hours ago
Mainly a point of clarification - I don't think DuckDB represents anything new in geospatial. It represents a new paradigm for data management and data architecture. Understanding the difference isn't optional.

DuckDB handles floating point numbers too - is DuckDB the most important thing in floating point data? Of course not, the data types and operators haven't changed. The underlying compute environment has. That's where the innovation is. I'd simply appreciate if technical writers took two seconds to make this vital distinction - being precise isn't hard and I don't know what we gain by publishing intentionally muddled technical articles.

vincnetas 1 hour ago
DBeaver (db client) has built in support for displaying geo data. In my case postgis results. I see that duckdb spatial functions are almost identical to postgis ones.

https://dbeaver.com/docs/dbeaver/Working-with-Spatial-GIS-da...

wodenokoto 23 hours ago
I’m not sure I agree that “install geospatial” is a game changer in simplicity compared to “pip install geopandas”.

They are both one line.

maxxen 22 hours ago
I think a big part is that DuckDB's spatial extension doesn't have any transitive dependencies (except libc). It statically packages the standard suite of foss gis tools (including a whole database of coordinate systems) for multiple platforms (including WASM) and provides a unified SQL interface to it all.

(Disclaimer, I work on duckdb-spatial @duckdblabs)

jessekv 8 hours ago
I tried out duckdb-spatial for a hobby project and my takeaway is the main thing it offers over geopandas is performance in batch processing.

If I can reduce my spatial analysis to SQL and optimize that SQL a little bit, duckdb will happily saturate my CPUs and memory to process things fast in a way that a Python script struggles to do.

Yes, the static linking is nice. It helps with the problem of reproducible environments in Python. But that's not a game changer IMO.

carlhjerpe 21 hours ago
"Accessibility" is too often dismissed, yes you CAN do things with things, but getting people to do it is a craft and art. This is often where open-ish-core differs from the enterprise version of something too
WD-42 23 hours ago
Is it that much simpler than ‘load extension postgis’? I know geos and gdal have always kinda been a pain, but I feel like docker has abstracted it all away anyway. ‘docker pull postgis’ is pretty easy, granted I’m not familiar with what else duckdb offers.
dbreunig 22 hours ago
Yes. The difference between provisioning a server and running 'install spatial' in a CLI is night and day.

Docker has been a big improvement (when I was first learning PostGIS, the amount of time I had to hunt for proj directories or compile software just to install the plugin was a major hurdle), but it's many steps away from:

```
$ duckdb
D install spatial;
```

Demiurge 20 hours ago
What do you mean by "provisioning a server"? That's a strange requirement. You can install Postgis on a macbook in one command, or actually on all 3 major OS's in one command: "brew install postgis", "apt-get install postgresql-postgis, and "choco install postgis-9.3". Does DuckDB not require a "server" or a "computer"? What does Docker have to do with anything? This is a very confusing train of thought.
debugnik 10 hours ago
> Does DuckDB not require a "server"

No, it works like SQLite.

Doctor_Fegg 13 hours ago
PostGIS is included in Postgres.app which is a single executable for Mac. DuckDB appears also to be a single file download for Mac. I’m not sure your “when I was first learning PostGIS” experience reflects the current situation.

https://postgresapp.com/

frainfreeze 21 hours ago
I mean, I like duckdb, but this feels like you're pushing for it. On my system postgis comes from apt install, and it's one command to activate the "plugin". Is the night-and-day part not having to run a random sh script from the internet to install software on my system?
dbreunig 21 hours ago
That’s great! The difference is you’re familiar and know how to do that.

Getting started from 0 with geo can be difficult for those unfamiliar. DuckDB packages everything into one line with one dependency.

tomnipotent 21 hours ago
DuckDB doesn't require a running server. I run duckdb in a terminal, query 10,000 CSV or parquet files and run SQL on them while joining to data hosted in sqlite, a separate duckdb file using its native format, or even Postgres.
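
Something like this, for the curious (connection strings and file names are placeholders; the sqlite and postgres extensions need to be installed once):

  INSTALL sqlite; INSTALL postgres;
  ATTACH 'legacy.db' AS legacy (TYPE sqlite);
  ATTACH 'dbname=analytics host=localhost' AS pg (TYPE postgres);

  -- join parquet on disk against tables living in sqlite and Postgres
  SELECT e.user_id, u.plan, count(*) AS events
  FROM read_parquet('events/*.parquet') e
  JOIN legacy.users u ON u.id = e.user_id
  JOIN pg.public.subscriptions s ON s.user_id = e.user_id
  GROUP BY e.user_id, u.plan;
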
frainfreeze 22 hours ago
It is not simpler. I use it with testcontainers in the notebooks usually https://testcontainers-python.readthedocs.io/en/latest/
serjester 4 hours ago
I think this is part of a broader trend of geospatial data just becoming easier to work with. DuckDB is great for quick ad hoc stuff, but I find polars to be easier to maintain. Personally, I'm really excited for polars to (eventually) add true geospatial support. In the meantime, creating a custom h3 plugin only took a couple of days and it simplified massive parts of our old geopandas / duckdb code. The faster we can completely get rid of geopandas, the better.

[1] https://github.com/Filimoa/polars-h3

mettamage 4 hours ago
Is it possible with polars to store data more efficiently than a database though?

I work with polars, but I haven't delved too deep into the performance characteristics versus postgres or anything like that.

fifilura 4 hours ago
I have been using Trino/AWS Athena for some geospatial work.

Its API is not very well covered; some parts are still missing, but it must just be a matter of time.

Where it shines is when you need to do an O(n2) or O(nm) type of calculation. Then those 100s of free CPU cores really come in handy! And the end result is pretty often a dollar for CPU-days worth of computation.

Examples of O(nm) calculations are things like finding the closest road segment inside a tile (or more likely a tile and its surrounding tiles) for each point in a list.

elchief 2 hours ago
I wish to god they would allow 2 connections. One read-only for my BI tool, and one read-write for dbt/sqlmesh
kianN 2 hours ago
I think using a pyarrow dataset as an intermediary would allow for a zero copy read/write from one connection into a second connection.
twelvechairs 23 hours ago
DuckDB is a great thing for geospatial, but most important of the past decade? There are so many tools in different categories that it wouldn't come near the top for me. Some might be QGIS, PostGIS (still the standard), ArcGIS Online (still the standard), JS mapping tools like mapbox (I prefer deckgl), new data types like COG, geopackage and geoparquet, photogrammetry tools, 3d tiles, core libraries like gdal and now pdal, shapely, etc.
dbreunig 22 hours ago
Most of those tools came out circa ~2000.

Yeah, I feel old.

twelvechairs 12 hours ago
It clearly says "most important of the past 10 years", not "most important that has been invented in the past 10 years". Even taking your definition, that would narrow the list down by maybe half, and you should probably know that.
oreilles 9 hours ago
Chiming in to promote a similar project, a geospatial extension for Polars [1] I'm working on. It's not stable yet (albeit pretty close to it), but it's already pretty feature-complete (it uses GEOS and PROJ as a backend, so it has parity with GeoPandas).

[1] https://github.com/oreilles/polars-st/

willtemperley 10 hours ago
I have some concerns regarding the licensing of DuckDB and GEOS, which DuckDB spatial depends on. The former is MIT-licensed and the latter LGPL 2.1.

This leads to some complex situations where some builds would contravene LGPL 2.1, e.g. static linking with a closed-source application.

quasarj 3 hours ago
I've been out of the loop on DuckDB, where can I get a real overview of what the excitement is about? It just looks like a new sqlite from the CLI...
Groxx 3 hours ago
It's essentially a column-oriented sqlite, so it's much faster for some kinds of queries. There's a big market for that category of queries, and nobody wants to run their own hadoop cluster to do single-machine-scale stuff.
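
The kind of query it shines at is a scan-and-aggregate that only touches a couple of columns of a wide table, e.g. (file name made up):

  SELECT store_id, avg(amount) AS avg_sale, count(*) AS n
  FROM 'sales.parquet'
  GROUP BY store_id;
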
kriro 10 hours ago
That's a pretty grandiose statement and frankly the kind of advertisement I'm not a fan of at all.

If you want to import something and work with it GeoPandas exists. If you want something integrated with a SQL database, PostGIS exists.

On the application side of the spectrum, GRASS GIS, QGIS etc. say hi. They are being used in the agriculture industry and by government agencies (at least I know that's the case in Germany and Brazil).

aynyc 8 hours ago
How big are the data sets? I've been trying to get duckdb to work in our company on financial transactions and reporting data. The dataset is around 500GB CSV in S3 and duckdb chokes on it.
snake_doc 7 hours ago
Are you querying from an EC2 instance close to the S3 data? Are the CSVs partitioned into separate files? Does the machine have 500GB of memory? It’s not always duckdb fault when there can be a clear I/O bottleneck…
aynyc 5 hours ago
No, the EC2 instance doesn't have 500GB of memory. Does DuckDB require that? I actually downloaded the data from S3 to local EBS and it still choked.
higeorge13 6 hours ago
Could you test with clickhouse-local? It always works better for me.
aynyc 5 hours ago
No, clickhouse is not considered for some other reason. But I think I might revisit it sometime in the future.
nojito 6 hours ago
CSVs are a poor format to access from S3.

Should convert them to parquet then access and analytics becomes cheap and fast.
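
DuckDB itself can do that conversion in one pass; a rough sketch (bucket paths and the partition column are invented, and the S3 bits assume httpfs and credentials are already configured):

  COPY (SELECT * FROM read_csv_auto('s3://bucket/transactions/*.csv'))
  TO 's3://bucket/transactions_parquet'
  (FORMAT parquet, PARTITION_BY (trade_date));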

aynyc 5 hours ago
I agree. That's how our data is produced. We constantly generate real-time data into CSV. As far as I can tell, I can't append to a parquet file.
feverzsj 15 hours ago
If you are doing lots of spatial index queries, it's actually much slower than SpatiaLite, because DuckDB uses column-wise storage.
isuckatcoding 5 hours ago
I was expecting example queries but all I got was how to install a package :(
patja 22 hours ago
SQL Server has geospatial capabilities without any extensions or add-ons. I've been happily using geospatial datatypes on the free Express version for years, probably well over a decade.
fidotron 22 hours ago
Honestly, I think it's actually https://www.uber.com/en-CA/blog/h3/
sroerick 4 hours ago
Could you elaborate on this? I experimented with h3 a bit for queries but I have never used it in production. Obviously, it's very effective for Uber but I wonder if you have any other experience with it
jsemrau 19 hours ago
Makes one wonder if the YOLO algorithm would work better with hexagons.

"Hexagons were an important choice because people in a city are often in motion, and hexagons minimize the quantization error introduced when users move through a city. Hexagons also allow us to approximate radiuses easily, such as in this example using Elasticsearch."

[Edit]Maybe https://www.researchgate.net/publication/372766828_YOLOv8_fo...

badmonster 22 hours ago
How might embedding spatial capabilities directly into general-purpose data tools like DuckDB reshape who participates in geospatial analysis—and what kinds of problems they choose to solve?
bingaweek 22 hours ago
We need a "come on" clause for these absurd headlines. Come on.
WD-42 21 hours ago
Uh oh, another pushover-licensed database. I wonder when it will begin its own Redis saga.
jandrewrogers 20 hours ago
I think geospatial analytics is important (because of course I would), but to be frank, geospatial software has been stagnant for a long time. Every new thing is just a fresh spin on the same stagnant things we already have. The article more or less says exactly this?

For geospatial analysis, the most important thing that could happen in software would be no longer treating it, either explicitly or implicitly, as having anything to do with cartography. Many use cases are not remotely map-driven but the tools require users to force everything through the lens of map-making.

dbreunig 19 hours ago
I was struck by this as people suggest alternatives that refute the headline (QGIS, PostGIS, GDAL, etc): nearly every one emerged in the early 2000s.

Strongly agree with your sentiment around maps: most people can’t read them, they color the entire workflow and make it more complex, and (imo) lead to a general undervaluing of the geospatial field. Getting the data into columns means it’s usable by every department.

azinman2 20 hours ago
Can you give some examples?
jandrewrogers 19 hours ago
Of the stagnation? I’ve been doing geospatial analytics for over 20 years and shockingly little has changed, both in features and capability. Given the amount of time that has passed and the vastly expanded scope of the geospatial data models people are working with today, I think most people would expect more to have changed.
azinman2 17 hours ago
No I meant to this: “Many use cases are not remotely map-driven but the tools require users to force everything through the lens of map-making.”
cyanydeez 22 hours ago
No. QGIS is.

Good god.

dbreunig 22 hours ago
Author here.

QGIS is amazing. It's really great. It also came out in 2002, so I think the headline is safe.

cyanydeez 20 hours ago
Nope, it's constantly being improved and still wins the decade. Do you disqualify it because it existed? Re-read your headline, it's definitely not qualifying what you think it's qualifying.
jeffbee 23 hours ago
Ehh I tried to do some spatial stuff but there just wasn't enough there, or I could not figure out how to use it. Loading spatial information into ipython and fiddling with it is well-traveled and it doesn't seem to me that SQL is an inherently lower hurdle for the user.
fithisux 1 day ago
I agree