I'm a big fan of DuckDB and I do geospatial analysis, mostly around partitioning geographies (into Uber H3 hexagons), calculating Haversine distances, calculating areas of geometries, figuring out which geometry a point falls in, etc. Many of these features have existed in some form or other in geopandas or postgis, so DuckDB's spatial extensions bring nothing new.
But what DuckDB as an engine does is it lets me work directly on parquet/geoparquet files at scale (vectorized and parallelized) on my local desktop. It beats geopandas in that respect. It's a quality of life improvement to say the least.
DuckDB also has an extension architecture that admits more exotic geospatial features like Hilbert curves, Uber H3 support.
I totally agree with this. DuckDB for me was a huge QoL improvement just working with random datasets. I found it much easier to explore datasets using DuckDB rather than Pandas, Postgres or Databricks.
The spatial features were just barely out when I was last doing a lot of heavy geospatial work, but even then they were very nice.
An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.
pandas is very nice and was my bread and butter for a long time, but I frequently ran into memory issues and problems at scale with pandas, which I would never hit with polars or duckdb. I'm not sure if this holds true today as I know there's been updates, but it was certainly a problem then. Using geopandas ran into the same issues.
Just using GDAL and other libraries out of the box is frankly not a great experience. If you have a QGIS (another wonderful tool) workflow, it's frustrating to be dropping into Jupyter notebooks to do translations, but that seemed to be the best option.
In general, it just feels like geospatial analysis is about 10 years behind regular data analysis. Shapefiles are common because of ESRI dominance, but frankly not a great format. PostGIS is great, geopandas is great, but there's a lot more things in the data ecosystem than just Postgres and pandas. PowerBI barely had geospatial support a couple years ago. I think PowerBI Shapemaps exclusively used TopoJSON?
All of this is to say, DuckDB geospatial is very cool and helpful.
> An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.
What was shockingly useful in PowerBI compared to DuckDB?
Why do you use haver-sine over geodesic or reprojection?
I’ve been doing the reprojection thing, projecting coordinates to a “local” CRS, for previous projects mainly because that’s what geopandas recommend and is built around, but I am reaching a stage where I’d like to calculate distance for objects all over the globe, and I’m genuinely interested to learn what’s a good choice here.
Just an app dev, not a geospatial expert, but reprojection always seemed like something a library should handle under the hood unless one has specific needs. I'm used to the ergonomics / moron-proofing of something like Postgis' `ST_Distance(geography point1, geography point2)` and it gives you the the right answer in meters. You can easily switch to spherical or Cartesian distances if you need distance calculations to go faster. `ST_Area(geography geog)` and it gives you the size of your shape in square meters wherever on the planet.
> Prior to this, getting up and running from a cold-start might’ve required installing or even compiling severall OSS packages, carefully noting path locations, standing up a specialized database… Enough work that a data generalist might not have bothered, or their IT department might not have supported it.
I've been able to "CREATE EXTENSION postgis;" for more than a decade. There have been spatial extensions for PG, MySQL, Oracle, MS SQL Server, and SQLite for a long time. DuckDB doesn't make any material difference in how easy it is to install.
That requires data to already be in Postgres, otherwise you have to ETL data into it first.
DuckDB on the other hand works with data as-is (Parquet, TSV, sqlite, postgres... whether on disk, S3, etc.) with requiring an ETL step (though if the data isn't already in a columnar format, things are gonna be slow... but it will still work).
I work with Parquet data directly with no ETL step. I can literally drop into Jupyter or a Python REPL and duckdb.query("from '*.parquet'")
Correct me if I'm wrong, but I don't think that's possible with Postgis. (even pg_parquet requires copying? [1])
Yeah, if you want to work with GeoParquet, and you want to keep your data in that format. I can see how that's easer to use your example. That's not what a lot of geospatial data is in. You might have shapefiles, geopackages, geojsons, who knows? There is a lot of software, from QGIS to ESRI to work with different formats to solve different problems. I don't think GeoParquet, even though it might be the fastest geospatial vector data format right now, is that common, and the article did not claim that either. So, given an average user trying to answer some GIS question, some ETL is pretty much a given, on average. And given that, installing PostGIS and installing DuckDB, both require some ETL, and learning some query and analytics language. DuckDB might be an improvement, but it's certainly not as much of a leap as quote is making it out to be.
I'll give DuckDB a shot and compare to postgis. Eventually, I think the issue for me would be that I use all the vector outputs via apps connecting to the same database.
Yeah, just an example of a QoL issue with DuckDB: even though it can glob files in other cases, the way it passes parameters to GDAL means that globs are taken literally instead of expanded. So I can't query a directory with thirty million geojson files. This is not a problem in geopandas because ipython, being a full interactive development environment, allows me to produce the glob any way I choose.
I think this is a fundamental problem with the SQL pattern. You can try to make things just work, but when they fall then what?
I think this is just cause it hasn't been implemented in spatial yet. DuckDB is currently going through a pretty big refactor of the way we glob/scan/union multiple files with all the recent focus on data lake formats, but my plan is to get to it in spatial after next release when that part of the code has stabilized a bit.
SQL is a DSL and yes, all Domain Specific Languages will only enable what the engine parsing the DSL supports.
But all SQL database I'm aware of let you write custom extensions, which are exactly that: they extend the base functionality of the database with new paradigms. I.e postgis enabling geospatial in postgres or the extensions that enable fuzzy-matching/searching.
And as SQL is pretty much a turing-complete DSL, there is very little you can't do with it, even if the syntax might not agree with everyone
The original article feels a tremendous amount like another piece of DuckDB marketing, from the breathless admiration to the baseless claims like the title
“import geopandas” also exists and has for some time. Snark aside, WHAT is special about duckDB? I wish the author had actually shown some practical examples so I could understand their claims better.
I replied to another comment, but I think a big part is that duckdbs spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.E there are no transitive dependencies except libc.
Yes, DuckDB does a whole lot more, vectorized larger-than-memory execution, columnar compressed storage and a ecosystem of other extensions that make it more than the sum of its parts. But while Ive been working hard on making the spatial extension more performant and more broadly useful (I designdd a new geometry engine this year, and spatial join optimization just got merged on the dev-branch), the fact that you can e.g. convert too and from a myriad of different geospatial formats by utilizing GDAL, transforming through SQL, or pulling down the latest overture dump without having the whole workflow break just cause you updated QGIS has probably been the main killer feature for a lot of the early adopters.
(Discmaimer, I work on duckdb-spatial @ duckdblabs)
I'm not the OP, but thank you for such a detailed answer. The integration and reduced barriers to entry you mention mirror my own experiences with tooling in another area, and your explanation made parallels clear.
As someone unfamiliar with DuckDB but at least somewhat with geospatial tools (it's been a few years, though):
Dang - see, now that is seriously cool. The whole ETL shebang always was the biggest hassle, even with serious commercial tools, and the idea of a stable, all-in-one, ready-to-go layer is incredibly appealing.
It's just something the writer of the article should probably have at least mentioned when going full hyperbole with the title (hey, after reading this, it might actually be justified! :) ).
Rereading the article that focuses on the one thing that isn't a standout (the installation itself), though, I can't help but chuckle and think "worst sales pitch ever". ;)
Is there any strong reason to use GeoParquet instead of straight up parquet if all I'm interested in is storing and operating on lat/lons?
I'm curious if it compresses them better or something like that. I see lots of people online saying it compresses well (but mostly compared to .shp or similar) but normal parquet (.gz.parquet or .snappy.parquet) already does that really well. So it's not clear to me if I should spend time investigating it...
I mostly process normal parquet with spark and sometimes clickhouse right now.
Based on my reading of the GeoParquet spec, the main difference is that geometries are stored as WKB using Parquet's byte array type. Byte arrays can be delta-encoded. There is also some additional metadata stored like CRS and a bounding box.
When using EPSG:4326 lat/lons, I don't think GeoParquet would give you any benefits over just having separate columns (this is what I typically do, and it's plenty fast).
If you are using range requests to fetch only parts of parquet files at a time, you could potentially sort your data using a hilbert curve, which could limit the number of row groups that need to be fetched to execute a query.
duckdb has parquet support and can operate, in SQL syntax, on enormous 'tables' spread across huge collections of parquet files as if one virtual file. i believe the underlying implication is opportunities to leverage vector instructions on parquet. it's very "handy".
Ask anyone that's just starting out with geo pandas about their experience, and I'd be shocked if anyone calls it intuitive and straightforward. To geopandas credit, I think they just inherited many of Pandas' faults (why does a user need to understand indexes to do basic operations, no multi core support, very poor type hinting, etc).
Author here: what's special is that you can go from 0 to spatial data incredibly quickly, in the data generalist tool you're already using. It makes the audience of people working with geospatial data much bigger.
Probably no difference for your use-case (ST_Distance). If you already have data in Postgres, you should continue using Postgis.
In my use case, I use DuckDB because of speed at scale. I have 600GBs of lat-longs in Parquet files on disk.
If I wanted to use Postgis, I would have to ingest all this data into Postgres first.
With DuckDB, I can literally drop into a Jupyter notebook, and do this in under 10 seconds, and the results come back in a flash: (no need to ingest any data ahead of time)
I haven't yet understood this pattern (and I tried using duckdb). Unless you're only ever going to query those files once or twice in your life, importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.
Also as a side note, is everyone just using DuckDB in memory? Because as soon as you want some multiple session stuff I'd assume you'd use DuckDB on top of a local database, so again I don't see the point but I'm sure I'm missing something.
> importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.
Usually new data is generated regularly and would require creating a separate ETL process to ingest into Postgres. With DuckDB, no ETL is needed. New Parquet files are just read off the disk.
> Also as a side note, is everyone just using DuckDB in memory?
DuckDB is generally used as a single-user, and yes in-memory use case is most common. Not sure about use cases where a single user requires multiple sessions? But DuckDB does have read concurrency, session isolation etc. I believe write serialization is supported in multiple sessions.
With Parquet files, it's append-only so the "write" use-cases tend to be more limited. Generally another process generates those Parquet files. DuckDB just works with them.
This part was not obvious. In a lot of cases geodata is mostly stable and reads/searches dominate over appends. And that’s why we keep this in DB (usually postgis, yes).
So DuckDB is optimised for very different use case and it is not always obvious when it’s mentioned
And now I'm curious whether there's a way to actually index external files (make these queries over 600GB faster) and have this index (or many indices) be persistent. I might have missed that when I looked at the docs...
I haven't used duckDB but the real comparison is presumably postgis? Which is also absent from the discussion, but I think what the author alludes to.
I have no major qualm with pandas and geopandas. However I use it when it's the only practical solution, not because I enjoy using it as a library. It sounds like pandas (or similar) vs a database?
Yeah, PostGIS is readily available, and postgres is much more widely used than DuckDB. Either I don't understand OP's argument for why this is so important or I just don't buy it.
If you're using JavaScript you install Turf. The concept that you can readily install spatial libraries is hardly earth shattering.
I've been researching DuckDB - while it has many technical merits I think the main argument will be ease of use. It has a lot of the operational advantages of sqlite paired with strong extensibility and good succinct documentation.
Folks who have been doing DevOps work are exasperated with crummy SaaS vendors or antiquated OSS options that have a high setup cost. DuckDB is just a mature project that offers an alternative, hence an easy fan favorite among hobbyists (I imagine at scale the opportunity costs change and it becomes less attractive).
I'm still getting feedback that many devs are not too comfortable with reading and writing SQL. They learned simple SELECT statements in school, but get confused by JOINs and GROUP BYs.
Random voice here: they should get better at SQL. Not 9 joins and GROUP BY and HAVING and other magic. But two joins and GROUP for sure. If one gets the 3NF already then join and others are a quick (2 week) learn.
I don't know if it's ever been a core skill. You can ask any dev. Many tell me SQL is not something they use every day, so even if they learned it, they usually forget it in no time.
SQL is burned into my muscle memory because I work on ML models with large datasets. It took me a lot of trial and error to get decent at SQL. I imagine most devs just don't have the reps because their work rarely exposes them to SQL.
There's no point in learning any much deeper SQL anymore, AI assistants have largely solved SQL querying. Just ask for what you want with natural language.
SQL queries are one area where correctness matters a lot, because downstream applications rely on them to be correct. If you get a query wrong, especially in an ETL process, you can generate a lots of garbage data for a very long time (I'm speaking from lived experience). It might take a long time to correct (via backfill) and sometimes the original data might no longer be available.
I use LLMs to generate ideas for writing complex SQL, but I always need to understand the query 100% first before I deploy. Mistakes are costly, and sometimes irreversible. I need to trust but verify, which requires deep SQL knowledge.
To write correct SQL, you not only have to know the syntax, but also the schema (easily provided to an LLM) and the expected data values (you can provide a sampling to an LLM which helps, but domain knowledge helps even more). There are lots of surprising things in data that only reveal themselves via visual inspection. (though I think with MCO, an LLM will one day be able to inspect the data itself by sampling the database)
Testing SQL for correctness is besides the point of how the SQL was generated. It's important of course to review and test anything. My point is SQL writing is particularly good already with current state of the art of assistants for it to be worth it to spend much time hand crafting complex queries.
While AI assistants can give ideas how to address something I'd strongly contradict the statement "there is no point in learning". You should understand the AI suggestion because it could also be very bad or working incorrectly. (If it hallunicated, you'll note the syntax error, that does indeed not require learning...)
It’s not a good database for many storage workloads because it only allows a single process to write to it. For example you wouldn’t want 20 app servers writing their logs or metrics or traces to a single duck db instance. You would instead have them write rotating log files that are somehow aggregated then queried through duck db via something like grafana.
So the operational savings are more for using it as a light weight data science tool that excels in specific workloads like geospacial. Wherever you need to do those computations more locally it will excel.
I don’t really think sql is a mandated devops skill, though a basic understanding of databases certainly is. Between online content, LLMs, as long as you can iteratively improve a query part by part through explain analyze etc it’ll be sufficient.
I work on geospatial apps and the software I think I am most excited about is https://felt.com/. I want to see them expand their tooling such that maps and data source authentication/authorization was controllable by the developer, to enable tenant isolation with proprietary data access. They could really disrupt how geospatial tech gets integrated into consumer apps.
This article doesn't acknowledge how niche this stuff is and it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc. I would replace a lot of my custom built mapping tools with Felt if it were possible, so I could focus on our core geospatial processes and not the code to display and play with it in the browser, which is almost as big if not bigger in terms of LOC to maintain.
As mentioned by another commenter, this DuckDB DX as described is basically the same as PostGIS too.
Author here: the beauty of DuckDB spatial is that the projections and CRS options are hidden until you need them. For 90% of geospatial data usage people don't and shouldn't need to know about projections or CRS.
Yes, there are so many great tools to handle the complexity for the capital-G Geospatial work.
I love Felt too! Sam and team have built a great platform. But lots of times a map isn't needed; an analyst just needs it as a column.
PostGIS is also excellent! But having to start up a database server to work with data doesn't lend itself to casual usage.
The beauty of DuckDB is that it's there in a moment and in reach for data generalists.
My experience has been that data generalists should stay away from geospatial analysis precisely because they lack a full appreciation of the importance of spatial references. I've seen people fail at this task in so many ways. From "I don't need a library to reproject, I'll just use a haversine function" to "I'll just do a spatial join of these address points in WGS84 to these parcels in NAD27" to "these North Korean missiles aren't a threat because according to this map using a Mercator projection, we are out of range."
DuckDB is great, but the fact that it makes it easier for data generalists to make mistakes with geospatial data is mark against it, not in its favor.
I think we're mostly making the same point about complexity, ya.
To me, I think it's mostly a frontend problem stopping the spread of mapping in consumer apps. Backend geo is easy tbh. There is so much good, free tooling. Mapping frontend is hell and there is no good off the shelf solution I've seen. Some too low level, some too high level. I think we need a GIS-lite that is embeddable to hide the complexity and let app developers focus on their value add, and not paying the tax of having frontend developers fix endless issues with maps they don't understand.
edit: to clarify, I think there's a relationship between getting mapping valued by leadership such that the geo work can be even be done by analysts, and having more mapping tools exist in frontend apps such that those leaders see them and understand why geo matters. it needs to be more than just markers on the map, with broad exposure. hence my focus on frontend web. sorry if that felt disjointed
> it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc
This can mostly be avoided entirely with a proper spheroidal reference system, computational geometry implementation, and indexing. Most uses of geospatial analytics are not cartographic in nature. The map is at best a presentation layer, it is not the data model, and some don’t use a map at all. Forcing people to learn obscure and esoteric cartographic systems to ask simple intuitive questions about geospatial relationships is a big part of the problem. There is no reason this needs to be part of the learning curve.
I’ve run experiments on unsophisticated users a few times with respect to this. If you give them a fully spheroidal WGS84 implementation for geospatial analytics, it mostly “just works” for them anywhere on the globe and without regard for geospatial extent. Yes, the software implementation is much less trivial but it is qualitatively superior UX because “the world” kind of behaves how people intuit it should without having to know anything about projections, transforms, etc. And to be honest, even if you do know about projections and transforms, the results are still often less than optimal.
The only issue that comes up is that a lot of cartographic visualization toolkits are somewhat broken if you have global data models or a lot of complex geometry. Lots of rendering artifacts. Something else to work on I guess.
Im inclined to agree, but unfortunately a huge amount of the existing data and processes in this space does not assume a spheroidal earth and come provided with a coordinate reference system. Ultimately there are also some domains where you got data that you explicitly don't want to interpret using spheroidal semantics, e.g. when working with a city plan - in which case the map _is_ the data model, and you definitely want the angles of a triangle to sum up to 180.
Similarly I see it as it an inevitable that you will deal with a problem related to these issues, then have a massive blocker if you aren't already familiar with the details because debugging requires more than a cursory understanding for hard problems. In the alternative, you don't know the problems exist and the code is broken. I don't think it's as simple as abstracting away the entire concept, which is what I would say is too high level like I mentioned above. I don't know the right answer here honestly, I think it will be disruptive when someone figures it out
> the software implementation is much less trivial
Aren't most geospatial tools just doing simple geometry? And therefore need to work on some sort of projection?
If you can do the math on the spheroidal model, ok you get better results and its easier to intuit like you said, but it's much more complicated math. Can you actually do that today with tools like QGIS and GDAL?
For what it's worth, you _can't_ use spherical approaches for most data. They're only used for points, in practice. Your spatial data is inherently stored/generated in ways that don't allow spherical approaches as soon as you start working with polygons, let alone things like rasters.
Yes, spherical representations of polygon data exist, but the data you import has already been "split" and undoing that is often impossible, or at best non-trivial. And then rasters are fundamentally impossible to represent that way.
Analysis uses projections for that reason. Spherical approaches aren't fundamentally "better" for most use cases. They're only strictly better if everything you're working with is a point.
A good point. Certainly for raster analysis it doesn't make sense.
But any type of vector data could be modeled on a sphere, right? Points, shapes, lines. And I saw "better" because even the best suited projection will have some small amount of distortion.
Either way, most things use planer geometry so projections are necessary, and you need to have some understanding of how all that works
Many do use simple geometry. This causes endless headaches for people who are not cartographers, they don’t expect that. The good geospatial tools usually support spheroidal models but it is not the default, you have to know to explicitly make sure it uses that (many people assume that is the default).
An additional issue is that the spheroidal implementations have undergone very little optimization, perhaps because they are not the defaults. So when people figure out how to turn them on, performance is suddenly terrible. Now you have people that believe spheroidal implementations are terribly slow, when in reality they just used a pathologically slow implementation. Really good performance-engineered spheroidal implementations are much faster than people assume based on the performance of open source implementations.
This is not really a problem, unless you’re trying to simulate some 3D space orbits, physics. The crossover from geo INFORMATION systems to geo simulation systems is a bit rough, but the projections and calculations on projected cartesian space are enough for many typical questions, like distance, area, routing. However, even topology support starts getting specialized, and the use cases are more niche. I think it’s asking a bit too much from a database/storage layer to do efficient calculations outside of those supported by GEOS. At this point, you might want to import the relevant data into higher level applications.
Speaking for myself, I was not referring to any kind of simulation systems. This is a standard requirement of many operational geospatial data models, and there are a lot of these in industry. Anything that works from a projection is a non-starter, this causes demonstrable issues for geospatial analysis at scale or if any kind of precision is required. Efficient calculation just means efficient code, there is nothing preventing this from existing in open source beyond people writing it. Yes, you may be able to get away with it if your data model is small, both geographically and data size, but that does not describe every company.
It is entirely possible to do this in databases. That is how it is actually done. The limitations of GEOS are not the limitations of software, it is not a particularly sophisticated implementation (even PostGIS doesn’t use it for the important parts last I checked). To some extent you are affirming that there is a lack of ambition in this part of the market in open source.
I wouldn't say it's correct to say that GEOS isn't particularly sophisticated. A lot of (certainly not all) GEOS algorithms are now ported from JTS, the primary author of which is Martin Davis (aka Dr JTS), who works at Crunchy Data, who provide the PostGIS extension. So the chain (again, mostly) goes JTS -> GEOS -> {PostGIS, Shapely} -> … . Martin's work is at the cutting edge of open-source GIS-focused computational geometry, and has been for a long time (of course, industry has its own tools, but that's not what we're talking about).
I can sort of see your point about the merits of global, spheroidal geometry, certainly from a user's perspective. But there's no getting around the fact that the geometry calculations are both slower (I'm tempted to say "inherently"…) and far more complex to implement (just look at how painful it is to write a performant, accurate r- or r*-tree for spherical coordinates) along every dimension. That's not going to change any time soon, so the projection workflow probably isn't going anywhere.
You're right in that pretty much anything can be done via an API exposed via a database function. However, as they say... if it can be done, does mean it should? Now, I agree that having more sophisticated multi-dim calculations would be cool, but I've just rarely ran into needing or even wanting to do this, over many projects, some involving accurate simulations. In practice, database has always been for storing and querying data, which can be extremely accurate. I am probably the first person to abuse SQL and I've written some 3D ECEF rotation code in SQL, but it was a hack for a deadline, not because it was the right thing to do. All the projects I've worked with, had external models or components that did the "precise work" using complex code that I would never dare to make dependent on any database.
I'm actually curious, speaking for yourself, what kind of analysis you're doing where something like NAD83, or UTM does not give you enough precision? Is this actually "real world" geospatial data? If I have a soil model, I have a very localized analysis, and if I have a global climate model, we're talking kilometers for grid cells. In all these cases, the collected data has built in geolocation error MUCH grater than most decent projections...
So, what analysis are you doing where you need centimeter precision at global scale of thousands of kilometers? Sounds really interesting. The only time I've seen this, is doing space flight simulations where the error really accumulates into the future.
I had not, looks pretty cool but solves the inverse of the problem as I see it. I want a backend agnostic frontend toolset that is a GIS that I can customize to my needs. I don't want to implement the tools myself, that's too low level. I don't want the service to manage, control, or own the data, that's too high level. There's a sweet spot I don't think is being hit yet.
Why? The article is light on details. Yes, having spatial analysis combined with SQL is awesome and very natural. There's nothing special about 2D geometries that makes them significantly different from floats and strings in an RDBMS perspective - geometry is just another column type, albeit with some special operators and indexes. We've been doing it with PostGIS, Spatialite, etc for two decades at this point.
What DuckDB brings to the table is cloud-native formats. This is standard geospatial functionality attached to an object store instead of a disk. As such, it doesn't require running a database process - data is always "at rest" and available over HTTP. I'm not downplaying the accomplishment, it's really convenient. But know that this is a repackaging of existing tech to work efficiently within a cloud IO environment. If anything, the major innovation of DuckDB is in data management not geospatial per se.
The argument seemed pretty clear to me; more accessible tooling means more users and contributors. But as an apparent SME I could see how you'd feel clickbaited by the title.
Mainly a point of clarification - I don't think DuckDB represents anything new in geospatial. It represents a new paradigm for data management and data architecture. Understanding the difference isn't optional.
DuckDB handles floating point numbers too - is DuckDB the most important thing in floating point data? Of course not, the data types and operators haven't changed. The underlying compute environment has. That's where the innovation is. I'd simply appreciate if technical writers took two seconds to make this vital distinction - being precise isn't hard and I don't know what we gain by publishing intentionally muddled technical articles.
DBeaver (db client) has built in support for displaying geo data. In my case postgis results. I see that duckdb spatial functions are almost identical to postgis ones.
I think a big part is that duckdbs spatial extension doesnt have any transitive dependencies (except libc). It statically packages the standard suite of foss gis tools (including a whole database of coordinate systems) for multiple platforms (including WASM) and provides a unified SQL interface to it all.
(Disclaimer, I work on duckdb-spatial @duckdblabs)
I tried out duckdb-spatial for a hobby project and my takeaway is the main thing it offers over geopandas is performance in batch processing.
If I can reduce my spatial analysis to SQL and optimize that SQL a little bit, duckdb will happily saturate my CPUs and memory to process things fast in a way that a Python script struggles to do.
Yes, the static linking is nice. It helps with the problem of reproducible environments in Python. But that's not a game changer IMO.
"Accessibility" is too often dismissed, yes you CAN do things with things, but getting people to do it is a craft and art. This is often where open-ish-core differs from the enterprise version of something too
Is it that much simpler than ‘load extension postgis’? I know geos and gdal have always kinda been a pain, but I feel like docker has abstracted it all away anyway. ‘docker pull postgis’ is pretty easy, granted I’m not familiar with what else duckdb offers.
Yes. The difference between provisioning a server and running 'install spatial' in a CLI is night and day.
Docker has been a big improvement (when I was first learning PostGIS, the amount of time I had to hunt for proj directories or compile software just to install the plugin was a major hurdle), but it's many steps away from:
What do you mean by "provisioning a server"? That's a strange requirement. You can install Postgis on a macbook in one command, or actually on all 3 major OS's in one command: "brew install postgis", "apt-get install postgresql-postgis, and "choco install postgis-9.3". Does DuckDB not require a "server" or a "computer"? What does Docker have to do with anything? This is a very confusing train of thought.
PostGIS is included in Postgres.app which is a single executable for Mac. DuckDB appears also to be a single file download for Mac. I’m not sure your “when I was first learning PostGIS” experience reflects the current situation.
I mean I like duckdb but this feels like you're pushing for it. On my system postgis comes from apt install, and it's one command to activate the "plugin". Is the night and day part not having to run random sh script from the internet to install software on my system?
DuckDB doesn't require a running server. I run duckdb in a terminal, query 10,000 CSV or parquet files and run SQL on them while joining to data hosted in sqlite, a separate duckdb file using its native format, or even Postgres.
I think this is part of a broader trend of geospatial data just becoming easier to work with. DuckDB is great for quick ad hoc stuff, but I find polars to be easier to maintain. Personally, I'm really excited for polars to (eventually) add true geospatial support. In the meantime, creating a custom h3 plugin only took a couple days and it simplified massive parts of our old geo pandas / duckdb code. The faster we can completely get rid of geo pandas, the better.
I have been using Trino/AWS Athena for some geospatial work.
It's API is not very well covered, some parts are still missing, but it must just be a matter of time.
Where it shines is when you need to do an O(n2) or O(nm) type of calculation. Then those 100s of free CPU cores really come in handy! And the end result is pretty often a dollar for CPU-days worth of computation.
Example of O(nm) calculation are things like finding the closest road segment inside a tile (or more likely a tile and it's surrounding tiles), for each point in a list.
DuckDB is a great thing for geospatial but most important of the past decade? There's so many tools in different categories it wouldnt come near top for me. Some might be QGIS, postGIS (still the standard), ArcGIS online (still the standard), JS mapping tools like mapbox (i prefer deckgl), new data types like COG, geopackage and geoparquet, photogrammetry tools, 3d tiles, core libraries like gdal and now pdal, shapely, etc.
It clearly says "most important of the past 10 years" not "most important that has been invented in the past 10 years". Even taking your definition that would narrow down the list like half maybe and you should probably know that
Chiming in to promote a similar project, a geospatial extension for Polars [1] I'm working on. It's not stable yet (abeit pretty close to), but is already pretty feature complete (it uses GEOS and PROJ as a backend, so has parity with GeoPandas).
It's essentially a column-oriented sqlite, so it's much faster for some kinds of queries. There's a big market for that category of queries, and nobody wants to run their own hadoop cluster to do single-machine-scale stuff.
That's a pretty grandiose statement and frankly the kind of advertisement I'm not a fan of at all.
If you want to import something and work with it GeoPandas exists. If you want something integrated with a SQL database, PostGIS exists.
On the application side of the spectrum, GRASS GIS, QGIS etc. say hi. They are being used in the agriculture industry and by government agencies (at least I know that's the case in Germany and Brazil).
How big are the data sets? I've been trying to get duckdb to work in our company on financial transactions and reporting data. The dataset is around 500GB CSV in S3 and duckdb chokes on it.
Are you querying from an EC2 instance close to the S3 data? Are the CSVs partitioned into separate files? Does the machine have 500GB of memory? It’s not always duckdb fault when there can be a clear I/O bottleneck…
SQL Server has geospatial capabilities without any extensions or add-ons. I've been happily using geospatial datatypes on the free Express version for years, probably well over a decade.
Could you elaborate on this? I experimented with h3 a bit for queries but I have never used it in production. Obviously, it's very effective for Uber but I wonder if you have any other experience with it
Makes one wonder if the YOLO algorithm would work better with hexagons.
"Hexagons were an important choice because people in a city are often in motion, and hexagons minimize the quantization error introduced when users move through a city. Hexagons also allow us to approximate radiuses easily, such as in this example using Elasticsearch."
How might embedding spatial capabilities directly into general-purpose data tools like DuckDB reshape who participates in geospatial analysis—and what kinds of problems they choose to solve?
I think geospatial analytics is important (because of course I would), but to be frank geospatial software has been stagnant for a long time. Every new thing is just a fresh spin on the same stagnant things we already have. This more or less says exactly this?
For geospatial analysis, the most important thing that could happen in software would be no longer treating it, either explicitly or implicitly, as having anything to do with cartography. Many use cases are not remotely map-driven but the tools require users to force everything through the lens of map-making.
I was struck by this as people suggest alternatives that refute the headline (QGIS, PostGIS, GDAL, etc): nearly every one emerged in the early 2000s.
Strongly agree with your sentiment around maps: most people can’t read them, they color the entire workflow and make it more complex, and (imo) lead to a general undervaluing of the geospatial field. Getting the data into columns means it’s usable by every department.
Of the stagnation? I’ve been doing geospatial analytics for over 20 years and shockingly little has changed, both in features and capability. Given the amount of time that has passed and the vastly expanded scope of the geospatial data models people are working with today, I think most people would expect more to have changed.
Nope, it's constantly being improved and still wins the decade. Do you disqualify it because it existed? Re-read your headline, it's definitely not qualifying what you think it's qualifying.
Ehh I tried to do some spatial stuff but there just wasn't enough there, or I could not figure out how to use it. Loading spatial information into ipython and fiddling with it is well-traveled and it doesn't seem to me that SQL is an inherently lower hurdle for the user.