Fresh Hacker News | DuckDB is probably the most important geospatial software of the last decade

▲DuckDB is probably the most important geospatial software of the last decade(dbreunig.com)

491 points by dbreunig 243 days ago | 30 comments

▲wenc 243 days ago

I'm a big fan of DuckDB and I do geospatial analysis, mostly around partitioning geographies (into Uber H3 hexagons), calculating Haversine distances, calculating areas of geometries, figuring out which geometry a point falls in, etc. Many of these features have existed in some form or other in geopandas or postgis, so DuckDB's spatial extensions bring nothing new.

But what DuckDB as an engine does is it lets me work directly on parquet/geoparquet files at scale (vectorized and parallelized) on my local desktop. It beats geopandas in that respect. It's a quality of life improvement to say the least.

DuckDB also has an extension architecture that admits more exotic geospatial features like Hilbert curves, Uber H3 support.

https://duckdb.org/docs/stable/extensions/spatial/functions....

https://duckdb.org/community_extensions/extensions/h3.html

▲sroerick 242 days ago

I totally agree with this. DuckDB for me was a huge QoL improvement just working with random datasets. I found it much easier to explore datasets using DuckDB rather than Pandas, Postgres or Databricks.

The spatial features were just barely out when I was last doing a lot of heavy geospatial work, but even then they were very nice.

An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

pandas is very nice and was my bread and butter for a long time, but I frequently ran into memory issues and problems at scale with pandas, which I would never hit with polars or duckdb. I'm not sure if this holds true today as I know there's been updates, but it was certainly a problem then. Using geopandas ran into the same issues.

Just using GDAL and other libraries out of the box is frankly not a great experience. If you have a QGIS (another wonderful tool) workflow, it's frustrating to be dropping into Jupyter notebooks to do translations, but that seemed to be the best option.

In general, it just feels like geospatial analysis is about 10 years behind regular data analysis. Shapefiles are common because of ESRI dominance, but frankly not a great format. PostGIS is great, geopandas is great, but there's a lot more things in the data ecosystem than just Postgres and pandas. PowerBI barely had geospatial support a couple years ago. I think PowerBI Shapemaps exclusively used TopoJSON?

All of this is to say, DuckDB geospatial is very cool and helpful.

▲ngrilly 242 days ago

> An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

What was shockingly useful in PowerBI compared to DuckDB?

▲datadrivenangel 242 days ago

Graphics. Good BI tools are very effectively exploratory data analysis tools.

▲wodenokoto 243 days ago

Why do you use haver-sine over geodesic or reprojection?

I’ve been doing the reprojection thing, projecting coordinates to a “local” CRS, for previous projects mainly because that’s what geopandas recommend and is built around, but I am reaching a stage where I’d like to calculate distance for objects all over the globe, and I’m genuinely interested to learn what’s a good choice here.

▲code_biologist 242 days ago

Just an app dev, not a geospatial expert, but reprojection always seemed like something a library should handle under the hood unless one has specific needs. I'm used to the ergonomics / moron-proofing of something like Postgis' `ST_Distance(geography point1, geography point2)` and it gives you the the right answer in meters. You can easily switch to spherical or Cartesian distances if you need distance calculations to go faster. `ST_Area(geography geog)` and it gives you the size of your shape in square meters wherever on the planet.

▲dmillar 242 days ago

What the "right answer" is will vary widely on application and analyst. That's one reason there are so many coordinate ref systems. If you keep everything in 3857, you'll get answers in ~meters, but whether that's "right" depends on where and how large the distance or geom is and what your precision threshold is. So, really, everyone's needs are necessarily "specific."

▲colkassad 242 days ago

Look into Vincenty [1] or Karney (for a more robust solution) [2]. Vincenty should be good enough for most use cases.

[1] https://en.wikipedia.org/wiki/Vincenty's_formulae

[2] https://github.com/pbrod/karney

▲wenc 242 days ago

Reprojection is accurate locally but inaccurate at scale.

Geodesics are the most accurate (Vincenty etc) but are computationally heavy.

Haversine is a nice middle ground.

▲wodenokoto 242 days ago

But is it actually inaccurate at scale?

I get that drawing a projection is inaccurate at scale, but if I do all my calculation as meters north/south and east/west of 0,0 on the equator, won’t all my distance calculations be correct?

Like 5.000.000 east and 0 m north is 5.000km from the origin. I cannot see how that could ever become inaccurate as I move further away.

Where is the inaccuracy introduced? When I reproject back to coordinates on a globe?

▲wenc 241 days ago

Btw, you can plug your comment into ChatGPT and it'll give you a reasonable answer. The short answer is: distortions.

▲ 242 days ago

▲everybodyknows 242 days ago

Looking just at the Hilbert reference, I'm wondering why there is no function to return, for a given level of precision, the set of segments along the curve containing corresponding to a sub-rectangle of the space. Is this functionality packaged up elsewhere?

▲Demiurge 243 days ago

> Prior to this, getting up and running from a cold-start might’ve required installing or even compiling severall OSS packages, carefully noting path locations, standing up a specialized database… Enough work that a data generalist might not have bothered, or their IT department might not have supported it.

I've been able to "CREATE EXTENSION postgis;" for more than a decade. There have been spatial extensions for PG, MySQL, Oracle, MS SQL Server, and SQLite for a long time. DuckDB doesn't make any material difference in how easy it is to install.

▲wenc 243 days ago

That requires data to already be in Postgres, otherwise you have to ETL data into it first.

DuckDB on the other hand works with data as-is (Parquet, TSV, sqlite, postgres... whether on disk, S3, etc.) with requiring an ETL step (though if the data isn't already in a columnar format, things are gonna be slow... but it will still work).

I work with Parquet data directly with no ETL step. I can literally drop into Jupyter or a Python REPL and duckdb.query("from '*.parquet'")

Correct me if I'm wrong, but I don't think that's possible with Postgis. (even pg_parquet requires copying? [1])

[1] https://www.crunchydata.com/blog/pg_parquet-an-extension-to-...

▲Demiurge 243 days ago

Yeah, if you want to work with GeoParquet, and you want to keep your data in that format. I can see how that's easer to use your example. That's not what a lot of geospatial data is in. You might have shapefiles, geopackages, geojsons, who knows? There is a lot of software, from QGIS to ESRI to work with different formats to solve different problems. I don't think GeoParquet, even though it might be the fastest geospatial vector data format right now, is that common, and the article did not claim that either. So, given an average user trying to answer some GIS question, some ETL is pretty much a given, on average. And given that, installing PostGIS and installing DuckDB, both require some ETL, and learning some query and analytics language. DuckDB might be an improvement, but it's certainly not as much of a leap as quote is making it out to be.

▲sroerick 242 days ago

From the DuckDB geospatial docs:

SELECT * FROM './path/to/some/shapefile/dataset.shp';

COPY table TO 'some/file/path/filename.geojson' WITH (FORMAT gdal, DRIVER 'GeoJSON', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');

This seems extremely helpful to me.

▲Demiurge 242 days ago

And ogr2ogr is also very helpful:

ogr2ogr -f "PostgreSQL" -t_srs "EPSG:2274" PG:"host=host user=user dbname=database password=trustno1 schemas=schema" shapefile.shp -nln new_name_for_table

I'll give DuckDB a shot and compare to postgis. Eventually, I think the issue for me would be that I use all the vector outputs via apps connecting to the same database.

▲jeffbee 243 days ago

Yeah, just an example of a QoL issue with DuckDB: even though it can glob files in other cases, the way it passes parameters to GDAL means that globs are taken literally instead of expanded. So I can't query a directory with thirty million geojson files. This is not a problem in geopandas because ipython, being a full interactive development environment, allows me to produce the glob any way I choose.

I think this is a fundamental problem with the SQL pattern. You can try to make things just work, but when they fall then what?

▲maxxen 243 days ago

I think this is just cause it hasn't been implemented in spatial yet. DuckDB is currently going through a pretty big refactor of the way we glob/scan/union multiple files with all the recent focus on data lake formats, but my plan is to get to it in spatial after next release when that part of the code has stabilized a bit.

▲ffsm8 243 days ago

> fundamental problem with the SQL pattern.

SQL is a DSL and yes, all Domain Specific Languages will only enable what the engine parsing the DSL supports.

But all SQL database I'm aware of let you write custom extensions, which are exactly that: they extend the base functionality of the database with new paradigms. I.e postgis enabling geospatial in postgres or the extensions that enable fuzzy-matching/searching.

And as SQL is pretty much a turing-complete DSL, there is very little you can't do with it, even if the syntax might not agree with everyone

▲broner 242 days ago

You can use DuckDB in ipython to solve the globbing issue. Then you don't have to worry about OOMs with geopandas.

▲edoceo 243 days ago

Not wrong. Load to PG, then query. Duck UVP is like bringing 8 common tools/features under one tent.

▲paradox460 242 days ago

The original article feels a tremendous amount like another piece of DuckDB marketing, from the breathless admiration to the baseless claims like the title

▲keynesyoudigit 237 days ago

The benefits of a simple install were exaggerated, I think the real point is the cloud native integration and the sheer scalability

▲jokoon 242 days ago

I tested spatialite, it works okay, but the setup is a bit tedious when inserting data.

▲rcdwealth 242 days ago

[dead]

▲larsiusprime 243 days ago

“import geopandas” also exists and has for some time. Snark aside, WHAT is special about duckDB? I wish the author had actually shown some practical examples so I could understand their claims better.

▲maxxen 243 days ago

I replied to another comment, but I think a big part is that duckdbs spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.E there are no transitive dependencies except libc.

Yes, DuckDB does a whole lot more, vectorized larger-than-memory execution, columnar compressed storage and a ecosystem of other extensions that make it more than the sum of its parts. But while Ive been working hard on making the spatial extension more performant and more broadly useful (I designdd a new geometry engine this year, and spatial join optimization just got merged on the dev-branch), the fact that you can e.g. convert too and from a myriad of different geospatial formats by utilizing GDAL, transforming through SQL, or pulling down the latest overture dump without having the whole workflow break just cause you updated QGIS has probably been the main killer feature for a lot of the early adopters.

(Discmaimer, I work on duckdb-spatial @ duckdblabs)

▲timschmidt 243 days ago

I'm not the OP, but thank you for such a detailed answer. The integration and reduced barriers to entry you mention mirror my own experiences with tooling in another area, and your explanation made parallels clear.

▲carstonh 242 days ago

Here's an example building on @maxxen's work - because DuckDB (+spatial extension) can compile to Wasm, I built an in-browser Shapefile to CSV converter tool: https://www.honeycombmaps.com/tools/shapefile-to-csv-convert...

▲larsiusprime 243 days ago

This is an excellent reply and what I wish the article had been, thanks!

▲vemom 243 days ago

Oh yeah the article does stop abruptly after "2 lines to install"

▲Amadiro 242 days ago

Is there any strong reason to use GeoParquet instead of straight up parquet if all I'm interested in is storing and operating on lat/lons?

I'm curious if it compresses them better or something like that. I see lots of people online saying it compresses well (but mostly compared to .shp or similar) but normal parquet (.gz.parquet or .snappy.parquet) already does that really well. So it's not clear to me if I should spend time investigating it...

I mostly process normal parquet with spark and sometimes clickhouse right now.

▲carstonh 242 days ago

Based on my reading of the GeoParquet spec, the main difference is that geometries are stored as WKB using Parquet's byte array type. Byte arrays can be delta-encoded. There is also some additional metadata stored like CRS and a bounding box.

When using EPSG:4326 lat/lons, I don't think GeoParquet would give you any benefits over just having separate columns (this is what I typically do, and it's plenty fast).

If you are using range requests to fetch only parts of parquet files at a time, you could potentially sort your data using a hilbert curve, which could limit the number of row groups that need to be fetched to execute a query.

▲larodi 242 days ago

> a big part is that duckdbs spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.E there are no transitive dependencies except libc.

and for the last twenty, not ten years, this is what PostGIS was pioneering, and also teaching everyone get used to. DuckDB was not something that people even knew in GIS world. I'm not even sure whether QGIS connects to DuckDB, perhaps it does for a while, but it sure knows Spatialite for very long and last, but not least - ESRI sure as f*ck still have not heard of DuckDB. This is already half the geospatial world out there.

This whole article is superb biased and its very sad.

▲maxxen 241 days ago

Actually, they do https://www.esri.com/arcgis-blog/products/arcgis-pro/announc...

▲larodi 241 days ago

no surprise I missed it, 3.4 was only released in 25.11.2024 г. which is like yesterday in ESRI terms, given price and speed of adoption of new versions (note: many .gov still use the 10.x/11.x branch).

which does not change my original statement that nobody cared about DuckDB for very long, while the whole server-side processing idea is largely based on PostGIS.

▲dahauns 242 days ago

As someone unfamiliar with DuckDB but at least somewhat with geospatial tools (it's been a few years, though): Dang - see, now that is seriously cool. The whole ETL shebang always was the biggest hassle, even with serious commercial tools, and the idea of a stable, all-in-one, ready-to-go layer is incredibly appealing.

It's just something the writer of the article should probably have at least mentioned when going full hyperbole with the title (hey, after reading this, it might actually be justified! :) ).

Rereading the article that focuses on the one thing that isn't a standout (the installation itself), though, I can't help but chuckle and think "worst sales pitch ever". ;)

▲tmpz22 243 days ago

I've been researching DuckDB - while it has many technical merits I think the main argument will be ease of use. It has a lot of the operational advantages of sqlite paired with strong extensibility and good succinct documentation.

Folks who have been doing DevOps work are exasperated with crummy SaaS vendors or antiquated OSS options that have a high setup cost. DuckDB is just a mature project that offers an alternative, hence an easy fan favorite among hobbyists (I imagine at scale the opportunity costs change and it becomes less attractive).

▲wenc 243 days ago

How is the adoption among DevOps folks?

I'm still getting feedback that many devs are not too comfortable with reading and writing SQL. They learned simple SELECT statements in school, but get confused by JOINs and GROUP BYs.

▲edoceo 243 days ago

Random voice here: they should get better at SQL. Not 9 joins and GROUP BY and HAVING and other magic. But two joins and GROUP for sure. If one gets the 3NF already then join and others are a quick (2 week) learn.

I'd pick that before traveling the DuckDB path.

▲tmpz22 242 days ago

It’s not a good database for many storage workloads because it only allows a single process to write to it. For example you wouldn’t want 20 app servers writing their logs or metrics or traces to a single duck db instance. You would instead have them write rotating log files that are somehow aggregated then queried through duck db via something like grafana.

So the operational savings are more for using it as a light weight data science tool that excels in specific workloads like geospacial. Wherever you need to do those computations more locally it will excel.

I don’t really think sql is a mandated devops skill, though a basic understanding of databases certainly is. Between online content, LLMs, as long as you can iteratively improve a query part by part through explain analyze etc it’ll be sufficient.

▲wenc 242 days ago

DuckDB can be thought of as an analytics engine rather than storage engine. It’s OLAP rather than OLTP.

▲Yeroc 243 days ago

Since when is strong SQL knowledge not a core skill for developers? I suppose it's the rise of frontend vs backend specialization that is the cause?

▲wenc 243 days ago

I don't know if it's ever been a core skill. You can ask any dev. Many tell me SQL is not something they use every day, so even if they learned it, they usually forget it in no time.

SQL is burned into my muscle memory because I work on ML models with large datasets. It took me a lot of trial and error to get decent at SQL. I imagine most devs just don't have the reps because their work rarely exposes them to SQL.

▲vasco 243 days ago

There's no point in learning any much deeper SQL anymore, AI assistants have largely solved SQL querying. Just ask for what you want with natural language.

▲wenc 243 days ago

I don't quite agree.

SQL queries are one area where correctness matters a lot, because downstream applications rely on them to be correct. If you get a query wrong, especially in an ETL process, you can generate a lots of garbage data for a very long time (I'm speaking from lived experience). It might take a long time to correct (via backfill) and sometimes the original data might no longer be available.

I use LLMs to generate ideas for writing complex SQL, but I always need to understand the query 100% first before I deploy. Mistakes are costly, and sometimes irreversible. I need to trust but verify, which requires deep SQL knowledge.

To write correct SQL, you not only have to know the syntax, but also the schema (easily provided to an LLM) and the expected data values (you can provide a sampling to an LLM which helps, but domain knowledge helps even more). There are lots of surprising things in data that only reveal themselves via visual inspection. (though I think with MCO, an LLM will one day be able to inspect the data itself by sampling the database)

▲8note 242 days ago

this suggests that i should expect that dara coming out of SQL is likely to be wrong, since lots of people writing it dont have the deep knowledge, and its very hard to verify correctness of the results since there is limited to no ground truth.

if i already expect it to be making at least some of a mess, why not have AI as part of the setup?

▲ 242 days ago

▲vasco 242 days ago

Testing SQL for correctness is besides the point of how the SQL was generated. It's important of course to review and test anything. My point is SQL writing is particularly good already with current state of the art of assistants for it to be worth it to spend much time hand crafting complex queries.

▲ 242 days ago

▲usr1106 243 days ago

While AI assistants can give ideas how to address something I'd strongly contradict the statement "there is no point in learning". You should understand the AI suggestion because it could also be very bad or working incorrectly. (If it hallunicated, you'll note the syntax error, that does indeed not require learning...)

▲rotten 239 days ago

Getting a sql query to optimal performance is still much more of an art than a specific science. Having the LLM generate a query that appears to work (correctness issues aside), is much more likely than the LLM generating an optimal performing query. While this may not matter for one-off queries common in analytics, when you start worrying about scalability, even the tiniest tweaks can make a huge difference.

▲jjtheblunt 243 days ago

duckdb has parquet support and can operate, in SQL syntax, on enormous 'tables' spread across huge collections of parquet files as if one virtual file. i believe the underlying implication is opportunities to leverage vector instructions on parquet. it's very "handy".

▲getnormality 243 days ago

Everything is special about DuckDB. Pandas is way, way behind the state of the art in tabular data analysis.

▲ 243 days ago

▲dbreunig 243 days ago

Author here: what's special is that you can go from 0 to spatial data incredibly quickly, in the data generalist tool you're already using. It makes the audience of people working with geospatial data much bigger.

(Geopandas is great, too.)

▲dopidopHN 243 days ago

I’m very familiar with Postgres and spinning one with postgis seems easy enough. Do I get more with duckdb?

Most of the time I store locations and compute distance to them. Would that being faster to implement with duckdb

▲wenc 243 days ago

Probably no difference for your use-case (ST_Distance). If you already have data in Postgres, you should continue using Postgis.

In my use case, I use DuckDB because of speed at scale. I have 600GBs of lat-longs in Parquet files on disk.

If I wanted to use Postgis, I would have to ingest all this data into Postgres first.

With DuckDB, I can literally drop into a Jupyter notebook, and do this in under 10 seconds, and the results come back in a flash: (no need to ingest any data ahead of time)

  import duckdb
  duckdb.query("INSTALL spatial; LOAD spatial;")
  duckdb.query("select ST_DISTANCE(ST_POINT(lng1, lat1), ST_POINT(lng2, lat2)) dist from '/mydir/*.parquet'")

▲vasco 243 days ago

I haven't yet understood this pattern (and I tried using duckdb). Unless you're only ever going to query those files once or twice in your life, importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.

Also as a side note, is everyone just using DuckDB in memory? Because as soon as you want some multiple session stuff I'd assume you'd use DuckDB on top of a local database, so again I don't see the point but I'm sure I'm missing something.

▲wenc 243 days ago

> importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.

Usually new data is generated regularly and would require creating a separate ETL process to ingest into Postgres. With DuckDB, no ETL is needed. New Parquet files are just read off the disk.

> Also as a side note, is everyone just using DuckDB in memory?

DuckDB is generally used as a single-user, and yes in-memory use case is most common. Not sure about use cases where a single user requires multiple sessions? But DuckDB does have read concurrency, session isolation etc. I believe write serialization is supported in multiple sessions.

With Parquet files, it's append-only so the "write" use-cases tend to be more limited. Generally another process generates those Parquet files. DuckDB just works with them.

▲indeyets 242 days ago

> Usually new data is generated regularly

This part was not obvious. In a lot of cases geodata is mostly stable and reads/searches dominate over appends. And that’s why we keep this in DB (usually postgis, yes).

So DuckDB is optimised for very different use case and it is not always obvious when it’s mentioned

▲wenc 242 days ago

This is the more trivial use case (static data, heavy reads) and DuckDB is absolutely optimized for this use case.

DuckDB also provides a vectorized, parallelized engine. When I run a query all of my 32 cores light up on htop.

▲simlevesque 242 days ago

But DuckDB works just as well with static data.

▲lugarlugarlugar 243 days ago

I think the main point is not having to store a duplicate of the 600GB of input data.

▲touisteur 242 days ago

And now I'm curious whether there's a way to actually index external files (make these queries over 600GB faster) and have this index (or many indices) be persistent. I might have missed that when I looked at the docs...

▲wenc 242 days ago

If the data is in Parquet they are already indexed in a sense. No further indexing necessary.

If they are stored in DuckDB’s native format (which I don’t use), it supports some state of the art indices.

https://duckdb.org/docs/stable/sql/indexes.html

I find Parquet plenty fast though.

▲touisteur 241 days ago

Ah thanks, of course. I was thinking of dealing with millions of (Geo)JSON files adding up to terabytes, without copying/duplicating them though, mostly indexing. I used to do that with postgres foreign data wrappers and had hopes for duckdb :-). But that's a question for SO or other forum.

▲tsss 243 days ago

For one it doesn't have the god awful pandas API.

▲hokusad 242 days ago

Yeah, that's one of my favourite aspects to it

▲joshvm 243 days ago

I haven't used duckDB but the real comparison is presumably postgis? Which is also absent from the discussion, but I think what the author alludes to.

I have no major qualm with pandas and geopandas. However I use it when it's the only practical solution, not because I enjoy using it as a library. It sounds like pandas (or similar) vs a database?

▲stevage 243 days ago

Yeah, PostGIS is readily available, and postgres is much more widely used than DuckDB. Either I don't understand OP's argument for why this is so important or I just don't buy it.

If you're using JavaScript you install Turf. The concept that you can readily install spatial libraries is hardly earth shattering.

▲tom_m 243 days ago

Convenience will always be a personal preference.

▲serjester 242 days ago

Ask anyone that's just starting out with geo pandas about their experience, and I'd be shocked if anyone calls it intuitive and straightforward. To geopandas credit, I think they just inherited many of Pandas' faults (why does a user need to understand indexes to do basic operations, no multi core support, very poor type hinting, etc).

▲jparishy 243 days ago

I work on geospatial apps and the software I think I am most excited about is https://felt.com/. I want to see them expand their tooling such that maps and data source authentication/authorization was controllable by the developer, to enable tenant isolation with proprietary data access. They could really disrupt how geospatial tech gets integrated into consumer apps.

This article doesn't acknowledge how niche this stuff is and it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc. I would replace a lot of my custom built mapping tools with Felt if it were possible, so I could focus on our core geospatial processes and not the code to display and play with it in the browser, which is almost as big if not bigger in terms of LOC to maintain.

As mentioned by another commenter, this DuckDB DX as described is basically the same as PostGIS too.

▲dbreunig 243 days ago

Author here: the beauty of DuckDB spatial is that the projections and CRS options are hidden until you need them. For 90% of geospatial data usage people don't and shouldn't need to know about projections or CRS.

Yes, there are so many great tools to handle the complexity for the capital-G Geospatial work.

I love Felt too! Sam and team have built a great platform. But lots of times a map isn't needed; an analyst just needs it as a column.

PostGIS is also excellent! But having to start up a database server to work with data doesn't lend itself to casual usage.

The beauty of DuckDB is that it's there in a moment and in reach for data generalists.

▲korkoros 243 days ago

My experience has been that data generalists should stay away from geospatial analysis precisely because they lack a full appreciation of the importance of spatial references. I've seen people fail at this task in so many ways. From "I don't need a library to reproject, I'll just use a haversine function" to "I'll just do a spatial join of these address points in WGS84 to these parcels in NAD27" to "these North Korean missiles aren't a threat because according to this map using a Mercator projection, we are out of range."

DuckDB is great, but the fact that it makes it easier for data generalists to make mistakes with geospatial data is mark against it, not in its favor.

▲groggo 243 days ago

I remember learning about the infamous missile threat map in a GIS class -

https://georeferenced.wordpress.com/2014/05/22/worldmapblund... https://www.economist.com/asia/2003/05/15/correction-north-k...

▲jparishy 243 days ago

I think we're mostly making the same point about complexity, ya.

To me, I think it's mostly a frontend problem stopping the spread of mapping in consumer apps. Backend geo is easy tbh. There is so much good, free tooling. Mapping frontend is hell and there is no good off the shelf solution I've seen. Some too low level, some too high level. I think we need a GIS-lite that is embeddable to hide the complexity and let app developers focus on their value add, and not paying the tax of having frontend developers fix endless issues with maps they don't understand.

edit: to clarify, I think there's a relationship between getting mapping valued by leadership such that the geo work can be even be done by analysts, and having more mapping tools exist in frontend apps such that those leaders see them and understand why geo matters. it needs to be more than just markers on the map, with broad exposure. hence my focus on frontend web. sorry if that felt disjointed

▲dbreunig 243 days ago

Not disjointed at all. That last topic is the big challenge to solve.

▲febed 242 days ago

Last I checked DuckDB spatial didn’t support handling projections. It couldn’t load the CRS from a .prj file. This makes it useless for serious geospatial stuff.

▲jandrewrogers 243 days ago

> it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc

This can mostly be avoided entirely with a proper spheroidal reference system, computational geometry implementation, and indexing. Most uses of geospatial analytics are not cartographic in nature. The map is at best a presentation layer, it is not the data model, and some don’t use a map at all. Forcing people to learn obscure and esoteric cartographic systems to ask simple intuitive questions about geospatial relationships is a big part of the problem. There is no reason this needs to be part of the learning curve.

I’ve run experiments on unsophisticated users a few times with respect to this. If you give them a fully spheroidal WGS84 implementation for geospatial analytics, it mostly “just works” for them anywhere on the globe and without regard for geospatial extent. Yes, the software implementation is much less trivial but it is qualitatively superior UX because “the world” kind of behaves how people intuit it should without having to know anything about projections, transforms, etc. And to be honest, even if you do know about projections and transforms, the results are still often less than optimal.

The only issue that comes up is that a lot of cartographic visualization toolkits are somewhat broken if you have global data models or a lot of complex geometry. Lots of rendering artifacts. Something else to work on I guess.

▲maxxen 243 days ago

Im inclined to agree, but unfortunately a huge amount of the existing data and processes in this space does not assume a spheroidal earth and come provided with a coordinate reference system. Ultimately there are also some domains where you got data that you explicitly don't want to interpret using spheroidal semantics, e.g. when working with a city plan - in which case the map _is_ the data model, and you definitely want the angles of a triangle to sum up to 180.

▲jparishy 243 days ago

Similarly I see it as it an inevitable that you will deal with a problem related to these issues, then have a massive blocker if you aren't already familiar with the details because debugging requires more than a cursory understanding for hard problems. In the alternative, you don't know the problems exist and the code is broken. I don't think it's as simple as abstracting away the entire concept, which is what I would say is too high level like I mentioned above. I don't know the right answer here honestly, I think it will be disruptive when someone figures it out

▲groggo 243 days ago

> the software implementation is much less trivial

Aren't most geospatial tools just doing simple geometry? And therefore need to work on some sort of projection?

If you can do the math on the spheroidal model, ok you get better results and its easier to intuit like you said, but it's much more complicated math. Can you actually do that today with tools like QGIS and GDAL?

▲jandrewrogers 243 days ago

Many do use simple geometry. This causes endless headaches for people who are not cartographers, they don’t expect that. The good geospatial tools usually support spheroidal models but it is not the default, you have to know to explicitly make sure it uses that (many people assume that is the default).

An additional issue is that the spheroidal implementations have undergone very little optimization, perhaps because they are not the defaults. So when people figure out how to turn them on, performance is suddenly terrible. Now you have people that believe spheroidal implementations are terribly slow, when in reality they just used a pathologically slow implementation. Really good performance-engineered spheroidal implementations are much faster than people assume based on the performance of open source implementations.

▲jofer 242 days ago

For what it's worth, you _can't_ use spherical approaches for most data. They're only used for points, in practice. Your spatial data is inherently stored/generated in ways that don't allow spherical approaches as soon as you start working with polygons, let alone things like rasters.

Yes, spherical representations of polygon data exist, but the data you import has already been "split" and undoing that is often impossible, or at best non-trivial. And then rasters are fundamentally impossible to represent that way.

Analysis uses projections for that reason. Spherical approaches aren't fundamentally "better" for most use cases. They're only strictly better if everything you're working with is a point.

There's more to geospatial than point datasets.

▲groggo 242 days ago

A good point. Certainly for raster analysis it doesn't make sense.

But any type of vector data could be modeled on a sphere, right? Points, shapes, lines. And I saw "better" because even the best suited projection will have some small amount of distortion.

Either way, most things use planer geometry so projections are necessary, and you need to have some understanding of how all that works

▲jofer 242 days ago

You can model polygons on a sphere, but the issue is that the data you're starting with is already in a cartesian representation. You actually can't easily convert between the two for complex geometries in cases where they cross the antimeridian/poles. So trying to do anything other than points is difficult in practice, unless you're natively generating data from scratch in a spherical representation, which is rate.

▲Demiurge 243 days ago

This is not really a problem, unless you’re trying to simulate some 3D space orbits, physics. The crossover from geo INFORMATION systems to geo simulation systems is a bit rough, but the projections and calculations on projected cartesian space are enough for many typical questions, like distance, area, routing. However, even topology support starts getting specialized, and the use cases are more niche. I think it’s asking a bit too much from a database/storage layer to do efficient calculations outside of those supported by GEOS. At this point, you might want to import the relevant data into higher level applications.

▲jandrewrogers 243 days ago

Speaking for myself, I was not referring to any kind of simulation systems. This is a standard requirement of many operational geospatial data models, and there are a lot of these in industry. Anything that works from a projection is a non-starter, this causes demonstrable issues for geospatial analysis at scale or if any kind of precision is required. Efficient calculation just means efficient code, there is nothing preventing this from existing in open source beyond people writing it. Yes, you may be able to get away with it if your data model is small, both geographically and data size, but that does not describe every company.

It is entirely possible to do this in databases. That is how it is actually done. The limitations of GEOS are not the limitations of software, it is not a particularly sophisticated implementation (even PostGIS doesn’t use it for the important parts last I checked). To some extent you are affirming that there is a lack of ambition in this part of the market in open source.

▲urschrei 242 days ago

I wouldn't say it's correct to say that GEOS isn't particularly sophisticated. A lot of (certainly not all) GEOS algorithms are now ported from JTS, the primary author of which is Martin Davis (aka Dr JTS), who works at Crunchy Data, who provide the PostGIS extension. So the chain (again, mostly) goes JTS -> GEOS -> {PostGIS, Shapely} -> … . Martin's work is at the cutting edge of open-source GIS-focused computational geometry, and has been for a long time (of course, industry has its own tools, but that's not what we're talking about).

I can sort of see your point about the merits of global, spheroidal geometry, certainly from a user's perspective. But there's no getting around the fact that the geometry calculations are both slower (I'm tempted to say "inherently"…) and far more complex to implement (just look at how painful it is to write a performant, accurate r- or r*-tree for spherical coordinates) along every dimension. That's not going to change any time soon, so the projection workflow probably isn't going anywhere.

▲Demiurge 242 days ago

You're right in that pretty much anything can be done via an API exposed via a database function. However, as they say... if it can be done, does mean it should? Now, I agree that having more sophisticated multi-dim calculations would be cool, but I've just rarely ran into needing or even wanting to do this, over many projects, some involving accurate simulations. In practice, database has always been for storing and querying data, which can be extremely accurate. I am probably the first person to abuse SQL and I've written some 3D ECEF rotation code in SQL, but it was a hack for a deadline, not because it was the right thing to do. All the projects I've worked with, had external models or components that did the "precise work" using complex code that I would never dare to make dependent on any database.

I'm actually curious, speaking for yourself, what kind of analysis you're doing where something like NAD83, or UTM does not give you enough precision? Is this actually "real world" geospatial data? If I have a soil model, I have a very localized analysis, and if I have a global climate model, we're talking kilometers for grid cells. In all these cases, the collected data has built in geolocation error MUCH grater than most decent projections...

So, what analysis are you doing where you need centimeter precision at global scale of thousands of kilometers? Sounds really interesting. The only time I've seen this, is doing space flight simulations where the error really accumulates into the future.

▲kashifr 242 days ago

Have you tried https://geobase.app/ they recently also had a post about duckdb integration: https://geobase.app/blog/duckdb-1-1-3

▲jparishy 242 days ago

I had not, looks pretty cool but solves the inverse of the problem as I see it. I want a backend agnostic frontend toolset that is a GIS that I can customize to my needs. I don't want to implement the tools myself, that's too low level. I don't want the service to manage, control, or own the data, that's too high level. There's a sweet spot I don't think is being hit yet.

▲stevage 243 days ago

I was just about to get into Felt then they took away the free tier and made it very expensive.

▲mtmail 243 days ago

https://atlas.co/ still has a free tier. Less features I think, depends on your use case of course.

▲lugarlugarlugar 243 days ago

[dead]

▲fastasucan 241 days ago

>As mentioned by another commenter, this DuckDB DX as described is basically the same as PostGIS too.

No, its not.

▲perrygeo 242 days ago

Why? The article is light on details. Yes, having spatial analysis combined with SQL is awesome and very natural. There's nothing special about 2D geometries that makes them significantly different from floats and strings in an RDBMS perspective - geometry is just another column type, albeit with some special operators and indexes. We've been doing it with PostGIS, Spatialite, etc for two decades at this point.

What DuckDB brings to the table is cloud-native formats. This is standard geospatial functionality attached to an object store instead of a disk. As such, it doesn't require running a database process - data is always "at rest" and available over HTTP. I'm not downplaying the accomplishment, it's really convenient. But know that this is a repackaging of existing tech to work efficiently within a cloud IO environment. If anything, the major innovation of DuckDB is in data management not geospatial per se.

▲mpalmer 242 days ago

The argument seemed pretty clear to me; more accessible tooling means more users and contributors. But as an apparent SME I could see how you'd feel clickbaited by the title.

▲perrygeo 242 days ago

Mainly a point of clarification - I don't think DuckDB represents anything new in geospatial. It represents a new paradigm for data management and data architecture. Understanding the difference isn't optional.

DuckDB handles floating point numbers too - is DuckDB the most important thing in floating point data? Of course not, the data types and operators haven't changed. The underlying compute environment has. That's where the innovation is. I'd simply appreciate if technical writers took two seconds to make this vital distinction - being precise isn't hard and I don't know what we gain by publishing intentionally muddled technical articles.

▲wodenokoto 243 days ago

I’m not sure I agree that “install geospatial” is a game changer in simplicity compared to “pip install geopandas”.

They are both one line.

▲maxxen 243 days ago

I think a big part is that duckdbs spatial extension doesnt have any transitive dependencies (except libc). It statically packages the standard suite of foss gis tools (including a whole database of coordinate systems) for multiple platforms (including WASM) and provides a unified SQL interface to it all.

(Disclaimer, I work on duckdb-spatial @duckdblabs)

▲jessekv 242 days ago

I tried out duckdb-spatial for a hobby project and my takeaway is the main thing it offers over geopandas is performance in batch processing.

If I can reduce my spatial analysis to SQL and optimize that SQL a little bit, duckdb will happily saturate my CPUs and memory to process things fast in a way that a Python script struggles to do.

Yes, the static linking is nice. It helps with the problem of reproducible environments in Python. But that's not a game changer IMO.

▲lillecarl 243 days ago

"Accessibility" is too often dismissed, yes you CAN do things with things, but getting people to do it is a craft and art. This is often where open-ish-core differs from the enterprise version of something too

▲WD-42 243 days ago

Is it that much simpler than ‘load extension postgis’? I know geos and gdal have always kinda been a pain, but I feel like docker has abstracted it all away anyway. ‘docker pull postgis’ is pretty easy, granted I’m not familiar with what else duckdb offers.

▲dbreunig 243 days ago

Yes. The difference between provisioning a server and running 'install spatial' in a CLI is night and day.

Docker has been a big improvement (when I was first learning PostGIS, the amount of time I had to hunt for proj directories or compile software just to install the plugin was a major hurdle), but it's many steps away from:

``` $ duckdb D install spatial; ```

▲Demiurge 243 days ago

What do you mean by "provisioning a server"? That's a strange requirement. You can install Postgis on a macbook in one command, or actually on all 3 major OS's in one command: "brew install postgis", "apt-get install postgresql-postgis, and "choco install postgis-9.3". Does DuckDB not require a "server" or a "computer"? What does Docker have to do with anything? This is a very confusing train of thought.

▲ 243 days ago

▲debugnik 242 days ago

> Does DuckDB not require a "server"

No, it works like SQLite.

▲Doctor_Fegg 242 days ago

PostGIS is included in Postgres.app which is a single executable for Mac. DuckDB appears also to be a single file download for Mac. I’m not sure your “when I was first learning PostGIS” experience reflects the current situation.

https://postgresapp.com/

▲frainfreeze 243 days ago

I mean I like duckdb but this feels like you're pushing for it. On my system postgis comes from apt install, and it's one command to activate the "plugin". Is the night and day part not having to run random sh script from the internet to install software on my system?

▲tomnipotent 243 days ago

DuckDB doesn't require a running server. I run duckdb in a terminal, query 10,000 CSV or parquet files and run SQL on them while joining to data hosted in sqlite, a separate duckdb file using its native format, or even Postgres.

▲dbreunig 243 days ago

That’s great! The difference is you’re familiar and know how to do that

Getting started from 0 with geo can be difficult for those unfamiliar. DuckDB packages everything into one line with one dependency.

▲_boffin_ 243 days ago

https://postgresapp.com

▲frainfreeze 243 days ago

It is not simpler. I use it with testcontainers in the notebooks usually https://testcontainers-python.readthedocs.io/en/latest/

▲twelvechairs 243 days ago

DuckDB is a great thing for geospatial but most important of the past decade? There's so many tools in different categories it wouldnt come near top for me. Some might be QGIS, postGIS (still the standard), ArcGIS online (still the standard), JS mapping tools like mapbox (i prefer deckgl), new data types like COG, geopackage and geoparquet, photogrammetry tools, 3d tiles, core libraries like gdal and now pdal, shapely, etc.

▲dbreunig 243 days ago

Most of those tools came out circa ~2000.

Yeah, I feel old.

▲twelvechairs 242 days ago

It clearly says "most important of the past 10 years" not "most important that has been invented in the past 10 years". Even taking your definition that would narrow down the list like half maybe and you should probably know that

▲ 243 days ago

▲willtemperley 242 days ago

I have some concerns regarding licensing of DuckDB and GEOS which DuckDB spatial depends on. The former is MIT licensed and the latter LGPL 2.1.

This leads to some complex situations where some builds would contravene LGPL 2.1 e.g static linking with a closed source application.

▲oreilles 242 days ago

Chiming in to promote a similar project, a geospatial extension for Polars [1] I'm working on. It's not stable yet (abeit pretty close to), but is already pretty feature complete (it uses GEOS and PROJ as a backend, so has parity with GeoPandas).

[1] https://github.com/oreilles/polars-st/

▲serjester 242 days ago

I think this is part of a broader trend of geospatial data just becoming easier to work with. DuckDB is great for quick ad hoc stuff, but I find polars to be easier to maintain. Personally, I'm really excited for polars to (eventually) add true geospatial support. In the meantime, creating a custom h3 plugin only took a couple days and it simplified massive parts of our old geo pandas / duckdb code. The faster we can completely get rid of geo pandas, the better.

[1] https://github.com/Filimoa/polars-h3

▲mettamage 242 days ago

Is it possible with polars to store data more efficiently than a database though?

I work with polars, but I haven't delved too deep into the performance characteristics versus postgres or anything like that.

▲fifilura 242 days ago

I have been using Trino/AWS Athena for some geospatial work.

It's API is not very well covered, some parts are still missing, but it must just be a matter of time.

Where it shines is when you need to do an O(n2) or O(nm) type of calculation. Then those 100s of free CPU cores really come in handy! And the end result is pretty often a dollar for CPU-days worth of computation.

Example of O(nm) calculation are things like finding the closest road segment inside a tile (or more likely a tile and it's surrounding tiles), for each point in a list.

▲feverzsj 243 days ago

If you are doing lots of spatial index queries, it's actually much slower than SpatiaLite. Because DuckDB uses column-wise storage.

▲elchief 242 days ago

I wish to god they would allow 2 connections. One read-only for my BI tool, and one read-write for dbt/sqlmesh

▲kianN 242 days ago

I think using a pyarrow dataset as an intermediary would allow for a zero copy read/write from one connection into a second connection.

▲dmillar 242 days ago

DuckDB > geopandas, certainly for anything out of core. Though, I recently gave up on importing 70GB worth of large multipolygons (from a csv in hex wkb), and just used a postgis container. In concert with DuckDB's growth, I'd also mark the advent of geoparquet.

The big change, in my view, over the past decade in GIS software, is in compute and storage efficiency across the typical stack. DuckDB has become a part of this, but h/t to the advances from shapely, geopandas, geoparquet, and GDAL. There's a lot of overlap in that venn diagram, and credit should be spread around. QGIS is great, too, though I feel there is market opportunity to apply 90/10 to its massive feature set and move it to the web.

▲patja 243 days ago

SQL Server has geospatial capabilities without any extensions or add-ons. I've been happily using geospatial datatypes on the free Express version for years, probably well over a decade.

▲kriro 242 days ago

That's a pretty grandiose statement and frankly the kind of advertisement I'm not a fan of at all.

If you want to import something and work with it GeoPandas exists. If you want something integrated with a SQL database, PostGIS exists.

On the application side of the spectrum, GRASS GIS, QGIS etc. say hi. They are being used in the agriculture industry and by government agencies (at least I know that's the case in Germany and Brazil).

▲vincnetas 242 days ago

DBeaver (db client) has built in support for displaying geo data. In my case postgis results. I see that duckdb spatial functions are almost identical to postgis ones.

https://dbeaver.com/docs/dbeaver/Working-with-Spatial-GIS-da...

▲fidotron 243 days ago

Honestly, I think it's actually https://www.uber.com/en-CA/blog/h3/

▲jsemrau 243 days ago

Makes one wonder if the YOLO algorithm would work better with hexagons.

"Hexagons were an important choice because people in a city are often in motion, and hexagons minimize the quantization error introduced when users move through a city. Hexagons also allow us to approximate radiuses easily, such as in this example using Elasticsearch."

[Edit]Maybe https://www.researchgate.net/publication/372766828_YOLOv8_fo...

▲sroerick 242 days ago

Could you elaborate on this? I experimented with h3 a bit for queries but I have never used it in production. Obviously, it's very effective for Uber but I wonder if you have any other experience with it

▲aynyc 242 days ago

How big are the data sets? I've been trying to get duckdb to work in our company on financial transactions and reporting data. The dataset is around 500GB CSV in S3 and duckdb chokes on it.

▲wenc 242 days ago

CSV is a pretty bad format any engine will choke on it. It basically requires a full table scan to get at any data.

You need to convert it into Parquet or some columnar format that lets engines do predicate pushdowns and fast scans. Each parquet file stores statistics about the data it contains so engines can quickly decide if it’s worth reading the file or skipping it altogether.

▲nojito 242 days ago

CSV are a poor format to access from S3.

Should convert them to parquet then access and analytics becomes cheap and fast.

▲aynyc 242 days ago

I agree. That's how our data is produced. We constantly generate real time data into CSV. As far as I can tell, I can't append to parquet file.

▲wenc 242 days ago

Parquet files are already built for append only. Just add a new file.

This is a new paradigm for folks who aren’t in big data — the conventional approach usually involves doing a row INSERT. In big data, appending simply means adding a new file - the database engine will immediately recognize its presence. This is why “select * from ‘*.parquet’” will always operate on the latest dataset.

▲aynyc 242 days ago

Wait, so I create a new file for every message?

▲wenc 242 days ago

Typically small data is batched. While theoretically you could, I wouldn't create 1 file per row (there would be too many files and your filesystem would struggle). But maybe you can batch 1 day's worth of data (or whatever partitioning works for your data) and write to 1 parquet file?

For example, my data is usually batched by yearwk (year + week no), so my directory structure looks like this:

  /data/yearwk=202501/000.parquet
  /data/yearwk=202502/000.parquet

This is also called the Hive directory structure. When I query, I just do:

  select * from '/data/**/*.parquet';

This is a paradigm shift from standard database thinking for handling truly big data. It's append-only by file.

500GB in CSVs doesn't sound that big though. I'm guessing when you convert to Parquet (a 1-liner in DuckDB, below) it might end up being 50GBs or so.

  COPY (FROM '/data/*.csv') TO 'my.parquet' (FORMAT PARQUET);

▲aynyc 241 days ago

I was really surprised duckdb choked on 500GB. That's maybe a week's worth of data.

The partitioning of partquet files might be an issue as not all data are neatly partitioned by date. We have trades with different execution dates, clearance dates and other date values that we need query on.

▲wenc 241 days ago

It doesn’t usually choke on 500 gb of data. I query 600 gb (equivalent to a few TBs of CSVs?) of parquets daily. It’s not the size of the data. It’s the type of data.

If date partitioning doesn’t work, just find another chunking key. The key is to get it into parquet format. CSV is just hugely inefficient.

Or spin up a larger compute instance with more memory. I have 256gb on mine.

I tried running an Apache Spark job (8 machine cluster) on a data lake of 300 Gb of TSVs once. This was a distributed cluster. There was one join in it. It timed out after 8 hours. I realized why — Spark had to do many full table scans of the TSVs and it was just so inefficient. CSV formats are ok for straight up reads, but any time you have to do analytics operations like aggregate or join them at scale, you’re in for a world of pain.

DuckDB has better CSV handling than Spark but a large dataset in a poor format will stymie any engine.

▲aynyc 241 days ago

We have a spark cluster too. Then switch to Athena. I just dislike the cost structure.

The problem with disk based partition is keys are difficult to manage properly.

▲wenc 241 days ago

Did Athena on CSV work for you? I've used Athena and it struggles with CSV at scale too.

Btw I'm not suggesting to use Spark. I'm saying that even Spark didn't work on large TSV datasets (it only takes a JOIN or GROUP BY to kill the query performance). The CSV data storage format is simply the wrong one for analytics.

Partitioning is irreversible, but coming up with a thoughtful scheme isn't that hard. You just need to hash something. Even something as simple as a HNV hash on some meaningful field is sufficient. In one of my datasets, I chunk it by week, then by HNV modulo 50 chunks, so it looks like this:

/yearwk=202501/chunk=24/000.parquet

Ask an LLM to suggest partioning scheme or think of one.

CSV is the mistake. The move here is to get out of CSV. Partitioning is secondary -- partitioning here is only used for chunking the Parquet, nothing else. You are not locked into anything.

▲aynyc 240 days ago

Yes, on Athena, we process much larger CSV files. But the cost is too crazy. We also have ORC and Parquet files for other dataset which we process with EMR Spark. I really want to get off those distributed analytic engines whenever possible.

I have to think about partition, Spark/Athena both had issues with partitioning by received date. They are scanning way too much data.

▲snake_doc 242 days ago

Are you querying from an EC2 instance close to the S3 data? Are the CSVs partitioned into separate files? Does the machine have 500GB of memory? It’s not always duckdb fault when there can be a clear I/O bottleneck…

▲aynyc 242 days ago

No, the EC2 instance doesn't have 500GB of data. Does DuckDB require that? I actually downloaded the data from S3 to local EBS and still choked.

▲broner 242 days ago

Works fine for me on TB+ datasets. Maybe you were doing in-memory rather than persistent database and running out of RAM? https://duckdb.org/docs/stable/clients/cli/overview.html#in-...

▲aynyc 242 days ago

Wait, do you insert the data from S3 into duckdb? I was just doing select from file.

▲broner 241 days ago

Nope, just reading from S3. Check this out: https://duckdb.org/2024/07/09/memory-management.html

▲fastasucan 241 days ago

Maybe its your terminal that chockes because it tries to display to much data? 500GB should be no problem.

▲higeorge13 242 days ago

Could you test with clickhouse-local? It always works better for me.

▲aynyc 242 days ago

No, clickhouse is not considered for some other reason. But I think I might revisit it sometime in the future.

▲badmonster 243 days ago

How might embedding spatial capabilities directly into general-purpose data tools like DuckDB reshape who participates in geospatial analysis—and what kinds of problems they choose to solve?

▲quasarj 242 days ago

I've been out of the loop on DuckDB, where can I get a real overview of what the excitement is about? It just looks like a new sqlite from the CLI...

▲Groxx 242 days ago

It's essentially a column-oriented sqlite, so it's much faster for some kinds of queries. There's a big market for that category of queries, and nobody wants to run their own hadoop cluster to do single-machine-scale stuff.

▲ 242 days ago

▲bingaweek 243 days ago

We need a "come on" clause for these absurd headlines. Come on.

▲ 243 days ago

▲isuckatcoding 242 days ago

I was expecting example queries but all I got was how to install a package :(

▲jeffbee 243 days ago

Ehh I tried to do some spatial stuff but there just wasn't enough there, or I could not figure out how to use it. Loading spatial information into ipython and fiddling with it is well-traveled and it doesn't seem to me that SQL is an inherently lower hurdle for the user.

▲WD-42 243 days ago

Uhoh, another pushover-licensed database. I wonder when it will begin it’s own redis saga.

▲jandrewrogers 243 days ago

I think geospatial analytics is important (because of course I would), but to be frank geospatial software has been stagnant for a long time. Every new thing is just a fresh spin on the same stagnant things we already have. This more or less says exactly this?

For geospatial analysis, the most important thing that could happen in software would be no longer treating it, either explicitly or implicitly, as having anything to do with cartography. Many use cases are not remotely map-driven but the tools require users to force everything through the lens of map-making.

▲dbreunig 243 days ago

I was struck by this as people suggest alternatives that refute the headline (QGIS, PostGIS, GDAL, etc): nearly every one emerged in the early 2000s.

Strongly agree with your sentiment around maps: most people can’t read them, they color the entire workflow and make it more complex, and (imo) lead to a general undervaluing of the geospatial field. Getting the data into columns means it’s usable by every department.

▲azinman2 243 days ago

Can you give some examples?

▲jandrewrogers 243 days ago

Of the stagnation? I’ve been doing geospatial analytics for over 20 years and shockingly little has changed, both in features and capability. Given the amount of time that has passed and the vastly expanded scope of the geospatial data models people are working with today, I think most people would expect more to have changed.

▲azinman2 243 days ago

No I meant to this: “Many use cases are not remotely map-driven but the tools require users to force everything through the lens of map-making.”

▲cyanydeez 243 days ago

No. QGIS is.

Good god.

▲dbreunig 243 days ago

Author here.

QGIS is amazing. It's really great. It also came out in 2002, so I think the headline is safe.

▲cyanydeez 243 days ago

Nope, it's constantly being improved and still wins the decade. Do you disqualify it because it existed? Re-read your headline, it's definitely not qualifying what you think it's qualifying.

▲yoyopa 243 days ago

[dead]

▲fithisux 243 days ago

I agree