405 points by PolarizedPoutin 13 days ago | 37 comments
ZeroCool2u 13 days ago
I've done a good amount of geospatial analysis for work.

One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful about understanding which coordinate reference system (CRS) and for visualization which projection is being used. The CRS is somewhat paranoia inducing if you don't have great infrastructure setup with the right tools to carry that metadata with your geospatial data.

I've tested everything AWS has to offer, Postgres/PostGIS, Spark/DB, Snowflake, Trino, and ArcGIS.

I'm convinced the best tool for large-scale geospatial work is Google BigQuery and it's not even close. It took an expensive multi-hour query running on PostGIS, deployed on an enormous m6a EC2 instance, down to less than 5 seconds in the BigQuery free tier. It does make sense if you think about it, Google was early with Maps, but it is just stunning how much better they are in this specific niche domain.

This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have geospatial data types and the open source packages that add support leave a lot to be desired.

This guy is running on-prem, so maybe this made sense, but I just would never bother. The storage for BQ would probably be less than $100/month for 20 TB.

ingenieroariel 13 days ago
I went through a similar phase with a pipeline that started from global OSM and Whosonfirst data. Google costs kept going up ($7k a month with Airflow + BigQuery) and I was able to replace it with a one-time $7k hardware purchase. We were able to do it since the process was using H3 indices early on and the resulting intermediate datasets all fit in RAM.

The system is a Mac Studio with 128 GB + Asahi Linux + mmapped Parquet files and DuckDB. It also runs Airflow for us, and with Nix it can be used to accelerate developer builds and run the Airflow tasks for the data team.

GCP is nice when it is free/cheap but they keep tabs on what you are doing and may surprise you at any point in time with ever higher bills without higher usage.

nojvek 13 days ago
DuckDB is the real magic. On an nvme disk with decent amount of RAM, it goes brrrrrr.

I would love it if somehow Postgres got duckdb powered columnstore tables.

I know hydra.so is doing columnstores.

DuckDB being able to query parquet files directly is a big win IMO.

I wish we could bulk insert parquet files into stock PG.
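
For anyone who hasn't tried it, the Parquet query really is a one-liner, and DuckDB's postgres extension gets most of the way to the bulk-insert wish. A rough sketch (file, column, and connection strings are made up, and I'm writing the ATTACH syntax from memory, so double-check it):

    import duckdb

    con = duckdb.connect()

    # Query a Parquet file in place; no import step needed.
    # Filename and column name are placeholders.
    print(con.sql(
        "SELECT count(*), avg(t2m) FROM 'era5_t2m_1940.parquet'"
    ).fetchall())

    # With the postgres extension, the same rows can be pushed into a
    # stock Postgres table in one statement.
    con.sql("INSTALL postgres")
    con.sql("LOAD postgres")
    con.sql("ATTACH 'dbname=weather host=localhost' AS pg (TYPE POSTGRES)")
    con.sql("CREATE TABLE pg.era5_t2m AS SELECT * FROM 'era5_t2m_1940.parquet'")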

jfim 13 days ago
BigQuery is nice but it's definitely a major foot-gun in terms of cost. It's surprisingly easy to rack up high costs with say a misconfigured dashboard or a developer just testing stuff.
mrgaro 12 days ago
Definitely agree here. Once the data is in BigQuery, people will start doing ad-hoc queries and building Grafana dashboards on top of it.

And sooner or later (usually sooner) somebody will build a fancy Grafana dashboard and set it to refresh every 5 seconds, and you will not notice it until it's too late.

dekhn 13 days ago
Frankly I think this is just a sign that it's a power tool for power users.
carlhjerpe 13 days ago
Sadly my colleagues aren't always "power users"
brailsafe 13 days ago
Nobody starts as a power user
ZeroCool2u 13 days ago
That is a very cool setup!

My org would never allow that as we're in a highly regulated and security conscious space.

Totally agree about the BQ costs. The free tier is great and I think pretty generous, but if you're not very careful with enforcing table creation only with partitioning and clustering as much as possible, and don't enforce some training for devs on how to deal with columnar DB's if they're not familiar, the bills can get pretty crazy quickly.

fikama 12 days ago
You made me curious. Since you are using Linux, why Mac and not PC? Wouldn't PC be cheaper? Or were there other factors?
zamadatix 11 days ago
My stab: a Mac Studio will have 400 GB/s or 800 GB/s of memory bandwidth. Not that you can't get there on x86, e.g. a 12-channel Epyc Genoa setup can do 460 GB/s, or 920 GB/s total when doubled up, but now you're talking about buying 2 latest-gen Epycs and 24 high-speed DIMMs to get the raw bandwidth back, all while ignoring that the access pattern is a bit different.

Curious to see if there were other reasons.

hawk_ 13 days ago
> and may surprise you at any point in time with ever higher bills without higher usage.

What? really? Do they change your pricing plan? How can they charge more for the same usage?

ingenieroariel 13 days ago
When you queried their 'Open Data' datasets and joined them with your own it was absurdly cheap for some time. Granted, we used our hacking skills to make sure the really big queries ran in the free tier and only smaller datasets went into the private tables.

I kept getting emails about small changes, and the bills got bigger all over the place, including in BigQuery and in how they dealt with queries on public datasets. The bill got higher.

There is a non-zero chance I conflated things. But from my point of view: I created a system and left it running for years; afterwards the bills got higher out of the blue and I moved off.

johnymontana 13 days ago
> Spark doesn't even have geospatial data types and the open source packages that add support leave a lot to be desired.

Could you say more about this? I'm curious if you've compared Apache Sedona [0] and what specifically you found lacking? I currently work at Wherobots [1], founded by the creators of Apache Sedona and would love to hear any feedback.

[0] https://sedona.apache.org/latest/

[1] https://wherobots.com/

xyzzy_plugh 13 days ago
Do you mind linking the specific dataset? I agree very wide columns break a lot of tools but other columnar postgres forks should support this no problem. It sounds like you didn't use Redshift, which I find surprising as it directly competes with BQ. Redshift has "super" columns that can be very large, even larger than the maximum supported by BigQuery.

I constantly see folks finding out the hard way that PostGIS is really hard to beat. The fact that Trino/Presto and Spark have languished here is particularly telling.

ZeroCool2u 13 days ago
It's FEMA's NFHL. I can't recall the specific layer of the GDB file, but you could probably figure it out. Try loading up Iowa into Redshift, and if that works for you I'd be quite surprised.

My org has a very large AWS spend and we got to have a chat with some of their SWEs who work on the geospatial processing features for Redshift and Athena. We described what we needed and they said our only option was to aggregate the data first or drop the offending rows. Obviously we're not interested in compromising our work just to use a specific tool, so we opted for better tools.

The crux of the issue was that the problem column was the geometry itself, specifically MultiPolygon. You need to use the GEOMETRY datatype for this[1]. However, our MultiPolygon values were tens to hundreds of MBs, well outside the max size for the SUPER datatype from what I can tell, as it looks like that's 16 MB.

[1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryType-...

beeboobaa3 13 days ago
Just split it out into multiple polygons, one per row. If you're using a relational database, do as relational databases do.
xyzzy_plugh 13 days ago
The description you provide appears too large for even BigQuery.
efxhoy 12 days ago
We just ported our ~500 GB data warehouse from an 8-core Postgres RDS instance to BigQuery. Rebuilds went from 5 hours to 11 minutes and costs are about the same or lower. And that’s with caching of some very big tables in Postgres and rebuilding from scratch in BQ.

I much prefer postgres as a tool I can run myself but the order of magnitude performance benefits are hard to argue with.

valyala 11 days ago
Did you try ClickHouse? It should provide data ingestion and querying performance comparable to Google BigQuery, while it should cost less, since you can run it on your hardware.
efxhoy 11 days ago
No unfortunately not. We tried out snowflake and bigquery and landed on bigquery mainly because of how well it integrated with existing company workflows (a looot of google sheets floating around the org).
Cthulhu_ 13 days ago
I hear so many good things about BigTable / BigQuery; it's a shame I've had no opportunity to use them yet.
detourdog 13 days ago
I’m glad to hear this first-hand experience. I’m pretty sure I want to build and refine my own geospatial data hoard.

I wonder if you think a long-term project is better off rolling its own. The second priority is that I expect it all to be locally hosted.

Thanks for any considerations.

ZeroCool2u 13 days ago
Frankly I think for long term hoarding BQ is hard to beat. The storage costs are pretty reasonable and you never pay for compute until you actually run a query, so if you're mostly just hoarding, well, you're probably going to save a lot of time, money, and effort in the long run.
rtkwe 13 days ago
Even more fun I bet would be if you're getting data with different reference spheroids.
PolarizedPoutin 13 days ago
Thank you for the insights! Yeah I'm still not sure how Postgres/PostGIS will scale for me, but good to know that BigQuery does this nicely.

This is not something I'm productionizing (at least not yet?) and I'm giving myself zero budget since it's a side project, but if/when I do I'll definitely look into BigQuery!

ZeroCool2u 13 days ago
Good luck! They have some really great tutorials on how to get started with BQ and geospatial data. One other nuance of BigQuery that doesn't seem to apply to many other tools in this space is that you can enable partitioning on your tables in addition to clustering on the geometry (Geography in BQ) column.

https://cloud.google.com/bigquery/docs/geospatial-data#parti...
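
For reference, a sketch of what that kind of table looks like via the Python client (dataset, table, and column names are placeholders; it assumes a configured GCP project and an existing dataset):

    from google.cloud import bigquery

    # Date-partitioned and clustered on the GEOGRAPHY column, so spatial
    # filters and joins can prune clusters instead of scanning everything.
    ddl = """
    CREATE TABLE my_dataset.flood_zones (
      zone_id STRING,
      snapshot_date DATE,
      geom GEOGRAPHY
    )
    PARTITION BY snapshot_date
    CLUSTER BY geom
    """

    client = bigquery.Client()
    client.query(ddl).result()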

winrid 13 days ago
M6a is not even remotely enormous. Also were you using EBS?
ZeroCool2u 13 days ago
m6a.48xlarge: 192 vCPU & 768 GiB

If that's not a large node for you, well you're living in a different world from me. Yes to using EBS.

winrid 13 days ago
Thanks for the info. The issue is using EBS. If you used an instance with NVME drives it would probably have been faster than BQ (and you aren't billed per-query...). I would suggest an R or I4 class instance for this, m6 is not good for the money here.

You would just have to setup replication for backups, but this could just be rsync to EBS or some other replication solution depending on your database.

sgarland 13 days ago
Unless you’re CPU-bound (unlikely), the r-family is usually a better fit for RDBMS, IME. 8x RAM:vCPU ratio, compared to 4x for m-family.

Then there’s the x-family, which can go up to bonkers levels, like 4 TiB of RAM (and local NVMe).

As a sibling comment mentioned, though, if you can fit the data into local storage, that’s the way to do it. EBS latency, even with io2.blockexpress, simply cannot compete.

That said, if I did the math correctly based on the 71 GB/31 days footnote, you’re looking at about 2.15 PB to load the entire thing, so, uh, good luck.

ayewo 13 days ago
In your experience, is the r-family better than the c-family for running an RDBMS?
sgarland 13 days ago
Yes. Most RDBMS are memory-bound (indexes, connection overhead, buffer pool…), so the more, the better.

At my last job, I switched a large-ish (100K QPS at peak) m5 MySQL instance to a smaller (in AWS numbering) r6i that was cheaper, despite having the same amount of RAM and being generation newer. That, combined with careful tuning and testing, resulted in queries speeding up 20-40%, AND we then had plenty of room for vertical scaling again if necessary.

aPoCoMiLogin 13 days ago
Things have been changing recently in AWS, but a few years ago R-family instances had one of the fastest uplinks to EBS. For example, only the largest M4 instance (m4.16xlarge) has a 10 Gbps uplink, versus R5b, where it starts at 10 Gbps (for the lowest tier) and ends at 60 Gbps @ 260k IOPS. You can very easily choke EBS with a DB.

EDIT: only newer C instances have comparable uplink to EBS; C5 or C4 (and some C6) start from ~4.7 Gbps. Just compare the EBS bandwidth column in https://aws.amazon.com/ec2/instance-types/

sgarland 13 days ago
The other fun thing about AWS instances is that the network uplink speed isn’t always what’s advertised. There is of course the “Up to X Gbps” levels (30 minutes of rated speed guaranteed every 24 hours), but there are also other limits, like cross-region speed being capped at 50% of rated capacity.

This rarely matters until it does, like if you’re setting up an active-active DB across regions. Then it’s a fun surprise.

jamesgresql 13 days ago
This is super cool! I run DevRel @ Timescale, and I love seeing our community create well-written posts like this!

My initial reaction is that I think one of the reasons you're seeing a hypertable being slower is almost certainly that it creates an index on the timestamp column by default. You don't have an index on your standard table which lets it go faster.

You can use create_hypertable with create_default_indexes=>false to skip creating the index, or you can just drop the index before you ingest data. You'll eventually want that index - but it's best created after ingestion in a one-shot load like this.
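
Something like this, via psycopg (table and column names here are just placeholders for whatever the post uses):

    import psycopg

    with psycopg.connect("dbname=weather") as conn:
        # Skip the default time index so the bulk load isn't maintaining it.
        conn.execute(
            "SELECT create_hypertable('era5', 'time', create_default_indexes => FALSE)"
        )
        # ... COPY the data in here ...
        # Then build the index once, after ingestion.
        conn.execute("CREATE INDEX ON era5 (time DESC)")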

I'd also be interested in how the HDD you're reading data from is holding up in some of the highly parallel setups?

PolarizedPoutin 13 days ago
Thank you for reading and for your kind words!

Ah I did not know about the `create_default_indexes=>false` and that a time index is created by default for hypertables. I'll add a note to explain this! Also curious to benchmark inserting without the time index then creating it manually.

Even with 32 workers I think the HDD was fine. I did monitor disk usage through btop and the SSD that Postgres lived on seemed to be more of a bottleneck than the HDD. So my conclusion was that a faster SSD for Postgres would be a better investment than moving the data from HDD to SSD.

counters 13 days ago
Why?

Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy the dataset's structure unless you were solely and exclusively going to extract point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use case for, say, a point timeseries of surface temperature in the middle of the ocean!

The vast majority of research and operational applications of datasets like ERA5 are probably better suited by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases that one might care about.

[1]: https://cloud.google.com/storage/docs/public-datasets/era5

PolarizedPoutin 13 days ago
The main reason why was that it's a personal project and I wanted to do everything on my home server so that I wouldn't have to pay for cloud resources, and so that I could learn Postgres, TimescaleDB, and eventually PostGIS.

But as rabernat pointed out in his comment, pulling out a long time series from the cloud replica is also slow. And I know I eventually want to perform complex spatio-temporal queries, e.g. computing the 99th percentile of summer temperatures in Chile from 1940-1980.

I don't doubt that a cloud replica can be faster, but it's at odds with my budget of $0 haha.

roter 13 days ago
I too need to do percentiles. One option is to loop through the grids but bin/histogram the values. You'll get a really good 99th percentile from a 1 Kelvin bin width.
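
Roughly, with NumPy (the bin range, names, and chunking are arbitrary; accumulating the histogram chunk by chunk keeps memory flat):

    import numpy as np

    def approx_percentile(chunks, q=99.0, t_min=180.0, t_max=340.0, width=1.0):
        """Approximate the q-th percentile of temperatures (Kelvin) from a
        fixed-bin histogram accumulated over data chunks."""
        edges = np.arange(t_min, t_max + width, width)
        counts = np.zeros(len(edges) - 1, dtype=np.int64)
        for chunk in chunks:                    # e.g. one file or grid slice at a time
            counts += np.histogram(chunk, bins=edges)[0]
        cdf = np.cumsum(counts) / counts.sum()
        idx = np.searchsorted(cdf, q / 100.0)
        return edges[idx + 1]                   # upper bin edge, i.e. within ~1 K

    # usage sketch: chunks could be arrays streamed out of NetCDF files
    # print(approx_percentile(summer_temp_chunks, q=99))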

Also, I've found the diurnal profile from ERA5 analysis can be abysmal in some locations. ERA5-Land is much better, high resolution, though only available over... er... land.

To your point about not relying on cloud. Noted in the Google option [1] link above:

> Update Frequency: The ERA5 dataset is currently not refreshed in the Google Cloud Public Dataset Program. The program provides ERA5 data spanning from 1940 to May 2023.

Another alternative, Amazon [2], also deprecated:

> The provider of this dataset will no longer maintain this dataset. We are open to talking with anyone else who might be willing to provide this dataset to the community.

[2] https://registry.opendata.aws/ecmwf-era5/

PolarizedPoutin 12 days ago
Thanks for pointing out the status of ERA5 on GCP and AWS.

I'll have to try the binning and compare it with listing all the values and finding the 99th percentile! But yeah the spatial part of that query might be the tougher part.

And yeah I also considered whether to use ERA5 or ERA5-Land. I don't remember if ERA5-Land was missing a variable or I just wanted to look at some ocean grid cells, but I decided to start with vanilla ERA5.

rabernat 13 days ago
True, but in fact, the Google ERA5 public data suffers from the exact chunking problem described in the post: it's optimized for spatial queries, not timeseries queries. I just ran a benchmark, and it took me 20 minutes to pull a timeseries of a single variable at a single point!

This highlights the need for timeseries-optimized chunking if that is your anticipated usage pattern.
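
To make the access pattern concrete, a sketch with xarray (the store path, variable, and dimension names are hypothetical; reading gs:// paths needs gcsfs, and for anything ERA5-sized you'd use a proper rechunking tool rather than a naive to_zarr - this just shows the target layout):

    import xarray as xr

    ds = xr.open_zarr("gs://some-bucket/era5.zarr")

    # A single-point timeseries touches one (lat, lon) but *every* time chunk.
    # With spatially-contiguous chunks (a few hours x the whole globe) that is
    # tens of thousands of chunk reads; with timeseries chunks it's a handful.
    ts = ds["2m_temperature"].sel(latitude=-29.85, longitude=31.05, method="nearest")
    ts.load()

    # Rechunking for timeseries access: whole time axis, small spatial tiles.
    rechunked = ds.chunk({"time": -1, "latitude": 8, "longitude": 8})
    for name in rechunked.variables:
        rechunked[name].encoding.pop("chunks", None)  # drop the source chunk layout
    rechunked.to_zarr("era5_timeseries.zarr")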

bitschubser_ 12 days ago
A good source for ERA5 historical data is https://open-meteo.com/en/docs/historical-weather-api (not affiliated, just a happy user). You can also run Open-Meteo locally; it's quite fast for spatial and timeseries queries.
open-meteo 12 days ago
Creator of Open-Meteo here. There is a small tutorial on setting up ERA5 locally: https://github.com/open-meteo/open-data/tree/main/tutorial_d...

Under the hood Open-Meteo is using a custom file format with time-series chunking and specialised compression for low-frequency weather data. General purpose time-series databases do not even get close to this setup.

orhmeh09 13 days ago
That might be nice if someone would do it and teach others to use it. Some labs have an RDBMS-based pipeline with published algorithms and data that nobody wants to try to reimplement (and which nobody would be paid to do). About the best improvement we could get was moving from an ancient version of MySQL to Postgres + PostGIS. I think Timescale would have helped. There were other reasons also to run locally due to privacy, cluster access, funds etc.
rabernat 13 days ago
Great post! Hi Ali!

I think what's missing here is an analysis of what is gained by moving the weather data into a RDBMS. The motivation is to speed up queries. But what's the baseline?

As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low latency performance (sub second) for timeseries queries on weather data--much faster than the 30 minutes cited here--_if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.

PolarizedPoutin 13 days ago
Hey Ryan and thank you for the feedback!

I agree that storing the data in appropriately chunked Zarr files is almost surely going to be faster, simpler to set up, and take up less space. You could even put an API in front of it to get "queries".

I also agree that I haven't motivated the RDBMS approach much. This is mainly because I took this approach with Postgres + Timescale since I wanted to learn to work with them, and playing around with ERA5 data seemed like the most fun way. Maybe it's the allure of weather data being big enough to pose a challenge here.

I don't have anything to back this up but I wonder if the RDBMS approach, with properly tuned and indexed TimescaleDB + PostGIS (non-trivial to set up), can speed up complex spatio-temporal queries, e.g. computing the 99th percentile of summer temperatures in Chile from 1940-1980, in case many different Zarr chunks have to be read to find this data. I like the idea of setting up different tables to cache these kinds of statistics, but it's not that hard to do with Zarr either.

I'm benchmarking queries and indexes next so I might know more then!

ohmahjong 13 days ago
This is a bit off-topic but I'm interested in the same space you are in.

There seems to be an inherent pull between large chunks (great for visualising large extents and larger queries) vs smaller chunks for point-based or timeseries queries. It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets. I have heard of "kerchunk" being used to try and get the best of both, but then I _think_ you lose out on the option of compressing the data and it introduces quite a lot of complexity.

What are your thoughts on how to strike that balance between use cases?

rabernat 13 days ago
> It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets.

Like all things in tech, it's about tradeoffs. S3 storage costs about $275 per TB per year. Typical weather datasets are ~10 TB. If you're running a business that uses weather data in operations to make money, you could easily afford to make 2-3 copies that are optimized for different query patterns. We see many teams doing this today in production. That's still much cheaper (and more flexible) than putting the same volume of data in an RDBMS, given the relative cost of S3 vs. persistent disks.

The real hidden costs of all of these solutions is the developer time operating the data pipelines for the transformation.

ohmahjong 12 days ago
That's a great point, it really is all about tradeoffs. In my use case there is strong motivation to keep data creation times low, so writing out multiple datasets comes at a product/opportunity cost moreso than a storage cost. Thanks for the insight.
postgresperf 13 days ago
Contributor to the PG bulk loading docs you referenced. Good survey of the techniques here. I've done a good bit of this trying to speed up loading the Open Street Map database. Presentation at https://www.youtube.com/watch?v=BCMnu7xay2Y for my last public update. Since then the advance of hardware, GIS improvements in PG15, and osm2pgsql adopting their middle-way-node-index-id-shift technique (makes the largest but rarely used index 1/32 the size) have gotten my time to load the planet set below 4 hours.

One suggestion aimed at the author here: some of your experiments are taking out WAL writing in a sort of indirect way, using pg_bulkload and COPY. There's one thing you could try that wasn't documented yet when my buddy Craig Ringer wrote the SO post you linked to: you can just turn off the WAL in the configuration. Yes, you will lose the tables in progress if there's a crash, and when things run for weeks those happen. With time scale data, it's not hard to structure the loading so you'll only lose the last chunk of work when that happens. WAL data isn't really necessary for bulk loading. Crash, clean up the right edge of the loaded data, start back up.

Here's the full set of postgresql.conf settings I run to disable the WAL and other overhead:

    wal_level = minimal
    max_wal_senders = 0
    synchronous_commit = off
    fsync = off
    full_page_writes = off
    autovacuum = off
    checkpoint_timeout = 60min

Finally, when loading in big chunks, to keep the vacuum work down I'd normally turn off autovac as above, then issue periodic VACUUM FREEZE commands running behind the currently loading date partition (talking normal PG here). That skips some of the work of the intermediate step the database normally frets about, where new transactions are written but not yet visible to everyone.
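
In Python with psycopg the load loop can look roughly like this (table and file names are placeholders, and real code should build identifiers more carefully than an f-string; it assumes the postgresql.conf settings above are already in place):

    import psycopg

    months = ["1940-01", "1940-02", "1940-03"]   # placeholder partition list

    # autocommit so VACUUM can run outside a transaction block
    with psycopg.connect("dbname=weather", autocommit=True) as conn:
        with conn.cursor() as cur:
            for month in months:
                table = f"era5_{month.replace('-', '_')}"
                with cur.copy(f"COPY {table} FROM STDIN (FORMAT CSV)") as copy:
                    with open(f"era5_{month}.csv", "rb") as f:
                        while data := f.read(1 << 20):
                            copy.write(data)
                # Freeze the partition we just finished, behind the loading
                # edge, so there's no big vacuum debt at the end (autovacuum
                # is off in the config above).
                cur.execute(f"VACUUM FREEZE {table}")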

PolarizedPoutin 12 days ago
Thank you for reading through and for your feedback! Excited to try your settings to disable the WAL and other overhead and see if I get even faster inserts.

Also glad to hear an expert say that WAL data isn't really necessary for bulk loading, especially with chunks. I should get through the ~20 days it takes to load data without a power outage haha (no UPS yet :() but it sounds like even in the worst case I can just resume.

kabes 12 days ago
Do you have more info on the GIS improvements in PG15?
postgresperf 12 days ago
There's a whole talk about it we had in our conference: https://www.youtube.com/watch?v=TG28lRoailE

Short version is GIS indexes are notably smaller and build faster in PG15 than earlier versions. It's a major version to version PG improvement for these workloads.

carderne 13 days ago
Hey OP (assuming you're the author), you might be interested in this similar experiment I did about four years ago, same dataset, same target, similar goal!

https://rdrn.me/optimising-sql/

Similar sequence of investigations, but using regular Postgres rather than Timescale. With my setup I got another ~3x speedup over COPY by copying binary data directly (assuming your data is already in memory).

PolarizedPoutin 13 days ago
Wish I saw this before I started haha! I left a footnote about why I didn't try binary copy (basically someone else found its performance disappointing) but it sounds like I should give it a try.

footnote: https://aliramadhan.me/2024/03/31/trillion-rows.html#fn:copy...

carderne 13 days ago
Yeah I imagine it depends where the data is coming from and what exactly it looks like (num fields, dtypes...?). What I did was source data -> Numpy Structured Array [0] -> Postgres binary [1]. Bit of a pain getting it into the required shape, but if you follow the links the code should get you going (sorry no type hints!).

[0] https://rdrn.me/optimising-sampling/#round-10-off-the-deep-e... [1] In the original blog I linked.

anentropic 13 days ago
I'd love to hear from anyone who's done the same in MySQL
PolarizedPoutin 12 days ago
Had a read through parts 1 and 2, thank you for the engaging reads! Love how you've formatted your posts with the margin notes too. Thank you for providing the function to write numpy structured arrays to Postgres binary, I couldn't figure this out before.
d416 13 days ago
“Is a relational database even appropriate for gridded weather data? No idea but we’ll find out.”

Love this. It’s the exact opposite of all the other ‘well actually’ mainstream tech posts and I am here for all of it. Props for keeping the reader fully engaged on the journey.

PolarizedPoutin 12 days ago
Haha thank you for reading and glad you found it engaging! Maybe it's the benefit of being a beginner and not having any skin in the game. I did a lot of searching but couldn't find any conclusive answers for my use case so figured I may as well run the benchmarks myself.
lawn 13 days ago
What an interesting post!

> At a sustained ~462k inserts per second, we’re waiting ~20 days for our ~754 billion rows which is not bad I guess. It’s less time than it took me to write this post.

Hah, as I've been gravitating more to writing larger and more in depth blog posts I can relate to the surprising effort it can require.

PolarizedPoutin 13 days ago
Thank you! Yeah haha some of the benchmarks took several hours (and a few re-runs) and there was a lot of learning done along the way.
RyanHamilton 13 days ago
If you want to plot time-series charts or many other charts directly from sql queries, qStudio is a free SQL IDE and works with everything including TimescaleDB: https://www.timestored.com/qstudio/database/timescale Disclaimer, I am the author.
ayewo 13 days ago
What's the process for adding support for other databases to your tool qStudio?

I'm thinking perhaps you could add support for Timeplus [1]? Timeplus is a streaming-first database built on ClickHouse. The core DB engine Timeplus Proton is open source [2].

It seems that qStudio is open source [3] and written in Java and will need a JDBC driver to add support for a new RDBMS? If yes, Timeplus Proton has an open source JDBC driver [4] based on ClickHouse's driver but with modifications added for streaming use cases.

1: https://www.timeplus.com/

2: https://github.com/timeplus-io/proton

3: https://github.com/timeseries/qstudio

4: https://github.com/timeplus-io/proton-java-driver

PolarizedPoutin 12 days ago
Thank you for the link! I've mostly used TablePlus (not free I think) and matplotlib (via psycopg3) to plot data returned by queries, but this looks like it'll be faster to use. I've only inserted data so far haha but will be querying and plotting soon.
roter 13 days ago
I too use the ERA5 reanalysis data and I too need quick time series. As the data comes in [lat, lon] grids, stacked by whatever period you've chosen, e.g. [month of hourly data, lat, lon], it becomes a massive matrix transpose problem if you want 20+ years.

What I do is download each netCDF file, transpose, and insert into a massive 3D HDF file organized as [lat, lon, hour]. On my workstation it takes about 30 minutes to create one year for one variable (no parallel I/O or processes) but then takes milliseconds to pull a single (lat, lon) location. Initial pain for long-term gain. Simplistic, but I'm just a climatologist not a database guru.
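
For anyone wanting to try the same trick, a rough h5py sketch (file and variable names, grid size, and chunk shape are assumptions, not exactly what I run):

    import h5py
    import numpy as np
    import xarray as xr

    NLAT, NLON = 721, 1440   # assumed 0.25-degree ERA5 grid

    with h5py.File("era5_t2m_by_point.h5", "w") as h5:
        # One chunk = one grid point x one year of hours, so a point read
        # only touches a handful of contiguous chunks.
        dset = h5.create_dataset(
            "t2m", shape=(NLAT, NLON, 0), maxshape=(NLAT, NLON, None),
            dtype="f4", chunks=(1, 1, 8760),
        )
        for path in ["era5_t2m_1940_01.nc", "era5_t2m_1940_02.nc"]:  # placeholders
            ds = xr.open_dataset(path)                           # dims [time, lat, lon]
            block = np.transpose(ds["t2m"].values, (1, 2, 0))    # -> [lat, lon, time]
            n = block.shape[2]
            dset.resize(dset.shape[2] + n, axis=2)
            dset[:, :, -n:] = block

    # Pulling one location's full timeseries is then just a few chunk reads:
    # h5py.File("era5_t2m_by_point.h5")["t2m"][120, 500, :]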

PolarizedPoutin 12 days ago
Haha simplistic but probably faster and more space-efficient than a relational database. Sounds like rabernat and open-meteo who commented here do something similar to you and find it fast as well!
gingerwizard 12 days ago
A few in this thread have suggested ClickHouse would do well here. We tested 1 trillion rows recently, albeit with much simpler data - https://clickhouse.com/blog/clickhouse-1-trillion-row-challe...

This is a good dataset though and the level of detail in the post is appreciated. I'll give ClickHouse a go on the same...

Disclaimer: I work for ClickHouse

PolarizedPoutin 12 days ago
Thanks for the link on the trillion row challenge, interesting read! I'm looking at queries and indexes next and I'm hoping to include Clickhouse in that comparison.
valyala 11 days ago
Could you also include VictoriaMetrics into the comparison?
hyperman1 13 days ago
Two remarks with postgres and lots of data:

1) I always wonder if there is a better way than COPY. I tend to quickly hit 100% CPU without saturating I/O.

2) The row overhead seems big. A row has 23 bytes of overhead and this table has 48 bytes of data per row, so even without page overhead, we lose ~1/3 of our storage. This is pure data storage, without any index.

PolarizedPoutin 13 days ago
1. Yeah to me it seems very hard to saturate I/O with Postgres unless maybe you insert into an unlogged table. I guess there's quite a bit of overhead to get all the nice stability/consistency and crash-resistance.

2. That is a good point. I'm hoping TimescaleDB's compression helps here but yeah I don't think you'll ever get the database size below the data's original footprint.

feike 13 days ago
Timescaler here. If you configure timescaledb.compress_segmentby well, and the data suits the compression, you can achieve 20x or more compression.

(On some metrics data internally, I have 98% reduction in size of the data).

One of the reasons this works is due to only having to pay the per-tuple overhead once per grouped row, which could cover as many as 1000 rows.

The other is the compression algorithm, which can be TimescaleDB or plain PostgreSQL TOAST

https://www.timescale.com/blog/time-series-compression-algor... https://www.postgresql.org/docs/current/storage-toast.html
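
A rough sketch of turning it on (table, column, and interval names are placeholders, not a recommendation for this particular dataset):

    import psycopg

    with psycopg.connect("dbname=weather", autocommit=True) as conn:
        # Segment compressed batches by location and order them by time,
        # so point-in-space, range-in-time queries stay cheap.
        conn.execute("""
            ALTER TABLE weather SET (
                timescaledb.compress,
                timescaledb.compress_segmentby = 'location_id',
                timescaledb.compress_orderby   = 'time DESC'
            )
        """)
        # Compress chunks once they are older than a week.
        conn.execute("SELECT add_compression_policy('weather', INTERVAL '7 days')")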

hyperman1 13 days ago
If I look into perf, it seems mostly parsing overhead. I can saturate a CREATE TABLE newtable AS SELECT FROM oldtable. Unfortunately, CSV still seems to be the lingua franca for transport between DBs. Maybe some day a more binary-oriented transport protocol will appear (e.g. Parquet?)
feike 13 days ago
Many libraries for python, Rust, golang support COPY BINARY.

The times I've tested it, the improvement was very small compared to plain COPY, or COPY with CSV, whereas it does require more work and thought upfront to ensure the binary format actually works correctly.

https://www.postgresql.org/docs/current/sql-copy.html
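
For reference, a psycopg3 sketch of binary COPY (table, columns, and types here are placeholders; binary format needs the types spelled out, and as noted above the speedup may not be worth it):

    from datetime import datetime, timezone
    import psycopg

    rows = [(datetime(1940, 1, 1, tzinfo=timezone.utc), -29.85, 31.05, 296.4)]

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        with cur.copy(
            "COPY weather (time, latitude, longitude, temperature) "
            "FROM STDIN (FORMAT BINARY)"
        ) as copy:
            # Binary COPY needs explicit types so psycopg knows how to
            # encode each column on the wire.
            copy.set_types(["timestamptz", "float8", "float8", "float8"])
            for row in rows:
                copy.write_row(row)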

to11mtm 13 days ago
As someone who used to do some GIS hacking in an office job[0] before I was a 'Software Developer/engineer' this is super cool.

[0] - Honestly some of the coolest stuff I ever got to do in its own right. Building tools that could move data between AutoCAD, Microstation, and Google Earth while also importing other bits with metadata from Trimble units[1]. Also it was probably the most I ever used math in my entire career [2], so there's that.

[1] - I wound up finding a custom font maker, and one of our folks made a font library with the symbols, made it easy to write a parser too :D

[2] - One of those PDFs I always seem to wind up having floating on a hard drive is the USGS 'Map Projections, a working manual'. At one point I used it as a reference to implement a C# library to handle transforms between coordinate systems... alas it was internal.

PolarizedPoutin 12 days ago
Thank you! I've relied on that USGS Projections manual multiple times haha. Working with satellite data some of it was in somewhat obscure projections and the manual always told you how to convert back to latitude-longitude.
semiquaver 13 days ago
Any idea why hypertable insert rates were slower? I thought hypertables were supposed to _increase_ insert rates?
perrygeo 13 days ago
Hypertable insert rates are faster and more predictable over time. Each individual insert might incur a small bit of extra overhead, but they scale forever since each temporal chunk is indexed separately vs a regular table where the entire index needs to fit in memory. This is a case where you can't make meaningful inferences from micro-benchmarks (they tested 20k rows, you probably need 200M to start seeing the diff)
PolarizedPoutin 13 days ago
Thanks for the insight! It is true that I started with a micro-benchmark of 20k rows for slower inserts, but I also did some longer benchmarks with ~772 million rows.
leros 13 days ago
Isn't data inserted into basically a normal Postgres table with hypertable extensions? I don't know the details of Timescale but that sounds like it would incur a cost of a normal Postgres insert, plus potentially extra work at insert time, plus extra work in the background to manage the hypertable.
rahkiin 13 days ago
Not entirely. A hypertable is a postgres table chunked over time. There is the assumption that most data and queries are time-relevant, but also that older data is less relevant than new data.

Indexes are per chunk. So if the query analyzer understands you only touch 2023 it can omit looking at any chunk that is from other years and keep those out of memory. Same with the indexes.

PolarizedPoutin 13 days ago
Yeah I'm curious about this too. Been meaning to ask on the Timescale forums. My only guess is that there's some small extra overhead due to hypertable chunking.

I know Timescale has a blog post from 2017 claiming a 20x higher insert rate but that's for inserting into a table with an index. The general wisdom for loading huge amounts of data seems to be that you should insert into a table with no indexes then build them later though. So with no index, inserting into a hypertable seems a bit slower.

Timescale blog post: https://medium.com/timescale/timescaledb-vs-6a696248104e

h4kor 13 days ago
Timescale hypertables automatically have an index on the timestamp. To make this more comparable you could create the same index on the normal table and test the ingestion rate.
tmiku 13 days ago
> I think it would be cool to have historical weather data from around the world to analyze for signals of climate change we’ve already had rather than think about potential future change.

This is a very good instinct! A pretty major portion of modern climate science is paleoclimatology, with a goal of reaching far beyond reliable human measurements. A lot of earth's previous climate states were wildly different from the range of conditions we have experienced in the past 10,000 years, and a better climate record is essential to predicting the effects of massive carbon emission.

Ice cores from Antarctica/Greenland are the most famous instance of this, but there's a lot of other cool ones - there are chemical records of climate change in cave stalactites, ocean floor sediments, etc.

PolarizedPoutin 12 days ago
Thank you for mentioning paleoclimatology! Not sure if what I'm doing with ERA5 data is that rigorous haha, but one of my favorite plots is https://commons.wikimedia.org/wiki/File:All_palaeotemps.svg (still have it hung up in my office haha)
randrus 13 days ago
Somewhat relevant (which I posted a while back):

https://climate.metoffice.cloud/

sammy2255 13 days ago
Clickhouse will eat this for breakfast. And has built-in compression even at the column level
PolarizedPoutin 12 days ago
Hoping to find out when I compare TimescaleDB vs. Clickhouse!
valyala 11 days ago
Very well written article! I remember the time when I was tuning PostgreSQL to achieve the maximum possible ingestion speed for ad analytics data. I ended up with the following recommendations - https://gist.github.com/valyala/ae3cbfa4104f1a022a2af9b8656b... .

After that I discovered ClickHouse and forgot about all this performance tuning, since ClickHouse could ingest the same ad analytics data at 10x higher speed on the same hardware with default settings. On top of this, it could perform analytical queries up to 1000 times faster than PostgreSQL on the same hardware. And, as a bonus, disk space usage dropped by more than 10 times compared to PostgreSQL.

My prediction: if the ingestion of 750 billion rows takes 20 days on highly tuned TimescaleDB and Postgresql, it will take less than a day on Clickhouse with standard settings on the same hardware.

P.S. I was conducting a similar benchmark on a single-node VictoriaMetrics a few years ago - it was capable of ingesting 500 billion samples in ~2 hours - https://valyala.medium.com/billy-how-victoriametrics-deals-w... . Every sample contains a single measurement, while every row in the original article contains 5 measurements. So, extrapolating, it would take `(2h * 5)/500*750 = 15h` to ingest 750 billion rows of weather measurements into a single-node VictoriaMetrics.

PolarizedPoutin 11 days ago
Thank you for reading and for the links!

I'm trying out Clickhouse for the next post. Definitely excited for sub 24 hour data loading!

I haven't heard of VictoriaMetrics but that's some impressive performance. Will check it out!

tonymet 13 days ago
I encourage people to look into the ERA5 dataset's provenance, especially when you approach the observations made toward the "pre-industrial date" of 1850.

Remember that modern global surface temperatures are collected by satellites, and the dataset is commingled with recordings observed visually and made by hand, using buckets, by sailors who were not primarily academic researchers. Segments of high-resolution, low-noise data (satellites) are mixed with low-resolution, low-coverage, high-noise records (hand records made on boats around the United Kingdom).

My point is to be in awe of the technical aspects of this effort but also keep in mind that we are only making copies of low resolution, noisy manuscripts from sailors 170 years ago.

shoyer 13 days ago
ERA5 covers 1940 to present. That's well before the satellite era (and the earlier data absolutely has more quality issues) but there's nothing from 170 years ago.
tonymet 13 days ago
Similar noise issues apply. Most of the other surface temp models have to cover 1850
relaxing 13 days ago
Ok? And what’s your point in pointing that out?
tonymet 13 days ago
The data is noisy so be careful when using it for research. Always account for the provenance of the records when working with "data".
relaxing 13 days ago
So like basically every data science effort.
tonymet 13 days ago
[flagged]
tonymet 13 days ago
One of the project's goals was to load the data and make predictions. The page covered the data loading part, but not the methods and error tolerance of the predictions.
hendiatris 13 days ago
If you’re going to work with weather data, use a columnar database like BigQuery. If you set things up right, your performance will generally be a few seconds for aggregation queries. I set up a data platform like this at my previous company and we were able to vastly outperform our competitors, and at a much lower cost.

The great thing about this data is it is generally append only, unless errors are found in earlier data sets. But it’s something that usually only happens once a year if at all.

PolarizedPoutin 12 days ago
I'm hoping to compare TimescaleDB and Clickhouse to see how big the difference is for different queries! My impression is that TimescaleDB gives you some columnar features, but maybe Clickhouse is a true columnar database.
islandert 13 days ago
If you don't have access to COPY because the Postgres instance is managed, I've had a lot of luck with encoding a batch of rows as a JSON string, sending the string as a single query parameter, and using `json_to_recordset` to turn the JSON back into a set of rows in the DB.

I haven't compared how this performs against using a low-level SQL library, but it outperforms everything else I've tried in SQLAlchemy.
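
A rough sketch of the trick with psycopg (table, columns, and types are placeholders):

    import json
    import psycopg

    batch = [
        {"time": "1940-01-01T00:00:00+00:00", "lat": -29.85, "lon": 31.05, "t2m": 296.4},
        {"time": "1940-01-01T01:00:00+00:00", "lat": -29.85, "lon": 31.05, "t2m": 296.1},
    ]

    sql = """
        INSERT INTO weather (time, latitude, longitude, temperature)
        SELECT time, lat, lon, t2m
        FROM json_to_recordset(%s::json)
            AS r(time timestamptz, lat float4, lon float4, t2m float4)
    """

    with psycopg.connect("dbname=weather") as conn:
        # One parameter, one round trip, however many rows the batch holds.
        conn.execute(sql, [json.dumps(batch)])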

twoodfin 13 days ago
When I see suggestions like this (totally reasonable hack!), I do have to wonder what happened to JDBC’s “addBatch()/executeBatch()”, introduced over 25 years ago.

https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedS...

Did modern APIs & protocols simply fail to carry this evolutionary strand forward?

robertlagrant 12 days ago
> I think it would be cool to have historical weather data from around the world to analyze for signals of climate change we’ve already had rather than think about potential future change.

This is querying a model rather than real observations, so it may more reflect the model's understanding of the world. Still useful; just worth noting I think.

defrost 12 days ago
Fair point, also worth noting is:

    The data is output from a climate model run that is constrained to match weather observations.

    So where we have lots of weather observations, ERA5 should match it closely.

    And where we do not have any weather observations, ERA5 will be physically consistent and should match the climatology, i.e. the simulated weather’s statistics should match reality.

    ERA5 covers the entire globe at 0.25 degree resolution, and stretches back in time to 1940 with hourly resolution. 
For the time frame (post 1940) there are likely to be actual weather data points twice a day at almost any point on the globe with cities and/or ports that colonial expansion reached.

What I'd like to see in such tools are two things:

* Indications of actual observed data points (location, time) ... so that the observed point mesh around the globe for a day in 1949 can be seen, and

* Interpolation for a modern day (with dense observed data points) from a reduced set of sparse points (just the ones, say, from that day in 1949) to get a sense of how the infill from sparse points matches the observed but unused points.

I'm not a climate skeptic in the least; global estimation of a coarse global mean from relatively sparse points is fine in my experience (I've done a lot of exploration geophysics for mineral | energy resource mapping) and a bit of interpolated wiggle in the model doesn't blow up the big coarse estimation ... but this question of interpolation from sparse data is one that always gets raised by skeptics.

PolarizedPoutin 12 days ago
The data is definitely sparse in the earlier decades, especially over unpopulated areas and in developing nations, due to the lack of regularly recorded weather observations (sensors, balloons, etc.). Especially since weather satellites started collecting global data, coverage has been better though. And weather observing stations have proliferated since too.

Totally agree that it would be cool to see when and where the weather observations are that are informing the climate model. A quick search didn't turn up this dataset, but it may be somewhere. I'll see if I can find this data.

PolarizedPoutin 12 days ago
Yes you are correct of course. ERA5 is climate model output constrained to match weather observations, not actual observations, which I do note in the post.
user3939382 12 days ago
> 727,080 snapshots in time for each variable like temperature

Maybe we should be recording and transmitting the parameters for a function or set of functions that describe these temperature (etc.) changes, rather than a set of points, to drastically reduce the size of these datasets.

PolarizedPoutin 12 days ago
Are you suggesting compressing the data using something like Fourier series or wavelets? I know some of the variables have sharp gradients, especially variables like precipitation, which probably wouldn't compress super well. And some applications of the ERA5 dataset, e.g. closing heat or moisture budgets by accounting for every bit of it, cannot be done if the data has been compressed this way since you lose some information. Curious if this is what you had in mind or something else.
koliber 13 days ago
Curious if he could squeeze more performance by using a different structure to store the same data. Some of these float4 cols could probably be stored as int2. Depending on how many decimal places are needed, you can divide the int to get the resulting floating-point value.
PolarizedPoutin 12 days ago
Yeah I was thinking about this and hoped that Postgres had a `float2` data type but `int2` would have to work. I could scale the numbers to fit them into 2 bytes with minimal loss of precision, but decided I'd rather take the storage space hit since TimescaleDB promises good compression. Still haven't measured this though haha.
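
For what it's worth, a sketch of the scaled-int2 idea with NumPy (the 273.15 offset and 0.01 K step are arbitrary choices, not anything from the post):

    import numpy as np

    # Store 2 m temperature (Kelvin, float4) as int2 centi-degrees Celsius:
    # the int16 range covers -327.68..327.67 degC at 0.01-degree resolution.
    def encode_t2m(temp_k: np.ndarray) -> np.ndarray:
        return np.round((temp_k - 273.15) * 100).astype(np.int16)

    def decode_t2m(temp_i2: np.ndarray) -> np.ndarray:
        return temp_i2.astype(np.float32) / 100 + 273.15

    t = np.array([248.2, 296.45], dtype=np.float32)
    assert np.allclose(decode_t2m(encode_t2m(t)), t, atol=0.005)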
curious_cat_163 13 days ago
> The data is output from a climate model run that is constrained to match weather observations.

That's interesting. Why store it? Why not compute it as needed using the model?

FWIW, I am not an expert in this space and if someone is, it would be good to understand it.

PolarizedPoutin 12 days ago
It's a good question! It's true that the output is massive, I believe ~9 petabytes and growing. But running the model is super expensive. It runs on ECMWF's supercomputer. Not sure how many cores but I would guesstimate in the 10k - 100k core range just for one instance. So computing the data on the fly is super expensive, probably much more expensive than just storing it.
dunefox 13 days ago
OT: Does anyone know if DuckDB would be of use here?
wiredfool 13 days ago
Not likely, unless it's a set of parquet files already. Clickhouse would be a better bet.
smellybigbelly 13 days ago
Can anyone give some advice on how they run TimeScale in Kubernetes? I’m seeing they dropped support for their Helm chart.
jamesgresql 13 days ago
Hi, we updated our docs with the best options.

https://docs.timescale.com/self-hosted/latest/install/instal...

I'd personally recommend StackGres, it's a great project.

rkwasny 13 days ago
Yeah, don't use TimescaleDB, use ClickHouse - I have 10 years of NOAA climate data on my desktop that I query when I want to go on holidays ;-)
mfreed 13 days ago
Our experience is that Clickhouse and Timescale are designed for different workloads, and that Timescale is optimized for many of the time-series workloads people use in production:

- https://www.timescale.com/blog/what-is-clickhouse-how-does-i...

Sidenote: Timescale _does_ provide columnar storage. I don't believe that the blog author focused on this as part of insert benchmarks:

- Timescale columnar storage: https://www.timescale.com/blog/building-columnar-compression...

- Timescale query vectorization: https://www.timescale.com/blog/teaching-postgres-new-tricks-...

rkwasny 13 days ago
Well, as a Co-founder and CTO of Timescale, would you say TimescaleDB is a good fit for storing weather data as OP does?
mfreed 13 days ago
TimescaleDB primarily serves operational use cases: Developers building products on top of live data, where you are regularly streaming in fresh data, and you often know what many queries look like a priori, because those are powering your live APIs, dashboards, and product experience.

That's different from a data warehouse or many traditional "OLAP" use cases, where you might dump a big dataset statically, and then people will occasionally do ad-hoc queries against it. This is the big weather dataset file sitting on your desktop that you occasionally query while on holidays.

So it's less about "can you store weather data", but what does that use case look like? How are the queries shaped? Are you saving a single dataset for ad-hoc queries across the entire dataset, or continuously streaming in new data, and aging out or de-prioritizing old data?

In most of the products we serve, customers are often interested in recent data in a very granular format ("shallow and wide"), or longer historical queries along a well defined axis ("deep and narrow").

For example, this is where the benefits of TimescaleDB's segmented columnar compression emerges. It optimizes for those queries which are very common in your application, e.g., an IoT application that groups by or selected by deviceID, crypto/fintech analysis based on the ticker symbol, product analytics based on tenantID, etc.

If you look at ClickBench, what most of the queries do is: scan ALL the data in your database and GROUP BY one of the 100 columns in the web analytics logs.

- https://github.com/ClickHouse/ClickBench/blob/main/clickhous...

There are almost no time-predicates in the benchmark that Clickhouse created, but perhaps that is not surprising given it was designed for ad-hoc weblog analytics at Yandex.

So yes, Timescale serves many products today that use weather data, but has made different choices than Clickhouse (or things like DuckDB, pg_analytics, etc) to serve those more operational use cases.

dangoodmanUT 13 days ago
Agreed, clickhouse is faster and has better features for this
RyanHamilton 13 days ago
I agree. Clickhouse is awesomely powerful and fast. I maintain a list of benchmarks; if you know of any speed comparisons, please let me know and I will add them to the list: https://www.timestored.com/data/time-series-database-benchma...
lyapunova 13 days ago
Thanks for maintaining benchmarks here. Is there a github repo that might accompany the benchmarks that I could take a look at / reproduce?
anentropic 13 days ago
Do you say that because its quicker to insert large batches of rows into Clickhouse, or because it's better in other ways?

(I'm currently inserting large batches of rows into MySQL and curious about Clickhouse...)

rkwasny 13 days ago
It's better in insert speed, query speed and used disk storage.
gonzo41 13 days ago
So ClickHouse is a column DB. Any thoughts on whether the performance would be a wash if you just pivoted the Timescale hypertable and indexed the time + column on Timescale?
PolarizedPoutin 13 days ago
Haha very cool use! Yeah reading up on TimescaleDB vs. Clickhouse it seems like columnar storage and Clickhouse will be faster and better compress the time series data. For now I'm sticking to TimescaleDB to learn Postgres and PostGIS, but might make a TimescaleDB vs. Clickhouse comparison when I switch!
rkwasny 13 days ago
I can replicate your benchmark when I get a moment. The data, is it free to share if I wanted to make an open browser?

I have a feeling the general public has very limited access to weather data, and graphs in the news that ask "are we getting hotter or colder" are all sensational and hard to verify.

cevian 13 days ago
Please note that TimescaleDB also uses columnar storage for its compressed data.

Disclosure: I am a TimescaleDB engineer.

nojvek 13 days ago
This should be a benchmark.

Could someone point me to where I can download the whole dataset?

PolarizedPoutin 12 days ago
The full dataset is quite huge (~9 petabytes and growing) out of which I'm using just ~8 terabytes.

The data is freely available from the Climate Change Service [1] which has a nice API but download speeds can be a bit slow.

NCAR's Research Data Archive [2] provides some of the data (as pre-generated NetCDF files) but at higher download speeds.

It's not super well documented but I hosted the Python scripts I used to download the data on the accompanying GitHub repository [3].

[1]: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...

[2]: https://rda.ucar.edu/datasets/ds633-0/

[3]: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...

dcreater 12 days ago
Nice! Can you upload the data to hf, oxen.ai, kaggle or something?
PolarizedPoutin 12 days ago
The full dataset is quite huge (~9 petabytes and growing) out of which I'm using just ~8 terabytes. Still quite big to upload.

The data is freely available from the Climate Change Service [1] which has a nice API but download speeds can be a bit slow.

NCAR's Research Data Archive [2] provides some of the data (as pre-generated NetCDF files) but at higher download speeds.

It's not super well documented but I hosted the Python scripts I used to download the data on the accompanying GitHub repository [3].

[1]: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...

[2]: https://rda.ucar.edu/datasets/ds633-0/

[3]: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...

Aeroi 13 days ago
What did Google use to train GraphCast?
PolarizedPoutin 12 days ago
GraphCast was trained on this exact same data!

From https://deepmind.google/discover/blog/graphcast-ai-model-for...

> Crucially, GraphCast and traditional approaches go hand-in-hand: we trained GraphCast on four decades of weather reanalysis data, from the ECMWF’s ERA5 dataset.

sigmonsays 13 days ago
where can I download the weather data?

Is it free or available if I sign up?

roter 13 days ago
If I read correctly, they are using the ERA5 dataset [0]. It is freely available and can be downloaded through a Python API called cdsapi.

[0] https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...
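
A minimal request looks roughly like this (the variable and date ranges are just an example, and the request keys may differ slightly on the newer CDS, so treat it as a sketch; it needs a free CDS account and an API key in ~/.cdsapirc):

    import cdsapi

    c = cdsapi.Client()
    c.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": ["2m_temperature"],
            "year": "1940",
            "month": "01",
            "day": [f"{d:02d}" for d in range(1, 32)],
            "time": [f"{h:02d}:00" for h in range(24)],
            "format": "netcdf",
        },
        "era5_t2m_1940_01.nc",   # one month of hourly data per file
    )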

PolarizedPoutin 12 days ago
The data is publicly available!

The data is freely available from the Climate Change Service [1] which has a nice API but download speeds can be a bit slow. You'll have to sign up for this.

NCAR's Research Data Archive [2] provides some of the data (as pre-generated NetCDF files) but at higher download speeds. No signup necessary.

It's not super well documented but I hosted the Python scripts I used to download the data on the accompanying GitHub repository [3].

[1]: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...

[2]: https://rda.ucar.edu/datasets/ds633-0/

[3]: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...

kinj28 13 days ago
It's a tangent.

I am curious: if we query the data to give us temperature at a given time for all lat and long and plot it geospatially, would the result show anything about the distribution of heat energy received across lat and long at that point in time?

PolarizedPoutin 12 days ago
What do you mean by the heat distribution of energy? Do you mean like how much heat is received from the sun at a particular location and particular time? If so, then temperature is only the result of this plus other factors. Factors like cloud cover influence how much radiation is received at the surface.
Linda231 13 days ago
[dead]
_bax 13 days ago
[flagged]
sixo 13 days ago
The timeseries graph is too fine-grained to discern any long-term trends in average temperatures. This comes up a lot, these kinds of graphs always give people problems.

A quick google turned up this website which shows the trend in annual means: https://www.meteoblue.com/en/climate-change/durban_south-afr...

throwaway918274 13 days ago
the Heisenberg climate change principle
sixo 13 days ago
No, you misunderstand (probably because you wish that was true). It's a basic conclusion of data visualization principles. Any method that could detect the trendline would—e.g. basic statistics—but "plot every single datapoint in a big cluster" can't, so it doesn't.
spi 13 days ago
If climate change were visible at that scale (tiny resolution between 0 and 40 degrees) we'd all have been boiled a while ago.

Still, you can see signs: the maximum temperature until 1990 or so seems to be around 35 degrees; since then there are several peaks above that value, and in 2016 (?) it looks to be 38-39. It's certainly less visible in the low peaks, maybe because the absolute lowest values appear to be in the 1990-2000 decade, but then again, the minima in the 2010-2020 decade all seem to be slightly higher than the minimum temperature in any other decade.

That said, there is massive downscaling involved in such scale, so I wouldn't be too surprised if some details were just skipped and not visible. I wouldn't trust this interpretation much - if a visualization it needs to be, I'd rather plot a moving average with a window of 6 months at least (or even 1 year to entirely rule seasonalities out), and see if that one has an upward trend or not (I bet it does).

[EDIT] I now see the post below with the year averages since 1979. It does indeed seem that 1995-1997 were abnormally cold years, and also that 2010-2020 is the warmest decade since then (and likely since quite a bit longer). So the outliers analysis here above seem to stand :-)
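
A rough sketch of that kind of smoothing, assuming the hourly temperatures are in a pandas Series with a DatetimeIndex (all names here are made up; the synthetic series is just a stand-in so the snippet runs on its own):

    import numpy as np
    import pandas as pd

    # Stand-in for an hourly 2 m temperature series at one location
    idx = pd.date_range("1940-01-01", "2023-12-31 23:00", freq="h")
    hourly_t2m = pd.Series(
        20 + 8 * np.sin(2 * np.pi * idx.dayofyear / 365.25) + np.random.randn(len(idx)),
        index=idx,
    )

    daily = hourly_t2m.resample("D").mean()
    # A 365-day centered moving average smooths out the seasonal cycle
    smoothed = daily.rolling(window=365, center=True).mean()
    print(smoothed.dropna().head())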

dfraser992 13 days ago
Tech lead for WEMC here: see https://tealtool.earth for straightforward charts of climate-related data for different countries and regions around the globe.

For temperature and a few other variables, it shows historical data from the EU Copernicus service (C3S) along with three different projected series out to 2100.

For CO2, it shows the latest historical data.

The charts are concerning, and I am sure my co-workers are not hell-bent on faking data to scare people just to get more funding; they work too much and go to too many meetings.

_bax 13 days ago
The signs are too small, in very noisy data, to justify spreading panic to the world and its people.
PolarizedPoutin 13 days ago
I have not analyzed any data yet, and the purpose of plotting the time series was to show an example of the data as a function of time. As others have already mentioned, the swing in Durban temperatures over the seasonal cycle is ~25°C, while global temperature increases due to climate change so far are on the order of 1°C.

Plus, weather data tends to be quite noisy; just think how variable the weather can be day to day, and we're squishing 80 years of that into one plot. It's also worth noting that different places may experience climate change differently. Some places may see the average temperature go up, some perhaps only in the summer, so you'll have to look at averages. Some places may see more extreme summer highs, in which case you can't just look at averages but at the average extremes or the tail end of the temperature distribution (rough sketch below).

So it'll be hard to discern any climate change from just a cursory glance. I'm not saying it's there, just that it requires more analysis.
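
For example, one way to look at the tails rather than the mean is to compare yearly means against a high percentile of the daily values. A minimal sketch with made-up names (the synthetic series is just a stand-in for the real data):

    import numpy as np
    import pandas as pd

    # Stand-in for a daily-mean 2 m temperature series at one location
    idx = pd.date_range("1940-01-01", "2023-12-31", freq="D")
    daily_t2m = pd.Series(
        20 + 8 * np.sin(2 * np.pi * idx.dayofyear / 365.25) + np.random.randn(len(idx)),
        index=idx,
    )

    yearly_mean = daily_t2m.groupby(daily_t2m.index.year).mean()
    yearly_p95 = daily_t2m.groupby(daily_t2m.index.year).quantile(0.95)  # the hot tail
    print(pd.DataFrame({"mean": yearly_mean, "p95": yearly_p95}).tail())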

bo0tzz 13 days ago
Have you read any climate science? These are not the (only) numbers that the knowledge is based on.
_bax 13 days ago
I am an engineer; I read a lot, and numbers are numbers, not a religion.
bigbillheck 13 days ago
Huh, I wonder if climatologists might have based their analyses on more than just this single time series. No way of knowing.
jefb 13 days ago
This must be what "doing your own research" looks like. Maybe try here instead of squinting at a single time series for Durban: https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6...
_bax 13 days ago
Why trust IPCC ?
royjacobs 13 days ago
Why not trust experts in their field?
_bax 13 days ago
conflict of interest
royjacobs 13 days ago
Right, they got all of them, did they?
TickleSteve 13 days ago
You're trying to see a small change over a large fuzzy area. You need to look at the deviation from the mean (or some other reference), maybe filtered to remove the seasonal changes.
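
One way to do that, assuming a daily temperature series in pandas (the names and the synthetic stand-in data are made up), is to subtract a day-of-year climatology and look at the anomalies:

    import numpy as np
    import pandas as pd

    # Stand-in for a daily-mean temperature series at one location
    idx = pd.date_range("1940-01-01", "2023-12-31", freq="D")
    daily_t2m = pd.Series(
        20 + 8 * np.sin(2 * np.pi * idx.dayofyear / 365.25) + np.random.randn(len(idx)),
        index=idx,
    )

    # Subtract the long-term mean for each day of the year to remove the seasonal
    # cycle, leaving the deviation from the reference climatology
    climatology = daily_t2m.groupby(daily_t2m.index.dayofyear).transform("mean")
    anomaly = daily_t2m - climatology
    print(anomaly.groupby(anomaly.index.year).mean().tail())  # yearly-mean anomaly
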
_bax 13 days ago
I know what I am seeing. I know how a graph like that is made: 80 years of noisy data squeezed into one small strip. But it's enough to tone down the panicked rhetoric about climate change.
jhoechtl 13 days ago
For the German speakers among us: that should be 1 000 000 000 000 rows. German on the left, English on the right:

Million - million

Milliarde - billion

Billion - trillion

aftbit 13 days ago
Yes, this article is actually loading (order of magnitude) 10^12 rows.

I was very surprised when I first learned about this. In US English:

10^3 -> thousand

10^6 -> million

10^9 -> billion

10^12 -> trillion

10^15 -> quadrillion

10^18 -> quintillion

But there's another scale too! See this Wikipedia article for more:

https://en.wikipedia.org/wiki/Names_of_large_numbers

samatman 13 days ago
Unless you're reading historical documents, these words have only one meaning in the English language. They are in no sense specific to American English.

One advantage of the short scale is that its cadence matches the SI prefixes: million is mega, billion is giga, trillion is tera, and so on. The long scale of course has the -ard names, so this isn't exactly a problem, any more than French saying "four twenties and nineteen" for ninety-nine is a problem.

That difference is one of the major reasons the SI prefixes exist in the first place, in fact.

It also matches the three-digit grouping used by everyone but Indians, which strikes me as more logical.

aftbit 13 days ago
Sorry, I wasn't intending to imply that they were American English specific, just that I only have experience with English as spoken in the USA.

What's the advantage of the long scale? Just historical?

samatman 13 days ago
I don't think the long scale actually has advantages over the short one. One could argue euphony: the alternating repetition of million, milliard, billion, billiard, and so on is pleasing in a way.

But really, each is just a quirk of language. I don't expect 7 to sound like "seven" in other languages, so why expect billion to sound like "billion" rather than "milliard" or "trillion"? When conveying information across languages, we use numerals, sometimes with the SI prefixes, which are universal.

Just another confusing faux ami to figure out when learning another language, really.

jhoechtl 12 days ago
The downvoters stand out as those who don't know. Thank you, amis!
rjmunro 13 days ago
This annoys me:

> You can use parameterized queries to protect against SQL injection,

No, that's not what parameterized queries are for. That's just a side benefit.

Parameterized queries exist so that the database doesn't have to parse the data out of the SQL to run the query. It's stupid to turn all the data into text, send that text over the wire, escaped and quoted as appropriate, and then parse it back into whatever form it was in originally, when you can just send it in binary straight to the database.

And if you are doing many similar queries with different values, e.g. the inserts here, or querying the user table by user id on every page load, the database doesn't have to parse any SQL each time and can even reuse the query plan (rough sketch below).

This may be why psycopg3 performed better than pandas' df.to_sql() in the single-row insert case.
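
For example, with psycopg3 the parameters travel separately from the SQL text (server-side binding, in binary where possible) rather than being pasted into the query string. A minimal sketch; the connection string, table, and sample rows are all made up:

    import psycopg

    rows = [
        ("2020-01-01 00:00+00", -29.85, 31.02, 24.3),  # made-up sample values
        ("2020-01-01 01:00+00", -29.85, 31.02, 24.1),
    ]

    # Hypothetical DSN and table; the %s placeholders are bound from separately
    # transmitted parameters, never interpolated into the SQL string.
    with psycopg.connect("dbname=weather") as conn:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO weather (time, lat, lon, t2m) VALUES (%s, %s, %s, %s)",
                rows,
            )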

davedx 13 days ago
I disagree that security is a side benefit. It's why most people choose to use parameterized queries.
PolarizedPoutin 13 days ago
Thank you for reading through it thoroughly and pointing this out! I'm still new to Postgres and still learning, so this is good to know. I will update the post.
munk-a 13 days ago
I would note that I think the above is a rather minority opinion: while parameterized queries are great for reuse, simply using them to guard against SQL injection is still an excellent use of them. If your query is reusable then go for it, but most complex queries tend not to be reusable if they involve optional inputs.
dboreham 13 days ago
Quick note to observe that all of the above, while true, becomes less of a practical issue as CPUs get faster relative to I/O, which they have done and probably will keep doing.
simiones 13 days ago
Isn't the opposite mostly happening, with CPUs' single-core performance mostly flat for the last decade while I/O speeds keep improving every year (especially in networking; I'm not as sure about disk I/O)?
forrestthewoods 13 days ago
Wait, what? No, stop that. What are you talking about?

Transforming data and then transforming it back will always be stupid and costly, in more ways than one.

relaxing 13 days ago
Not if the transit time you save outweighs the extra processing time.
rjmunro 12 days ago
True, but in this case we are converting to strings that are longer, because they require quoting and escaping, etc., so transit time will be longer too.
pphysch 13 days ago
Feels like you're a couple of decades behind. Single-core speeds haven't gone up much in the last decade, especially for "server" CPU models.
eddd-ddde 13 days ago
Can it actually reuse the query plan? Couldn't that eventually lead to a worse-performing plan, say after inserting enough data that a different plan becomes the better option?
rjmunro 13 days ago
I think it's going to be very rare for a database to change enough during the life of a prepared query for its plan to become significantly worse than the optimal one.

A database engine could easily mitigate this by re-planning a query every 1,000 executions, or by having the plan expire after 60 seconds or so. It might be worth re-preparing the query from time to time as part of this kind of bulk-insertion benchmarking.

The other thing that would be good to try is closing the transaction every 1,000 inserts, or some other number. Doing several inserts in a single transaction is certainly better than one insert per transaction, but there may come a point where an overly large transaction starts to slow things down (rough sketch of the idea below).
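
A sketch of that batched-commit idea with psycopg3; the batch size, connection string, and table are all made up and worth varying in a benchmark:

    import psycopg

    BATCH = 1000  # hypothetical batch size; worth benchmarking different values

    def insert_in_batches(rows):
        # rows: iterable of (time, lat, lon, t2m) tuples; DSN and table are hypothetical
        with psycopg.connect("dbname=weather") as conn:
            with conn.cursor() as cur:
                for i, row in enumerate(rows, start=1):
                    cur.execute(
                        "INSERT INTO weather (time, lat, lon, t2m) VALUES (%s, %s, %s, %s)",
                        row,
                    )
                    if i % BATCH == 0:
                        conn.commit()  # close the transaction every BATCH inserts
            conn.commit()  # commit the remainder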

munk-a 13 days ago
Usually the TTLs you'll set on these plans[1] are pretty short. It's helpful if you need to do a burst of a thousand queries of the same form over a short interval[2].

1. At least effectively; this is usually managed by persisting handles or the plans themselves.

2. Which is usually a bad code smell from ORM overuse. Every time you dispatch a query you're paying latency overhead, so in an ideal world your report delivery will involve a single query against the database per report request (whether this is a website and those requests are page loads, or you're running some internal system for business reporting).