One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful about understanding which coordinate reference system (CRS) is being used and, for visualization, which projection. The CRS is somewhat paranoia-inducing if you don't have great infrastructure set up with the right tools to carry that metadata along with your geospatial data.
I've tested everything AWS has to offer, Postgres/PostGIS, Spark/DB, Snowflake, Trino, and ArcGIS.
I'm convinced the best tool for large scale geospatial work is Google BigQuery and it's not even close. It took an expensive multi-hour query running on PostGIS (deployed on an enormous m6a EC2 instance) down to less than 5 seconds, running in the BigQuery free tier. It does make sense if you think about it, since Google was early with Maps, but it is just stunning how much better they are in this specific niche domain.
This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have geospatial data types and the open source packages that add support leave a lot to be desired.
This guy is running on-prem, so maybe this made sense, but I just would never bother. The storage for BQ would probably be less than $100/month for 20 TB.
The system is a Mac Studio with 128 GB of RAM running Asahi Linux, with mmapped Parquet files and DuckDB. It also runs Airflow for us, and with Nix it can be used to accelerate developer builds and run the Airflow tasks for the data team.
GCP is nice when it is free/cheap but they keep tabs on what you are doing and may surprise you at any point in time with ever higher bills without higher usage.
I would love it if somehow Postgres got duckdb powered columnstore tables.
I know hydra.so is doing columnstores.
DuckDB being able to query parquet files directly is a big win IMO.
I wish we could bulk insert parquet files into stock PG.
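For what it's worth, DuckDB can already bridge most of that gap: its postgres extension lets you query Parquet directly and push the result into a stock Postgres table. A rough sketch (the connection string and table name are placeholders):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL postgres; LOAD postgres;")
    # Attach a stock Postgres database (connection string is made up)
    con.execute("ATTACH 'dbname=weather host=localhost user=me' AS pg (TYPE postgres)")
    # Read the Parquet files directly and bulk-insert them into Postgres
    con.execute("""
        CREATE TABLE pg.era5_obs AS
        SELECT * FROM read_parquet('data/era5/*.parquet')
    """)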
And sooner or later (usually sooner) somebody will build a fancy Grafana dashboard and set it to refresh every 5 seconds, and you won't notice until it's too late.
My org would never allow that as we're in a highly regulated and security conscious space.
Totally agree about the BQ costs. The free tier is great and I think pretty generous, but if you aren't very careful about enforcing that tables are only created with partitioning and clustering wherever possible, and don't provide some training for devs on how to deal with columnar DBs if they're not familiar, the bills can get pretty crazy quickly.
Curious to see if there were other reasons.
What? really? Do they change your pricing plan? How can they charge more for the same usage?
I kept getting emails about small changes, and the bills got bigger all over the place, including for BigQuery and for how they dealt with queries on public datasets. The bill got higher.
There is a non-zero chance I conflated things. But from my point of view: I created a system and left it running for years; then the bills got higher out of the blue and I moved off.
Could you say more about this? I'm curious if you've compared Apache Sedona [0] and what specifically you found lacking? I currently work at Wherobots [1], founded by the creators of Apache Sedona and would love to hear any feedback.
I constantly see folks finding out the hard way that PostGIS is really hard to beat. The fact that Trino/Presto and Spark have languished here is particularly telling.
My org has a very large AWS spend and we got to have a chat with some of their SWEs who work on the geospatial processing features for Redshift and Athena. We described what we needed and they said our only option was to aggregate the data first or drop the offending rows. Obviously we're not interested in compromising our work just to use a specific tool, so we opted for better tools.
The crux of the issue was that the large problem column was the geometry itself, specifically MultiPolygon. You need to use the geometry datatype for this[1]. However, our MultiPolygon column was tens to hundreds of MBs, well outside the max size for the SUPER datatype, which from what I can tell is 16 MB.
[1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryType-...
I much prefer postgres as a tool I can run myself but the order of magnitude performance benefits are hard to argue with.
I wonder if you think a long-term project is better off rolling its own. The second priority is that I expect it all to be locally hosted.
Thanks for any considerations.
This is not something I'm productionizing (at least not yet?) and I'm giving myself zero budget since it's a side project, but if/when I do I'll definitely look into BigQuery!
https://cloud.google.com/bigquery/docs/geospatial-data#parti...
If that's not a large node for you, well you're living in a different world from me. Yes to using EBS.
You would just have to set up replication for backups, but this could just be rsync to EBS or some other replication solution depending on your database.
Then there’s the x-family, which can go up to bonkers levels, like 4 TiB of RAM (and local NVMe).
As a sibling comment mentioned, though, if you can fit the data into local storage, that’s the way to do it. EBS latency, even with io2.blockexpress, simply cannot compete.
That said, if I did the math correctly based on the 71 GB/31 days footnote, you’re looking at about 2.15 PB to load the entire thing, so, uh, good luck.
At my last job, I switched a large-ish (100K QPS at peak) m5 MySQL instance to a smaller (in AWS numbering) r6i that was cheaper, despite having the same amount of RAM and being generation newer. That, combined with careful tuning and testing, resulted in queries speeding up 20-40%, AND we then had plenty of room for vertical scaling again if necessary.
EDIT: only newer C instances have comparable uplink to EBS, C5 or C4 (and some C6) starts from ~4.7gbps. just compare the EBS bandwidth column in https://aws.amazon.com/ec2/instance-types/
This rarely matters until it does, like if you’re setting up an active-active DB across regions. Then it’s a fun surprise.
My initial reaction is that I think one of the reasons you're seeing a hypertable being slower is almost certainly that it creates an index on the timestamp column by default. You don't have an index on your standard table which lets it go faster.
You can use create_hypertable with create_default_indexes=>false to skip creating the index, or you can just drop the index before you ingest data. You'll eventually want that index - but it's best created after ingestion in a one-shot load like this.
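For anyone curious, a minimal sketch of that flow (table and column names are placeholders), driving it from Python with psycopg:

    import psycopg

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        # Skip the default time index so the one-shot bulk load runs faster
        cur.execute("""
            SELECT create_hypertable('weather', 'time',
                                     create_default_indexes => false)
        """)
        # ... bulk load the data here (COPY, pg_bulkload, etc.) ...
        # Build the time index once, after ingestion
        cur.execute("CREATE INDEX ON weather (time DESC)")
        conn.commit()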
I'd also be interested in how the HDD you're reading data from is holding up in some of the highly parallel setups?
Ah, I did not know about `create_default_indexes=>false` or that a time index is created by default for hypertables. I'll add a note to explain this! Also curious to benchmark inserting without the time index and then creating it manually.
Even with 32 workers I think the HDD was fine. I did monitor disk usage through btop and the SSD that Postgres lived on seemed to be more of a bottleneck than the HDD. So my conclusion was that a faster SSD for Postgres would be a better investment than moving the data from HDD to SSD.
Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy the dataset's structure unless you were solely and exclusively extracting point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!
The vast majority of research and operational applications of datasets like ERA5 are probably better suited by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases that one might care about.
[1]: https://cloud.google.com/storage/docs/public-datasets/era5
But as rabernat pointed out in his comment, pulling out a long time series from the cloud replica is also slow. And I know I eventually want to perform complex spatio-temporal queries, e.g. computing the 99th percentile of summer temperatures in Chile from 1940-1980.
I don't doubt that a cloud replica can be faster, but it's at odds with my budget of $0 haha.
Also, I've found the diurnal profile from ERA5 analysis can be abysmal in some locations. ERA5-Land is much better, high resolution, though only available over... er... land.
To your point about not relying on the cloud, it's noted in the Google option [1] linked above:
> Update Frequency: The ERA5 dataset is currently not refreshed in the Google Cloud Public Dataset Program. The program provides ERA5 data spanning from 1940 to May 2023.
Another alternative, Amazon [2], also deprecated:
> The provider of this dataset will no longer maintain this dataset. We are open to talking with anyone else who might be willing to provide this dataset to the community.
I'll have to try the binning and compare it with listing all the values and finding the 99th percentile! But yeah the spatial part of that query might be the tougher part.
And yeah I also considered whether to use ERA5 or ERA5-Land. I don't remember if ERA5-Land was missing a variable or I just wanted to look at some ocean grid cells, but I decided to start with vanilla ERA5.
This highlights the need for timeseries-optimized chunking if that is your anticipated usage pattern.
Under the hood Open-Meteo is using a custom file format with time-series chunking and specialised compression for low-frequency weather data. General purpose time-series databases do not even get close to this setup.
I think what's missing here is an analysis of what is gained by moving the weather data into a RDBMS. The motivation is to speed up queries. But what's the baseline?
As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low latency performance (sub second) for timeseries queries on weather data--much faster than the 30 minutes cited here--_if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.
I agree that storing the data in appropriately chunked Zarr files is almost surely going to be faster, simpler to set up, and take up less space. You could even put an API in front of it to get "queries".
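For reference, a point-timeseries pull against an appropriately chunked Zarr store is only a few lines of xarray; the store path and variable name below are placeholders, not the real ARCO-ERA5 layout:

    import xarray as xr

    # Open a (hypothetical) Zarr store lazily; only the chunks touched get read
    ds = xr.open_zarr("gs://some-bucket/era5.zarr")

    # Hourly 2m-temperature timeseries at the nearest grid point, 1980s only
    ts = ds["t2m"].sel(latitude=-33.45, longitude=289.3, method="nearest")
    ts_1980s = ts.sel(time=slice("1980-01-01", "1989-12-31")).load()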
I also agree that I haven't motivated the RDBMS approach much. This is mainly because I took this approach with Postgres + Timescale since I wanted to learn to work with them, and playing around with ERA5 data seemed like the most fun way. Maybe it's the allure of weather data being big enough to pose a challenge here.
I don't have anything to back this up but I wonder if the RDBMS approach, with properly tuned and indexed TimescaleDB + PostGIS (non-trivial to set up), can speed up complex spatio-temporal queries, e.g. computing the 99th percentile of summer temperatures in Chile from 1940-1980, in case many different Zarr chunks have to be read to find this data. I like the idea of setting up different tables to cache these kinds of statistics, but it's not that hard to do with Zarr either.
I'm benchmarking queries and indexes next so I might know more then!
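For the curious, the kind of query I have in mind looks roughly like this (the schema and the crude lat/lon box around Chile are made up for illustration; PostGIS would give a proper polygon test instead):

    import psycopg

    QUERY = """
        SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY temperature_2m)
        FROM weather
        WHERE time >= '1940-01-01' AND time < '1981-01-01'
          AND extract(month FROM time) IN (12, 1, 2)   -- austral summer
          AND latitude BETWEEN -56 AND -17             -- rough box around Chile
          AND longitude BETWEEN 280 AND 294
    """

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        print(cur.fetchone()[0])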
There seems to be an inherent pull between large chunks (great for visualising large extents and larger queries) vs smaller chunks for point-based or timeseries queries. It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets. I have heard of "kerchunk" being used to try and get the best of both, but then I _think_ you lose out on the option of compressing the data and it introduces quite a lot of complexity.
What are your thoughts on how to strike that balance between use cases?
Like all things in tech, it's about tradeoffs. S3 storage costs about $275 per TB per year. Typical weather datasets are ~10 TB. If you're running a business that uses weather data in operations to make money, you could easily afford to make 2-3 copies that are optimized for different query patterns. We see many teams doing this today in production. That's still much cheaper (and more flexible) than putting the same volume of data in an RDBMS, given the relative cost of S3 vs. persistent disks.
The real hidden costs of all of these solutions is the developer time operating the data pipelines for the transformation.
One suggestion aimed at the author here: some of your experiments are taking out WAL writing in a sort of indirect way, using pg_bulkload and COPY. There's one thing you could try that wasn't documented yet when my buddy Craig Ringer wrote the SO post you linked to: you can just turn off the WAL in the configuration. Yes, you will lose the tables in progress if there's a crash, and when things run for weeks those happen. With time scale data, it's not hard to structure the loading so you'll only lose the last chunk of work when that happens. WAL data isn't really necessary for bulk loading. Crash, clean up the right edge of the loaded data, start back up.
Here's the full set of postgresql.conf settings I run to disable the WAL and other overhead:
    wal_level = minimal
    max_wal_senders = 0
    synchronous_commit = off
    fsync = off
    full_page_writes = off
    autovacuum = off
    checkpoint_timeout = 60min
Finally, when loading in big chunks, to keep the vacuum work down I'd normally turn off autovac as above then issue periodic VACUUM FREEZE commands running behind the currently loading date partition. (Talking normal PG here) That skips some work of the intermediate step the database normally frets about where new transactions are written but not visible to everyone yet.
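A rough sketch of that loading pattern (table and chunk names are placeholders, and the bulk-load helper is hypothetical):

    import psycopg

    # autocommit, because VACUUM cannot run inside a transaction block
    with psycopg.connect("dbname=weather", autocommit=True) as conn:
        with conn.cursor() as cur:
            for month in ("2023_01", "2023_02", "2023_03"):
                load_month(cur, month)  # hypothetical bulk-load helper
                # Freeze the partition we just finished, behind the loading edge
                cur.execute(f"VACUUM FREEZE weather_{month}")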
Also glad to hear an expert say that WAL data isn't really necessary for bulk loading, especially with chunks. I should get through the ~20 days it takes to load data without a power outage haha (no UPS yet :() but it sounds like even in the worst case I can just resume.
Short version is GIS indexes are notably smaller and build faster in PG15 than earlier versions. It's a major version to version PG improvement for these workloads.
https://rdrn.me/optimising-sql/
Similar sequence of investigations, but using regular Postgres rather than Timescale. With my setup I got another ~3x speedup over COPY by copying binary data directly (assuming your data is already in memory).
footnote: https://aliramadhan.me/2024/03/31/trillion-rows.html#fn:copy...
[0] https://rdrn.me/optimising-sampling/#round-10-off-the-deep-e... [1] In the original blog I linked.
Love this. It’s the exact opposite of all the other ‘well actually’ mainstream tech posts and I am here for all of it. Props for keeping the reader fully engaged on the journey.
> At a sustained ~462k inserts per second, we're waiting ~20 days for our ~754 billion rows, which is not bad I guess. It's less time than it took me to write this post.
Hah, as I've been gravitating more to writing larger and more in depth blog posts I can relate to the surprising effort it can require.
I'm thinking perhaps you could add support for Timeplus [1]? Timeplus is a streaming-first database built on ClickHouse. The core DB engine Timeplus Proton is open source [2].
It seems that qStudio is open source [3] and written in Java and will need a JDBC driver to add support for a new RDBMS? If yes, Timeplus Proton has an open source JDBC driver [4] based on ClickHouse's driver but with modifications added for streaming use cases.
2: https://github.com/timeplus-io/proton
What I do is download each netCDF file, transpose, and insert into a massive 3D HDF file organized as [lat, lon, hour]. On my workstation it takes about 30 minutes to create one year for one variable (no parallel I/O or processes) but then takes milliseconds to pull a single (lat, lon) location. Initial pain for long-term gain. Simplistic, but I'm just a climatologist not a database guru.
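Not their exact code of course, but the idea is roughly this (file and variable names are made up):

    import numpy as np
    import xarray as xr
    import h5py

    # One year of hourly ERA5 2m temperature: dims (time, latitude, longitude)
    ds = xr.open_dataset("era5_t2m_2020.nc")
    t2m = ds["t2m"].values                  # shape (hours, nlat, nlon)

    # Transpose to (lat, lon, hour) so one location is contiguous on disk
    t2m = np.transpose(t2m, (1, 2, 0))

    with h5py.File("t2m_by_location.h5", "a") as f:
        # Chunk along (lat, lon): one point's timeseries lives in one chunk
        f.create_dataset("t2m/2020", data=t2m, chunks=(1, 1, t2m.shape[2]))

    # Later: milliseconds to pull a single (lat, lon) timeseries
    with h5py.File("t2m_by_location.h5", "r") as f:
        series = f["t2m/2020"][100, 200, :]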
This is a good dataset though and the level of detail in the post is appreciated. I'll give ClickHouse a go on the same...
Disclaimer: I work for ClickHouse
1) I always wonder if there is a better way than COPY. I tend to quickly hit 100% CPU without saturating I/O.
2) The row overhead seems big. A row has 23 bytes of overhead and this table has 48 bytes of data per row, so even without page overhead, we lose ~1/3 of our storage. This is pure data storage, without any index.
2. That is a good point. I'm hoping TimescaleDB's compression helps here but yeah I don't think you'll ever get the database size below the data's original footprint.
(On some metrics data internally, I have 98% reduction in size of the data).
One of the reasons this works is that you only pay the per-tuple overhead once per grouped row, which can cover as many as 1,000 rows.
The other is the compression algorithm, which can be TimescaleDB's own or plain PostgreSQL TOAST:
https://www.timescale.com/blog/time-series-compression-algor... https://www.postgresql.org/docs/current/storage-toast.html
The times I've tested it, the improvement is very small as compared to plain copy, or copy with CSV, whereas it does require more work and thought upfront to ensure the binary actually works correctly.
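For concreteness, here's roughly what the two variants look like with psycopg 3 (table and columns are placeholders); the binary form needs the column types spelled out and the Python values to match them exactly:

    from datetime import datetime, timezone
    import psycopg

    rows = [(1, datetime(2020, 1, 1, tzinfo=timezone.utc), 12.5)]

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        # Plain (text-format) COPY
        with cur.copy("COPY obs (station_id, time, temperature) FROM STDIN") as copy:
            for row in rows:
                copy.write_row(row)

        # Binary COPY: skips text parsing, but column types must be declared
        with cur.copy("COPY obs (station_id, time, temperature) FROM STDIN (FORMAT BINARY)") as copy:
            copy.set_types(["int4", "timestamptz", "float8"])
            for row in rows:
                copy.write_row(row)
        conn.commit()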
[0] - Honestly some of the coolest stuff I ever got to do in its own right. Building tools that could move data between AutoCAD, Microstation, and Google Earth while also importing other bits with metadata from Trimble units[1]. Also it was probably the most I ever used math in my entire career [2], so there's that.
[1] - I wound up finding a custom font maker, and one of our folks made a font library with the symbols, made it easy to write a parser too :D
[2] - One of those PDFs I always seem to wind up having floating on a hard drive is the USGS 'Map Projections, a working manual'. At one point I used it as a reference to implement a C# library to handle transforms between coordinate systems... alas it was internal.
Indexes are per chunk. So if the query analyzer understands you only touch 2023 it can omit looking at any chunk that is from other years and keep those out of memory. Same with the indexes.
I know Timescale has a blog post from 2017 claiming a 20x higher insert rate but that's for inserting into a table with an index. The general wisdom for loading huge amounts of data seems to be that you should insert into a table with no indexes then build them later though. So with no index, inserting into a hypertable seems a bit slower.
Timescale blog post: https://medium.com/timescale/timescaledb-vs-6a696248104e
This is a very good instinct! A pretty major portion of modern climate science is paleoclimatology, with a goal of reaching far beyond reliable human measurements. A lot of earth's previous climate states were wildly different from the range of conditions we have experienced in the past 10,000 years, and a better climate record is essential to predicting the effects of massive carbon emission.
Ice cores from Antarctica/Greenland are the most famous instance of this, but there's a lot of other cool ones - there are chemical records of climate change in cave stalactites, ocean floor sediments, etc.
After that I discovered ClickHouse and forgot about all this performance tuning, since ClickHouse could ingest the same analytics data at 10x higher speed on the same hardware with default settings. On top of this, it could perform analytical queries up to 1000 times faster than PostgreSQL on the same hardware. And, as a bonus, disk space usage dropped by more than 10 times compared to PostgreSQL.
My prediction: if the ingestion of 750 billion rows takes 20 days on highly tuned TimescaleDB and Postgresql, it will take less than a day on Clickhouse with standard settings on the same hardware.
P.S. I conducted a similar benchmark on a single-node VictoriaMetrics a few years ago - it was capable of ingesting 500 billion samples in ~2 hours - https://valyala.medium.com/billy-how-victoriametrics-deals-w... . Every sample contains a single measurement, while every row in the original article contains 5 measurements. So, extrapolating, it would take `(2h * 5)/500*750 = 15h` to ingest 750 billion rows of weather measurements into a single-node VictoriaMetrics.
I'm trying out Clickhouse for the next post. Definitely excited for sub 24 hour data loading!
I haven't heard of VictoriaMetrics but that's some impressive performance. Will check it out!
Remember that modern global surface temperatures are collected by satellites, and the dataset is commingled with recordings observed visually and made by hand using buckets by sailors who were not primarily academic researchers. Segments of high-resolution, low-noise data (satellites) are mixed with low-resolution, low-coverage, high-noise records (hand records made on boats around the United Kingdom).
My point is to be in awe of the technical aspects of this effort but also keep in mind that we are only making copies of low resolution, noisy manuscripts from sailors 170 years ago.
The great thing about this data is it is generally append only, unless errors are found in earlier data sets. But it’s something that usually only happens once a year if at all.
I haven't compared how this performs against a low-level SQL library, but it outperforms everything else I've tried in SQLAlchemy.
https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedS...
Did modern APIs & protocols simply fail to carry this evolutionary strand forward?
This is querying a model rather than real observations, so it may more reflect the model's understanding of the world. Still useful; just worth noting I think.
The data is output from a climate model run that is constrained to match weather observations.
So where we have lots of weather observations, ERA5 should match it closely.
And where we do not have any weather observations, ERA5 will be physically consistent and should match the climatology, i.e. the simulated weather’s statistics should match reality.
ERA5 covers the entire globe at 0.25 degree resolution, and stretches back in time to 1940 with hourly resolution.
For the time frame (post-1940) there are likely to be actual weather data points twice a day at almost any point on the globe with cities and/or ports that colonial expansion reached. What I'd like to see in such tools are two things:
* Indications of actual observed data points (location, time)... so that the observed point mesh around the globe for a day in 1949 can be seen, and
* Interpolation for a modern day (with dense observed data points) from a reduced set of sparse points (just the ones, say, from that day in 1949) to get a sense of how the infill from sparse points matches the observed but unused points.
I'm not a climate skeptic in the least; global estimation of a coarse global mean from relatively sparse points is fine in my experience (I've done a lot of exploration geophysics for mineral and energy resource mapping), and a bit of interpolated wiggle in the model doesn't blow up the big coarse estimation. But this question of interpolation from sparse data is one that always gets raised by skeptics.
Totally agree that it would be cool to see when and where the weather observations are that are informing the climate model. A quick search didn't turn up this dataset, but it may be somewhere. I'll see if I can find this data.
Maybe we should be recording and transmitting the parameters for a function or set of functions that describes these temperature etc changes rather than a set of points to drastically reduce the size of these data sets.
That's interesting. Why store it? Why not compute it as needed using the model?
FWIW, I am not an expert in this space and if someone is, it would be good to understand it.
https://docs.timescale.com/self-hosted/latest/install/instal...
I'd personally recommend StackGres, it's a great project.
- https://www.timescale.com/blog/what-is-clickhouse-how-does-i...
Sidenote: Timescale _does_ provide columnar storage. I don't believe that the blog author focused on this as part of insert benchmarks:
- Timescale columnar storage: https://www.timescale.com/blog/building-columnar-compression...
- Timescale query vectorization: https://www.timescale.com/blog/teaching-postgres-new-tricks-...
That's different from a data warehouse or many traditional "OLAP" use cases, where you might dump a big dataset statically, and then people will occasionally do ad-hoc queries against it. This is the big weather dataset file sitting on your desktop that you occasionally query while on holidays.
So it's less about "can you store weather data", but what does that use case look like? How are the queries shaped? Are you saving a single dataset for ad-hoc queries across the entire dataset, or continuously streaming in new data, and aging out or de-prioritizing old data?
In most of the products we serve, customers are often interested in recent data in a very granular format ("shallow and wide"), or longer historical queries along a well defined axis ("deep and narrow").
For example, this is where the benefits of TimescaleDB's segmented columnar compression emerges. It optimizes for those queries which are very common in your application, e.g., an IoT application that groups by or selected by deviceID, crypto/fintech analysis based on the ticker symbol, product analytics based on tenantID, etc.
If you look at ClickBench, what most of the queries effectively say is: scan ALL the data in your database, and GROUP BY one of the 100 columns in the web analytics logs.
- https://github.com/ClickHouse/ClickBench/blob/main/clickhous...
There are almost no time-predicates in the benchmark that Clickhouse created, but perhaps that is not surprising given it was designed for ad-hoc weblog analytics at Yandex.
So yes, Timescale serves many products today that use weather data, but has made different choices than Clickhouse (or things like DuckDB, pg_analytics, etc) to serve those more operational use cases.
(I'm currently inserting large batches of rows into MySQL and curious about Clickhouse...)
I have a feeling the general public has very limited access to weather data, and graphs in the news stating "are we getting hotter or colder" are all sensational and hard to verify.
Disclosure: I am a TimescaleDB engineer.
Could someone point me to where I can download the whole dataset?
The data is freely available from the Climate Change Service [1] which has a nice API but download speeds can be a bit slow.
NCAR's Research Data Archive [2] provides some of the data (as pre-generated NetCDF files) but at higher download speeds.
It's not super well documented but I hosted the Python scripts I used to download the data on the accompanying GitHub repository [3].
[1]: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...
[2]: https://rda.ucar.edu/datasets/ds633-0/
[3]: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...
From https://deepmind.google/discover/blog/graphcast-ai-model-for...
> Crucially, GraphCast and traditional approaches go hand-in-hand: we trained GraphCast on four decades of weather reanalysis data, from the ECMWF’s ERA5 dataset.
Is it free or available if I sign up?
[0] https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...
The data is freely available from the Climate Change Service [1] which has a nice API but download speeds can be a bit slow. You'll have to sign up for this.
NCAR's Research Data Archive [2] provides some of the data (as pre-generated NetCDF files) but at higher download speeds. No signup necessary.
It's not super well documented but I hosted the Python scripts I used to download the data on the accompanying GitHub repository [3].
[1]: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...
[2]: https://rda.ucar.edu/datasets/ds633-0/
[3]: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...
I am curious: if we query the data for the temperature at a given time across all latitudes and longitudes and plot it geospatially, would the result tell us anything about the distribution of heat energy received across latitude and longitude at that point in time?
A quick google turned up this website which shows the trend in annual means: https://www.meteoblue.com/en/climate-change/durban_south-afr...
Still, you can see signs: the maximum temperature until 1990 or so seems to be around 35 degrees; since then there are several peaks above that value, and in 2016 (?) it looks to be 38-39. It's certainly less visible in the low peaks, because the absolute lowest values appear to be in the 1990-2000 decade, but then again, the minimums in all years of the 2010-2020 decade seem to be slightly higher than the minimum temperature in any other decade.
That said, there is massive downscaling involved at such a scale, so I wouldn't be too surprised if some details were just skipped and not visible. I wouldn't trust this interpretation much - if it needs to be a visualization, I'd rather plot a moving average with a window of at least 6 months (or even 1 year to rule out seasonality entirely) and see if that has an upward trend or not (I bet it does).
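Something like this would be enough to eyeball it (the file and column names are assumed; daily values in a pandas DataFrame):

    import pandas as pd

    # Hypothetical daily series with a date index and a 'temperature' column
    df = pd.read_csv("durban_daily.csv", parse_dates=["date"], index_col="date")

    # A 365-day centered moving average smooths out the seasonal cycle entirely
    smoothed = df["temperature"].rolling(window=365, center=True).mean()
    smoothed.plot()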
[EDIT] I now see the post below with the yearly averages since 1979. It does indeed seem that 1995-1997 were abnormally cold years, and also that 2010-2020 is the warmest decade since then (and likely since quite a bit longer). So the outlier analysis above seems to stand :-)
For temperature and a few other variables, it shows historical data from the EU Copernicus service (C3S) along with three different projected series out to 2100.
For CO2, it shows the latest historical data.
The charts are concerning and I am sure my co-workers are not hell bent on faking data to scare people just to get more funding; they work too much and go to too many meetings.
Plus weather data tends to be quite noisy; just think how variable the weather can be day-to-day, and we're squishing 80 years of that into one plot. Also worth noting that different places may experience climate change differently. Some places may see the average temperature go up, some maybe only in the summer, so you'll have to look at averages. Some places may see more extreme summer highs, so then you can't just look at averages but at the average extremes or the tail end of the temperature distribution.
So it'll be hard to discern any climate change from just a cursory glance. I'm not saying it's there, just that it requires more analysis.
Million - million
Milliarde - billion
Billion - trillion
I was very surprised to learn about this the first time. In USA English:
10^3 -> thousand
10^6 -> million
10^9 -> billion
10^12 -> trillion
10^15 -> quadrillion
10^18 -> quintillion
But there's another scale too! See this Wiki article for more
One advantage of the short scale is that its cadence is the same as that of SI units: million is mega, billion is giga, trillion is tera, and so on. The long scale of course has the -ard names, so this isn't exactly a problem, any more than French saying "four twenties and nineteen" for ninety-nine is a problem.
The difference is one of the major reasons the SI scale names exist in the first place, in fact.
It also matches the decimal separators used by everyone but Indians, which strikes me as more logical.
What's the advantage of the long scale? Just historical?
But really, each is just a quirk of languages. I don't expect 7 to sound like "seven" in other languages, why expect billion to sound like "billion" rather than "milliard" or "trillion"? When conveying information across languages, we use numerals, sometimes with the SI scales, which are universal.
Just another confusing faux amis to figure out when learning another language, really.
> You can use parameterized queries to protect against SQL injection,
No, that's not what parameterized queries are for. That's just a side benefit.
Parameterized queries exist so that the database doesn't have to parse the data out of the SQL to run the query. It's stupid to turn all the data into text, send the text over the wire, escaped and quoted as appropriate, then parse it back to whatever form it was in originally, when you can just send it in binary straight to the database.
And if you are doing many similar queries with different results, e.g. inserts as here, or maybe querying the user table by user id every page load, the database doesn't have to parse any SQL each time and can even reuse the query plan.
This may be why psycopg3 performed better than pandas' df.to_sql() function in the single-row insert case.
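As a rough illustration with psycopg 3 (table name is a placeholder): when the same statement is executed repeatedly, psycopg can switch to a server-side prepared statement, so the SQL is parsed and planned once and only the parameters travel over the wire:

    import psycopg

    rows = [(i, 20.0 + i * 0.1) for i in range(10_000)]

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        # Same statement, different parameters; psycopg reuses a server-side
        # prepared statement for repeated executions (prepare_threshold, default 5)
        cur.executemany("INSERT INTO obs (id, temperature) VALUES (%s, %s)", rows)
        conn.commit()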
Transforming data and then transforming it back will always be stupid and costly – in more ways than one.
A database engine could easily mitigate this by re-planning queries every 1000 runs, or by having the query plan expire after 60 seconds or something. It might be worth trying to re-prepare the query from time to time as part of this kind of bulk insertion benchmarking.
The other thing that would be good to try is closing the transaction every 1000 inserts or some other number. Doing several inserts in a single transaction is certainly better than 1 insert per transaction, but there may come a point where the transaction being too big starts to slow things down.
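A sketch of that kind of experiment (the batch size and table are arbitrary, and the row generator is hypothetical):

    import psycopg

    BATCH = 1000  # commit every 1000 inserts; worth sweeping this value

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        for i, row in enumerate(generate_rows()):  # hypothetical row source
            cur.execute("INSERT INTO obs (id, temperature) VALUES (%s, %s)", row)
            if (i + 1) % BATCH == 0:
                conn.commit()
        conn.commit()  # flush the final partial batch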
1. At least effectively - this is usually managed via persisting handles or the plans themselves.
2. Which is usually a bad code smell from ORM overuse. Every time you dispatch a query you're paying latency overhead - so in an ideal world your report delivery will involve a single query against the database per report request (whether this is a website and those requests are page loads or you're running some internal system for business reporting).