Basically the curl of asset pipelines ;)
// One-byte case for SLEB128 (7 payload bits, bit 6 is the sign bit)
int64_t from_signext(uint64_t v) { return v < 64 ? (int64_t)v : (int64_t)v - 128; }
// One-byte case for ULEB128 with zig-zag encoding
int64_t from_zigzag(uint64_t z) { return (z >> 1) ^ -(z & 1); }
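For anyone who wants to poke at the two decoders, here is a minimal round-trip sketch in C. The names zigzag_encode, zigzag_decode and sleb128_decode_byte are made up for illustration, and it only covers the one-byte SLEB128 case (continuation bit clear), not a full varint reader.

```c
#include <assert.h>
#include <stdint.h>

/* Zig-zag encode: interleaves negative and non-negative values so that
   small magnitudes get small codes: 0, -1, 1, -2 -> 0, 1, 2, 3. */
uint64_t zigzag_encode(int64_t v) {
    uint64_t sign_mask = v < 0 ? UINT64_MAX : 0;
    return ((uint64_t)v << 1) ^ sign_mask;
}

/* Zig-zag decode: the inverse of zigzag_encode. */
int64_t zigzag_decode(uint64_t z) {
    return (int64_t)((z >> 1) ^ (0 - (z & 1)));
}

/* One-byte SLEB128 (hypothetical helper): 7 payload bits, where bit 6
   is the sign bit. Assumes the continuation bit (bit 7) is clear. */
int64_t sleb128_decode_byte(uint8_t b) {
    assert((b & 0x80) == 0);
    int64_t v = b & 0x7f;
    return v < 64 ? v : v - 128;
}
```

With these definitions zigzag_encode(-3) gives 5, and sleb128_decode_byte(0x7f) gives -1.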
Now why can't compilers do this sort of thing automatically?
Almost any problem seems possible to speed up 1000x with AVX-512 plus days of thought, compared to the naive version written as a Python loop. If we could automate that whole process for big codebases, the performance gains could be huge.
I often wonder about a macro-like thing where we could write a function using a subset of the language that’s SIMD-aware. A bit higher level than using intrinsics or those SIMD libs.
One example is Java, which will happily vectorize your code into AVX or SSE where possible.
Python just got a JIT compiler and we’ll start seeing the same thing soon.
But as someone else said here, some constructs don’t translate well, and adding transformations to make vectorization possible would negate the performance gains.
Sad that the compiler (even Java's) can’t explain this to you and warn about it, but now with LLMs, maybe they’ll start doing things like that soon.
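To make "some constructs don't translate well" concrete, here is a rough C sketch of two loops. The function names are made up. The first has independent iterations and is the kind of loop optimizing compilers usually vectorize. The second has a loop-carried dependency, which normally blocks straightforward vectorization. As far as I know, GCC and Clang can report on this via -fopt-info-vec-missed and -Rpass-missed=loop-vectorize respectively, though how readable the output is varies.

```c
#include <stddef.h>

/* Independent iterations: dst[i] depends only on a[i] and b[i].
   'restrict' promises the compiler that the arrays do not overlap,
   which removes one common obstacle to auto-vectorization. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] * k + b[i];
    }
}

/* Loop-carried dependency: x[i] needs the already-updated x[i - 1],
   so the iterations cannot simply run side by side in SIMD lanes. */
void prefix_sum(float *x, size_t n) {
    for (size_t i = 1; i < n; i++) {
        x[i] += x[i - 1];
    }
}
```

A prefix sum can still be vectorized with a different algorithm (a parallel scan), but that is a restructuring the programmer usually has to do by hand.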
Zig has MultiArrayList in the stdlib which does the SoA transform via comptime:
https://ziglang.org/documentation/master/std/#std.multi_arra...
Zig also sorts struct members by size/alignment, but has two escape hatches ('extern struct' which is for C compatibility, and 'packed struct' which offers an explicit bit-by-bit memory layout).
AFAIK Odin and Jai offer the SoA transform as specialized language features, e.g. in Odin:
https://odin-lang.org/docs/overview/#soa-data-types
I'd still always want such data layout transforms as an explicit language feature though, not the compiler making this decision for me.
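For readers who haven't seen the SoA transform written out, this is roughly what it looks like done by hand in C. The particle type and function names are invented for illustration, and this is not what Zig, Odin or Jai generate, just the general shape of the layout change.

```c
#include <stddef.h>

/* Array of structs (AoS): all fields of one particle sit together. */
typedef struct {
    float x, y, z;
    float mass;
} ParticleAoS;

/* Struct of arrays (SoA): each field gets its own contiguous array. */
typedef struct {
    float *x, *y, *z;
    float *mass;
    size_t len;
} ParticlesSoA;

/* AoS: reading only 'mass' still strides over x, y and z in memory. */
float total_mass_aos(const ParticleAoS *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += p[i].mass;
    }
    return sum;
}

/* SoA: the same reduction walks one dense float array. */
float total_mass_soa(const ParticlesSoA *p) {
    float sum = 0.0f;
    for (size_t i = 0; i < p->len; i++) {
        sum += p->mass[i];
    }
    return sum;
}
```

The SoA version touches only the memory it needs, which is friendlier to both the cache and SIMD loads. The cost is that code working on one whole particle now has to gather from several arrays, which is one reason to keep the choice explicit.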
I wonder if Futhark does? Eg https://futhark-lang.org/student-projects/pedersen-nelin-msc...
Out of this 1000x speedup you get 100x by just not using python though ;)
Also IIRC the main problem specifically with AVX512 was that mainstream CPUs simply didn't have it, so a smart compiler won't be of much use when the output code only runs on a handful of devices.
They do - they just can't assume GFNI instructions are present unless you explicitly say so: https://godbolt.org/z/eYasbKsse
Because they are not query compilers, i.e. they don't know the data.
For example, a query compiler could swap an index lookup for a full scan because it "sees" (through runtime statistics) that the data doesn't benefit from the index.
On the other hand, an optimization here can be a pessimization there. So optimizers in general have to be very conservative, because of butterfly effects!
a pragmatic approach: write in a high level interpreted language that rhymes with modern CPUs, vector extensions, memory bandwidth
e.g. apl [0], bqn [1], k [2], kiwi [3]
- vectors are dense (not boxed)
- optimized internal representation (e.g. bitpacked bool vectors)
- primitives act on vectors + use avx, neon if possible
[0] https://www.dyalog.com
[1] https://mlochbaum.github.io/BQN/
[2] https://kx.com
[3] https://kiwilang.com
great article by marshall on BQN performance compared to C and how to think about it
https://mlochbaum.github.io/BQN/implementation/versusc.html
related:
- columnar databases: kdb, duckdb, clickhouse
- machine learning frameworks: pytorch, keras, jax, mlx
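As a rough illustration of the "bitpacked bool vectors" point above, here is a small C sketch of a boolean vector stored 64 flags per word, with a count that works a whole word at a time. BitVec and the bitvec_* functions are hypothetical names, not taken from any of the listed languages' implementations, and the count assumes unused tail bits in the last word stay zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A boolean vector stored as 64 flags per machine word. */
typedef struct {
    uint64_t *words;   /* caller-owned storage, (nbits + 63) / 64 words */
    size_t nbits;
} BitVec;

void bitvec_set(BitVec *v, size_t i, int value) {
    assert(i < v->nbits);
    uint64_t mask = (uint64_t)1 << (i % 64);
    if (value) {
        v->words[i / 64] |= mask;
    } else {
        v->words[i / 64] &= ~mask;
    }
}

int bitvec_get(const BitVec *v, size_t i) {
    assert(i < v->nbits);
    return (int)((v->words[i / 64] >> (i % 64)) & 1);
}

/* Portable SWAR population count of one 64-bit word. */
static unsigned popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Counts set flags 64 at a time instead of one boxed bool at a time. */
size_t bitvec_count(const BitVec *v) {
    size_t nwords = (v->nbits + 63) / 64;
    size_t total = 0;
    for (size_t w = 0; w < nwords; w++) {
        total += popcount64(v->words[w]);
    }
    return total;
}
```

A vector of 100 booleans fits in two words here, and counting the true entries is two popcounts rather than 100 separate tests.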