(a) You mention that the NVidia docs push people to use libraries, etc. to really get high performance CUDA kernels, rather than writing them themselves. My argument would be that SIMD is exactly the same - they're something really that are perfect if you're writing a BLAS implementation but are too low level for most developers thinking about a problem to make use of.
(b) You show a problem where autovectorisation fails because of branching, and jump straight to intrinsics as the solution which you basically say are ugly. Looking at the intrinsic code, you're using a mask to deal with the branching. But there's a middle ground - almost always you would want to try restructuring the problem, e.g. splitting up loops and adding masks where there's conditions - i.e. lean into the SIMD paradigm. This would also be the same advice in CUDA.
(c) As you've found, GCC actually performs quite poorly for x86-64 optimisations compared to Intel. It's not always clear cut though, the Intel Compiler for e.g. sacrifices IEEE 764 float precision and go down to ~14 digits of precision in it's defaults, because it sets the flag `-fp-model=fast -fma`. This is true of both the legacy and new Intel compiler. If you switch to `-fp-model=strict` then you may find that the results are closer.
(d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.
To re-iterate, this is our observation as well. The first AVX512 processors would execute such code quite fast for a short time, then overheat and throttle, leading to a worse wall-time performance than the corresponding AVX256 code.
I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.
I just treat autovectorization like I treat every other fancy optimization that's not just constant folding and inlining: nice when it happens to work out, it probably happens to make my code a couple percent faster on average, but I absolutely can't rely on it in any place where I depend on the performance.
I just used to fire up VTune and inspect the hot loops... typically if you care about this you're only really working on hardware targeting the latest instruction sets anyway in my experience. It's only if you're working on low level libraries I would bother doing intrinsics all over the place.
For most consumer software you want to be able to fall back to some lowest-common-denominator hardware anyway otherwise people using it run into issues - same reason that Debian, Conda, etc. only go up to really old instructions sets.
If you wish to see some speedups using AVX512, without limiting yourself to C or C++, you might want to try ISPC (https://ispc.github.io/index.html).
You'll get sane aliasing rules from the perspective of performance, multi-target binaries with dynamic dispatching and a lot more control over the code generated.
See for instance https://github.com/ispc/ispc/pull/2160
Not if the data is small and in cache.
> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.
I think the best way to do this is duplicate sum_r and count 16 times, so each pane has a seperate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.
The problem is that not all programming languages expose SIMD, and even if they do it is only a portable subset, additionally the kind of skills that are required to be able to use SIMD properly isn't something everyone is confortable doing.
I certainly am not, still managed to get around with MMX and early SSE, can manage shading languages, and that is about it.
See also https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...
Use C as a common platform denominator without crazy optimizations (like tcc). If you need performance, specialize, C gives you the tools to call assembly (or use compiler some intrinsic or even inline assembly).
Complex compiler doing crazy optimizations, in my opinion, is not worth it.
For these optimisations that are in the back-end, they are used for other languages that can be higher-level or that cannot drop to assembler as easily. C is just one of the front-ends of modern compiler suites.
Would be interesting to see if auto vec performs better with that addition.
auto aligned_p = std::assume_aligned<16>(p)
Reminded me of way back before OpenGL 2.0, and I was trying to get Vertex Buffer Objects working in my Delphi program using my NVIDIA graphics card. However it kept crashing occasionally, and I just couldn't figure out why.
I've forgotten a lot of the details, but either the exception message didn't make sense or I didn't understand it.
Anyway, after bashing my head for a while I had an epiphany of sorts. NVIDIA liked speed, vertices had to be manipulated before uploading to the GPU, maybe the driver used aligned SIMD instructions and relied on the default alignment of the C memory allocator?
In Delphi the default memory allocator at the time only did 4 byte aligned allocations, and so I searched and found that Microsoft's malloc indeed was default aligned to 16 bytes. However the OpenGL standard and VBO extension didn't say anything about alignment...
Manually aligned the buffers and voila, the crashes stopped. Good times.
No reason for the compiler to balk at vectorizing unaligned data these days.
Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...
A great example would be a convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a +- N element neighborhood around a pixel. By definition most of those reads will be unaligned!
A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is constrained by bandwidth over compute, I have found it to be worth doing a head/body/tail situation where you do one misaligned iteration before doing the bulk of the work in alignment, but honestly for that to be worth it you have to be working almost completely out of L1 cache which is rare... otherwise you're going to be slowed down to L2 or memory speed anyways, at which point the half rate penalty doesn't really matter.
The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.
There are lots of people using Javascript frameworks to build slow desktop and mobile software.