When Compiler Optimizations Hurt Performance(nemanjatrifunovic.substack.com)
44 points by rbanffy 6 days ago | 5 comments
jandrewrogers 2 hours ago
I’ve seen many examples of a similar phenomenon. Modern compilers are impressively good at generating bit-twiddling micro-optimizations from idiomatic code. They are also good at larger scale structural macro-optimization. However, there is a No Man’s Land for compiler optimization between micro-optimization and macro-optimization where the effectiveness of compiler optimizations are much less reliable.

Intuitively I understand why it is a hard problem. Micro-optimizations have deterministic properties that are simple enough that optimality is all but certain. Macro-optimization heuristics create incidental minor pessimization effects that are buried below the noise floor by major optimization effects on average.

In the middle are optimizations that are too complex to guarantee effective optimization but too small in effect to offset any incidental pessimization. It is easy to inadvertently make things worse in these cases. Most cases of surprising performance variances I see fall into this gap.

It is also where the codegen from different compilers seems to disagree the most, which lends evidence to the idea that the “correct” codegen is far from obvious to a compiler.

algo_trader 56 minutes ago
> also good at larger scale structural macro-optimization

Can u give some examples?

Perhaps you mean small granular choices which occur widely throughout the code?

viraptor 30 minutes ago
Automatic unrolling + inlining + constant hoisting (+ vectorisation)? I'd call that macro, but maybe op had something different in mind.
president_zippy 2 hours ago
If you wanna really see this at work on a whole other extreme, try compiling code for an N64 game. It's no surprise that optimizations for modern-day x86_64 and Arm64 processors with a lot of cache would not generalize well to a MIPS CPU with a cache that must be manipulated at the software layer and abysmal RDRAM latency, but the exact mechanics of it are interesting.

KazeEmmanuar did a great job analyzing exactly this so we don't have to!

https://www.youtube.com/watch?v=Ca1hHC2EctY

1 hour ago
userbinator 2 hours ago
ARM64 has many registers but I believe the lack of suitably large immediate values and, apparently, compilers that are willing to use them all across functions, puts it at a disadvantage here. Assuming we want the return value in eax and the leading count comes in cl, this can be done branchlessly and data-lessly on x86 as follows:

    mov eax, 0x00043201
    test cl, 8
    setz al
    shl cl, 2
    shr eax, cl
    and eax, 15
Something similar may be possible on ARM64, but I suspect it will definitely be more than 19 bytes ;-)
shiftingleft 1 hour ago
Shouldn't your snippet be using lzcnt? I can't see how this would result in the desired lookup.

for Zen5 rustc creates the following:

  utf8_sequence_length_lookup:
    shl edi, 24
    mov ecx, 274945
    not edi
    lzcnt eax, edi
    shl al, 2
    shrx rax, rcx, rax
    and al, 7
    ret
https://rust.godbolt.org/z/hz1eKjnaG
1 hour ago
Panzerschrek 2 hours ago
I think some people misunderstand how compiler optimizations work. it's not magic, which makes each piece of code faster, it's just a bunch of deterministic code transformation rules, which usually make code faster (considering large set of use-cases), but it's not proven that they always do so. So, by writing performant code one should always keep this in mind.
Someone 43 minutes ago
There are many optimizations that are guaranteed to make code run faster, but even then, that does not mean you should use them.

There can be competition between optimizations. At a given moment, the preconditions for both optimization O1 and optimization O2 may hold, but if you then apply O1, the precondition for O2 may no longer hold and give versa.

In such cases, the best solution is to pick the better of O1 and O2.

account42 17 minutes ago
One could even say that compiler optimizations are a set of deterministic code transformation rules that happen to make SPEC run faster. But programmers are not wrong to expect better from their compilers.
Panzerschrek 2 hours ago
It's a simple rule of thumb to avoid branching in performance-critical code. Branchless code is pretty predictable, but if branching takes place, one can no longer reason so easy about code performance, since compiler optimizations and CPU branch prediction may be involved.
account42 15 minutes ago
But reality is not that simple. Branchless with unfortunate data dependencies can be a lot worse than predictable branching code.