I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?
For server gear it’s more common to have less dynamic power and voltage switching because it produces more predictable performance and latency.
Hardware is different. Every operation that a chip can perform in hardware needs dedicated circuitry. Special-casing 0 and 1 means adding at least an OR reduction on each operand and a dedicated multiplexer for every bit of the output. Those transistors use power even when they're not in use (leakage power is a huge issue on modern semiconductor processes). They also degrade timing by adding more gates on critical paths through the multipliers. (The timing issue here is that all operations that happen between one flip-flop and the next need to finish within one clock cycle.) And unless there are whole blocks of 0s and 1s (this does happen in certain neural networks), you typically won't see a direct speedup anyway. In software terms, the matrix multiply is scheduled as many parallel operations, and skipping a few operations in some "threads" does not speed up the whole by much.
All of this makes zero skipping a nontrivial topic. People do still try to do it but it needs serious consideration as, depending on the application, the case is rarely one-sided.
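To make the "whole blocks" point concrete, here is a toy model, not a description of any real GPU. It assumes lanes run in lockstep, so a step costs as much as its slowest lane. The function name and the cycle costs are made up for illustration.

```python
def lockstep_cycles(operands, lanes, skip_zeros):
    """Count cycles to process `operands` on `lanes` lockstep lanes.

    Toy assumption: a multiply costs 1 cycle and a skipped zero costs 0,
    but every step takes as long as its slowest lane.
    """
    cycles = 0
    for start in range(0, len(operands), lanes):
        group = operands[start:start + lanes]
        costs = [0 if (skip_zeros and x == 0) else 1 for x in group]
        cycles += max(costs)  # the step finishes when the slowest lane does
    return cycles


# Same number of zeros in both inputs, only the placement differs.
scattered = [0, 3, 0, 5, 7, 0, 2, 0]  # zeros spread across both steps
blocked = [0, 0, 0, 0, 3, 5, 7, 2]    # one whole step is all zeros

baseline = lockstep_cycles(scattered, lanes=4, skip_zeros=False)
scattered_skip = lockstep_cycles(scattered, lanes=4, skip_zeros=True)
blocked_skip = lockstep_cycles(blocked, lanes=4, skip_zeros=True)
```

In this model the scattered zeros save nothing, because every step still has at least one lane doing real work. Only the input with a whole zero block finishes sooner.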
[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...
GPUs do branch prediction? I thought they didn't bother, and instead tried to minimize wasted effort by running large numbers of concurrent threads?
There was a workshop paper at SC24 that ran more experiments on this, I believe. I can't find it now though.
I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.
Power limiting does not improve performance but it does improve efficiency. You might be able to get 90% of the performance for only 70% of the power usage, for example. It does not make the card go faster though.
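Working through that example: the 90% and 70% figures are the illustrative ones from the comment, not measurements, and the function name is just for this sketch.

```python
def perf_per_watt_gain(perf_fraction, power_fraction):
    """Relative change in performance per watt versus the unlimited card."""
    return perf_fraction / power_fraction


perf_fraction = 0.90   # fraction of full performance (example figure)
power_fraction = 0.70  # fraction of full power draw (example figure)

gain = perf_per_watt_gain(perf_fraction, power_fraction)

# A fixed-size job takes 1 / 0.90 as long while drawing 0.70 of the power,
# so its total energy is 0.70 / 0.90 of the original.
relative_runtime = 1.0 / perf_fraction
relative_energy = power_fraction * relative_runtime

print(f"perf/W gain: {gain:.3f}x")
print(f"runtime: {relative_runtime:.3f}x, energy per job: {relative_energy:.3f}x")
```

So the job takes about 11% longer but uses roughly 22% less energy. That is better efficiency, not more speed.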
This is precisely because of efficiency. Running at the higher speed is less efficient, which triggers a much bigger drop in performance sooner.
This is not true unless the throttling algorithm is so broken that it's oscillating between extremes.
The parts have a curve of clock speed versus voltage. A higher clock speed means higher performance, but it also sits further up the voltage curve, which means more power.
Throttling just moves the card further down the voltage-to-clock-speed curve. It reduces clock speed, which reduces performance.
The cards don't "perform faster by running slower". If you run the card slower, it performs slower.
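A rough sketch of that curve, using the standard first-order model where dynamic power scales with V² · f. The linear voltage curve and its constants are invented for illustration, and leakage is ignored.

```python
def voltage_at(clock_ghz, v_base=0.70, v_per_ghz=0.20):
    """Hypothetical voltage/frequency curve: voltage rises linearly with clock."""
    return v_base + v_per_ghz * clock_ghz


def relative_power(clock_ghz):
    """Dynamic power in arbitrary units, proportional to V^2 * f."""
    v = voltage_at(clock_ghz)
    return v * v * clock_ghz


clocks_ghz = [1.0, 1.5, 2.0]
# Assume performance scales with clock, so perf/W is clock divided by power.
powers = [relative_power(f) for f in clocks_ghz]
perf_per_watt = [f / p for f, p in zip(clocks_ghz, powers)]

for f, p, e in zip(clocks_ghz, powers, perf_per_watt):
    print(f"{f:.1f} GHz  power={p:.2f}  perf/W={e:.2f}")
```

Under these assumptions performance only ever goes up with clock speed. Power goes up faster, so efficiency drops, but a lower clock is never the faster one.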
~257.5 teraflops for normal distribution, versus ~268 teraflops uniform, reported on the first graph. That is a gap of about 4%.
I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.
And for actual runs, pick it from a pre-run sampled curve.
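Something like this is what I have in mind. The throughput numbers are placeholders made up to show the shape of the idea, not measurements from any card, and `peak_clock` is a hypothetical helper.

```python
def peak_clock(curve):
    """Return the (clock_mhz, throughput) pair with the highest throughput."""
    return max(curve.items(), key=lambda kv: kv[1])


# Placeholder data only: clock in MHz -> relative throughput, per data type.
# A real curve would come from a short sampling pass before the actual run.
sampled = {
    "uniform": {1200: 1.00, 1400: 1.10, 1600: 1.05},
    "normal": {1200: 1.00, 1400: 1.08, 1600: 1.12},
}

# Pick one clock per data distribution from its own sampled curve.
chosen = {kind: peak_clock(curve)[0] for kind, curve in sampled.items()}
print(chosen)
```

The point is only that the choice is made per data distribution, from a curve measured on that same distribution.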
https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...
https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...