Very interesting. It sounds like tuning at the PTX level can improve workload efficiency. The DeepSeek folks, for example, write: "Specifically, we employ customized PTX (Parallel Thread Execution) instructions"
https://arxiv.org/abs/2412.19437.
It’s really not true anymore that PTX is forward compatible. A subset is, but the new, interesting interfaces that have been added are not forward compatible and change with each microarchitectural revision. Those are most of the reason you’d drop down to PTX anyway; otherwise compilers are fairly good these days, and you’ll rarely see PTX unless you’re profiling.
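To make the split between the forward-compatible subset and the architecture-specific parts concrete, here is a minimal Python sketch of the rule as I understand it from NVIDIA's documentation: PTX built for a baseline target (for example `compute_80`) can be JIT-compiled for devices of that compute capability or newer, while PTX built for an architecture-specific target with an `a` suffix (for example `sm_90a`) only runs on exactly that compute capability. The function name `can_jit` is made up for illustration, and other suffixes are deliberately not modeled.

```python
import re

def can_jit(target: str, device_cc: tuple[int, int]) -> bool:
    """Hypothetical helper: can PTX built for `target` be JIT-compiled for a
    device with compute capability `device_cc` = (major, minor)?

    Assumption: baseline targets are forward compatible, 'a'-suffixed
    targets require an exact compute-capability match.
    """
    m = re.fullmatch(r"(?:sm|compute)_(\d+)([a-z]?)", target)
    if m is None:
        raise ValueError(f"unrecognized target: {target!r}")
    number, suffix = int(m.group(1)), m.group(2)
    # The last digit is the minor version: 86 -> (8, 6), 100 -> (10, 0).
    target_cc = (number // 10, number % 10)
    if suffix == "a":
        return device_cc == target_cc      # architecture-specific: exact match only
    if suffix == "":
        return device_cc >= target_cc      # baseline: same or newer device
    raise ValueError(f"suffix {suffix!r} is not modeled in this sketch")

print(can_jit("compute_80", (9, 0)))   # baseline PTX on a newer device
print(can_jit("sm_90a", (10, 0)))      # arch-specific PTX on a newer device
```

This only models the PTX target check. Whether a given driver or toolkit version still supports a particular GPU is a separate question that the sketch does not cover.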
Is this analogy valid: writing PTX is like writing assembly instead of a higher-level language (C, C++, Rust, etc.) for CPU code? That is, the higher-level code normally compiles down to it, but you can do optimizations by going lower?
For context, as the opening paragraph of the article describes, I generate PTX code regularly but have no idea what the actual code in the PTX file means!
I'm curious about the forward compatibility the article goes into. I only experience that up to a point: code compiled on CUDA 12 does not seem to work on machines with CUDA 13.
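For anyone else generating PTX without ever reading it, the first few lines of the file are a reasonable place to start: `.version` is the PTX ISA version, `.target` is the architecture the code was generated for, and `.entry` marks a kernel. Below is a minimal Python sketch that pulls those out. The embedded PTX is hand-written for illustration (not real compiler output), and `read_ptx_header` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re

# Hand-written PTX for illustration only: each thread adds 1.0 to one float.
PTX_SAMPLE = """\
.version 8.0
.target sm_90a
.address_size 64

.visible .entry add_one(
    .param .u64 add_one_param_0
)
{
    .reg .b32 %r<3>;
    .reg .b64 %rd<5>;
    .reg .f32 %f<3>;

    ld.param.u64        %rd1, [add_one_param_0];  // load the pointer argument
    cvta.to.global.u64  %rd2, %rd1;               // convert to a global address
    mov.u32             %r1, %tid.x;              // thread index within the block
    mul.wide.u32        %rd3, %r1, 4;             // byte offset = tid * sizeof(float)
    add.s64             %rd4, %rd2, %rd3;         // address of this thread's element
    ld.global.f32       %f1, [%rd4];              // read the element
    add.f32             %f2, %f1, 0f3F800000;     // add 1.0 (hex float literal)
    st.global.f32       [%rd4], %f2;              // write it back
    ret;
}
"""

def read_ptx_header(ptx: str) -> dict:
    """Extract module-level directives and kernel names from PTX text."""
    def directive(name: str):
        # Match a line such as ".target sm_90a" and return its first operand.
        m = re.search(rf"^\s*\.{name}\s+(\S+)", ptx, flags=re.MULTILINE)
        return m.group(1).rstrip(",") if m else None

    return {
        "version": directive("version"),
        "target": directive("target"),
        "address_size": directive("address_size"),
        "kernels": re.findall(r"\.entry\s+([A-Za-z_$][\w$]*)", ptx),
    }

header = read_ptx_header(PTX_SAMPLE)
print(header)
```

Running the same function on a real `.ptx` file should show which target it was built for, which is useful to know before expecting it to load on a different GPU.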