Using the TPDE codegen back end in LLVM ORC(weliveindetail.github.io)
21 points by weliveindetail 10 hours ago | 2 comments
aengelke 4 hours ago
TPDE co-author here. Nice work; this was easier than expected, so we'll have better upstream ORC support soon [1].

The benchmark is suboptimal in multiple ways:

- Multi-threading just makes things slower. When multi-threading is enabled, LLJIT clones every module into a new context before compilation, and the cloning is much more expensive than the compilation itself. There's also no way to disable this cloning. It causes a ~1.5x (LLVM)/~6.5x (TPDE) slowdown (very rough measurement on my laptop).

- The benchmark compares against the optimizing LLVM back-end, not the non-optimizing one, which would be a fairer comparison (code: JTMB.setCodeGenOptLevel(CodeGenOptLevel::None);). Additionally, enabling FastISel helps (command line -fast-isel; setting the TargetOption EnableFastISel seems to have no effect). This gives LLVM a 1.6x speedup.

- The benchmark is not really representative, as it causes FastISel fallbacks to SelectionDAG in some very large basic blocks -- i24 occurs rather rarely in real-world code. This is the reason why the speedup from the unoptimizing LLVM back-end is so low. Replacing i24 with i16 gives LLVM another 2.2x speedup. (Hint: to get information on FastISel fallbacks, enable FastISel and pass the command line options "-fast-isel-report-on-fallback -pass-remarks-missed=sdagisel" to LLVM. This is really valuable when optimizing for compile times.)
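Putting the points above together, here is a minimal sketch of the fairer LLJIT setup (assuming a recent LLVM; the helper name createBaselineJIT and the exact headers are my own choices, not from the post):

```cpp
#include "llvm/ExecutionEngine/Orc/JITTargetMachineBuilder.h"
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/Support/CodeGen.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;
using namespace llvm::orc;

// Hypothetical helper: builds an LLJIT configured for a fair -O0 comparison.
Expected<std::unique_ptr<LLJIT>> createBaselineJIT() {
  // FastISel and its fallback diagnostics are driven by cl::opt flags, so
  // they have to go through the command-line machinery rather than
  // TargetOptions (setting EnableFastISel alone appears to have no effect).
  const char *Args[] = {"jit", "-fast-isel",
                        "-fast-isel-report-on-fallback",
                        "-pass-remarks-missed=sdagisel"};
  cl::ParseCommandLineOptions(sizeof(Args) / sizeof(Args[0]), Args);

  auto JTMB = JITTargetMachineBuilder::detectHost();
  if (!JTMB)
    return JTMB.takeError();
  // Compare against the non-optimizing back-end, not the default (-O2).
  JTMB->setCodeGenOptLevel(CodeGenOptLevel::None);

  return LLJITBuilder()
      .setJITTargetMachineBuilder(std::move(*JTMB))
      // Leave NumCompileThreads at its default of 0: with compile threads
      // enabled, LLJIT clones every module into a fresh context, and that
      // cloning dominates compile time.
      .create();
}
```

This is a configuration sketch, not a complete program; it needs LLVM headers/libraries to build, and the enum spelling (CodeGenOptLevel vs. the older CodeGenOpt::Level) varies across LLVM versions.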

So we get ~140ms (TPDE), ~730ms (LLVM -O0), or 5.2x improvement. This is nowhere near the 10-20x speedup that TPDE typically achieves. Why? The new bottleneck is JITLink, which is featureful but slow -- profiling indicates that it consumes ~55% of the TPDE "compile time" (so the net compile time speedup is ~10x). TPDE therefore ships its own JIT mapper, which has fewer features but is much faster.

LLVM is really powerful, but it's not particularly fast, and the JIT API makes it extremely difficult not to make it even slower than necessary, even for LLVM experts.

[1]: https://github.com/tpde2/tpde/commit/29bcf1841c572fcdc75dd61...

weliveindetail 2 hours ago
Please note that the post didn't mention the word "benchmark" a single time ;) It does a "basic performance measurement" of "our csmith example". Anyway, thanks for your notes, they are very welcome and valid.

Comparing TPDE against the default optimization level in ORC is not fair (because that is indeed -O2), but it's what we get off-the-shelf. I tested the explicit FastISel setting and, as you said, it didn't help on the LLVM side. I didn't try the command-line option though, thanks for the tip! (The -pass-remarks-missed flag in particular will be useful.)

And yeah, csmith doesn't really generate representative code, but again, the post never claimed otherwise. I didn't dive into JITLink as it would be a whole post on its own, but yes, feature-completeness prevailed over performance here as well -- that seems characteristic for LLVM and isn't so surprising :)

Last but not least, yes, multi-threading isn't working as well as the post indicates. This seems related to the fix that JuliaLang made for the TaskDispatcher [1]. I will correct this in the post and see which other points can be addressed in the repo.

Looking forward to your OrcCompileLayer in TPDE!

[1] https://github.com/JuliaLang/julia/pull/58950

jasonjmcghee 5 hours ago
It's really exciting to continue to see these developments.

Would love to hear your perspective on Cranelift too.