> it seemed quite natural to use the triple-dot forwarding syntax (...).
> Unfortunately I found that using ... was quite expensive
> This led me to implement an optimization for ... .
That’s some excellent yak shaving. And speeding up `...` in any language is good news even if allocation is not faster.
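For readers who haven't used it, here's a minimal sketch of the forwarding pattern in question (`new_like` is a hypothetical helper, not the actual CRuby implementation):

```ruby
# `...` forwards all positional arguments, keyword arguments, and the
# block in one go. A Class#new written in Ruby relies on this pattern.
def new_like(klass, ...)
  obj = klass.allocate
  obj.send(:initialize, ...) # forward everything on to #initialize
  obj
end

Point = Struct.new(:x, :y)
p new_like(Point, 1, 2) #=> #<struct Point x=1, y=2>
```

Naively, each hop through `...` can materialize the forwarded arguments, which is the cost the post describes optimizing away.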
And if so, will these YJIT features like Fast Allocations be brought to ZJIT?
ZJIT is a method-based JIT (the type of compiler traditionally taught in schools) whereas YJIT is a lazy basic block versioning (LBBV) compiler. We're using what we learned developing and deploying YJIT to build an even better JIT compiler. IOW we're going to fold some of YJIT's techniques into ZJIT.
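As a rough illustration of the difference (my own sketch of the concepts, not ZJIT or YJIT internals): a method JIT profiles and then compiles a whole method at once, while LBBV compiles one basic block at a time, lazily, specializing each block on the operand types it has actually seen.

```ruby
def norm(a, b)
  s = a + b # LBBV: generate a version of this block per observed type
            # pair, e.g. (Integer, Integer) or (Float, Float); a method
            # JIT would instead compile all of norm after profiling
  if s > 0  # the branch ends the block; each successor block is only
    s       # compiled when first reached, with types already known
  else
    -s
  end
end
```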
> And if so, will these YJIT features like Fast Allocations be brought to ZJIT?
It may not have been clear from the post, but this fast allocation strategy is actually implemented in the bytecode interpreter. You will get a speedup without using any JIT compiler. We've already ported this fast path to YJIT and are in the midst of implementing it in ZJIT.
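For anyone who wants to verify the interpreter-level win themselves, something along these lines should show it (illustrative only; benchmark-ips is a third-party gem, and the numbers will vary by machine and Ruby version):

```ruby
require "benchmark/ips" # gem install benchmark-ips

class Point
  def initialize(x, y)
    @x = x
    @y = y
  end
end

# Run the same script under two Ruby versions without --yjit to
# isolate the interpreter fast path from any JIT effects.
Benchmark.ips do |x|
  x.report("Point.new") { Point.new(1, 2) }
end
```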
I’m so glad to see your work, and it’s always such a treat to read any of your new posts. Hope to see upcoming ones more often!
I can imagine combining BBV-style type/branching info with register tracing would add a lot of complicated structures, but I thought the BBs of BBV were more or less analogous to SSA blocks, so is there no middle ground to be found? (Considering there may be many megamorphic sites in the standard library?)
They're pivoting (successfully?) to a more traditional approach: letting the interpreter profile the code first (to figure out the types) and THEN producing entire compiled methods with heavier optimizations that can do better register allocation.
The BBV approach is sane out of the box but kinda unfamiliar to many compiler writers (problems hiring?), and it probably hits some performance ceilings that can't be lifted without a lot of complexity.
The major question of which approach will win out comes down to this: how "monomorphic" or "polymorphic" is Ruby code in real life?
Monomorphic basically means that only one "real type" (from the compiler's point of view) will ever pass through a code path (and thus extra machinery to allow multiple types won't bring much benefit).
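A concrete (made-up) illustration of the distinction:

```ruby
# Monomorphic call site: `to_s` here only ever sees Integer receivers,
# so a single inline cache entry keeps hitting.
[1, 2, 3].map { |n| n.to_s }

# Polymorphic call site: this `to_s` sees Integer, String, and Symbol,
# so the cache must grow or fall back to a slower generic lookup.
[1, "two", :three].map { |v| v.to_s }
```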
I guess YJIT will always be faster in warmup, with a smaller increase in memory usage. ZJIT, being more traditional, should bring more speedup than YJIT.
But most of the speedup right now is still coming from rewriting C into Ruby.
At a quick glance, this statement seems backwards: shouldn't C always be faster? Or maybe I'm misunderstanding how the JIT truly works.
Crossing the Ruby -> C boundary means that a JIT compiler cannot optimize the code as much, because it cannot alter or inline the C methods. Counterintuitively, this means that rewriting (certain?) built-in methods in Ruby leads to performance gains when using YJIT. [2]
[1]: https://railsatscale.com/2023-08-29-ruby-outperforms-c/
[2]: https://jpcamara.com/2024/12/01/speeding-up-ruby.html
EDIT: Note that this isn't an inherent limit. You could write a JIT that could analyze the compiled C code too. It's just that it's much harder to do.
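A toy example of the visibility difference (`pred_ruby` is a made-up name; the real core method `Integer#pred` is implemented in C):

```ruby
class Integer
  # Defined in Ruby: the JIT can see this method's bytecode, inline it
  # at call sites, and then optimize across the old method boundary.
  def pred_ruby
    self - 1
  end
end

3.pred      # C implementation: the JIT can only emit a call to it
3.pred_ruby # Ruby implementation: a candidate for inlining
```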
E.g. TruffleRuby is fast in part because it will do things like try to avoid method calls for built-in types where the standard operations haven't been overridden, but that requires a lot of extra machinery...
So I'm not sure how much compiling to C would help for gems that use C to speed things up.
I think maybe an easier target would be to compile C to a slightly augmented Ruby bytecode. If you control the C compiler you could do things like make C code follow the Ruby calling convention until/unless calling external C code, and avoid a lot of stack overhead.
However they decided it was more useful as a commercial product.
This is fascinating to me. I always assumed C had everything in the language that was needed for the compiler to use. In other words, the compiler may have a lot to work through, but the pieces are all available. But this makes it sound like JIT'd functions provide more info to the compiler (more pieces to work with). Is there another language besides C that does have language features to indicate to the compiler how to make things as performant as possible?
It's not necessarily that C doesn't have enough information; it's just that the JIT can reason about Ruby code better than it can about C code. To the JIT, C code is just some function that does things, and the only thing the JIT can do with it is call it.
On the other hand, a Ruby method's bytecode is available to the JIT, so if it sees fit, it can copy the method body into the call site and eliminate the function call overhead. Further, after the inlining, it can apply a lot of additional optimizations across what was previously a function boundary.
In theory, you could have a way to "compile" the C intrinsics into the JIT's IR directly and that would also give you similar results.
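To make the inlining point concrete, here is a conceptual before/after (hand-written, not actual JIT output):

```ruby
def double(n) = n * 2
def f(a) = double(a) + 1

# After inlining, the JIT effectively treats f as if it were written:
def f_inlined(a) = (a * 2) + 1
# Type checks, constant folding, etc. can now span what used to be
# the f -> double call boundary.
```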
and the features are there when they're there
That’s dangerous thinking, because constructor calls will follow a bimodal distribution.
Either a graph of calls or objects will contain a large number of unique objects, layers of alternating objects, or a lot of one type of object. Any map function, for instance, will tend to return a bunch of the same type of object. When the median and the mean diverge like this, your thinking about perf gets muddy. An inline cache will make bulk allocations in list comprehensions faster. It won’t make creating DAGs faster. One is better than none.
Yes, but if it ends up creating any ephemeral objects in the process of producing those returned objects, then the allocation sequence is still not homogeneous. In Ruby, according to the article, even calling a constructor with named arguments allocates, so it's very easy to still end up cycling through allocations of different types.
At the same time, the callsite for any given `.new()` invocation will almost always be creating an instance of the exact same class. The target expression is nearly always just a constant name. That makes it a prime candidate for good inline caching at those callsites.
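In other words (hypothetical classes):

```ruby
# Each textual `.new` call site names a constant receiver, so each
# site is effectively monomorphic even if the program as a whole
# allocates objects of many different classes.
user  = User.new("Ada")    # this site only ever builds Users
event = Event.new(:signup) # this site only ever builds Events
```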
Yes! People might do `map` transformations, but it's very common to do other stuff at the same time. Any other allocations during that transformation would ruin cache hit rate.
> At the same time, the callsite for any given `.new()` invocation will almost always be creating an instance of the exact same class. The target expression is nearly always just a constant name. That makes it a prime candidate for good inline caching at those callsites.
Yes again!
VM implementer intuition only goes so far, and as the internet is the greatest fuzzer invented, you're definitely going to encounter programs that break your best laid plans.
True, but if you only have a single bottleneck cache site for all constructor invocations across the program, the only reasonable thing that callsite can learn is "wow, every single constructed class goes through here".
That's why it makes sense to have a separate cache at every `.new()` location.
Not necessarily. An inline cache is cheap but it's not free, even less so when it also comes with the expense of moving Class#new from C to Ruby. It's probably not worth speeding up the 1% at the expense of the 99%.
> An inline cache will make bulk allocations in list comprehensions faster.
Only if such comprehensions create exactly one type of object. If they create two, it's going to slow them down, and if they create zero (just doing data extraction), it won't do anything.
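For example (made-up classes):

```ruby
# One class through the allocation path: the site's cache stays hot.
points = coords.map { |x, y| Point.new(x, y) }

# Two classes alternating through one allocation site: each allocation
# evicts the other's cache entry, so every lookup misses.
shapes = coords.each_with_index.map do |(x, y), i|
  klass = i.even? ? Circle : Square
  klass.new(x, y) # one call site, two receiver classes
end
```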
We just had this conversation maybe a month ago. If it's 50-50 then you are correct. However, if it's skewed, then it depends. I can't recall what ratio was discovered to be workable; it was more than 50% and less than or equal to 90%.
https://en.wikipedia.org/wiki/UNCOL
There have been countless bytecode-based platforms since 1958, including all the famous Xerox PARC systems (the CPUs were microcoded and loaded the related translation code on boot), yet "WASM is doing it first" keeps being brought up.
Do you know what I call WASI containers running on a Kubernetes cluster, or serverless cloud vendors?
An application server: https://en.wikipedia.org/wiki/Application_server
By comparison, WASM is really more like traditional assembly, only running inside a sandbox.
For some reason, when people advocate for WASM outside of the browser, they only remember the JVM.
In the meantime the CLR happened too.
And - to an extent - LLVM IR.
Security-wise, perhaps a different story, though let's wait until WASM is in wide use with filesystem access and the bugs start to appear.
That’s how the people working on the project characterized it.