First is that, if the parsing library for your codec includes a compiler, VM, and PGO, your codec must be extremely cursed and you should take a step back and think about your life.
Second is that, if the parsing library for your codec includes a compiler, VM, and PGO, your codec must be wildly popular and add enormous value.
If it's compilation at run-time, then I agree: it should be done at build time. But in this case it's really not a big deal.
If you're objecting to needing a compiler, then... you're not even wrong.
I’ve assumed that writing fast interpreters wasn’t a use case the Go team cared much about, but if it makes protobuf parsing faster, maybe it will get some attention, and some of these low-level tricks will no longer be necessary?
Don't count on it. This language is Golang.
If so, how does it compare in practice?
(what does Cloudflare Workers use?)
I have it on my TODO list to write a blog post about Workers' use of Cap'n Proto. By far the biggest wins for us are from the RPC system -- the serialization honestly doesn't matter so much.
That said, the ecosystem around Cap'n Proto is obviously lacking compared to protobuf. For the Cloudflare Workers Runtime team specifically, the fact that we own it and can make any changes we need to balances this out. But I'm hesitant to recommend it to people at other companies, unless you are eager to jump into the source code whenever there's a problem.
It’s frustrating when we can’t have nice things because a mediocre Google product has sucked all the air out of the room. I’m not only talking about Protobufs here either.
I've used it, but that was too long ago to know its current state.
1) The flatbuffer parser can be configured at runtime from a schema file, so our message-passing runtime does not need to know about any schemas at build time. It reads the schema files at startup, and is henceforth capable of translating messages to and from JSON when required. It's also possible to determine at runtime that two schemas will be compatible.
2) Messages can be re-used. For our high-rate messages, we build a message and then modify it to send again, rather than building it from scratch each time.
3) Zero decode overhead - there is often no need to deserialise messages - so we can avoid copying the data therein.
The flatbuffer compiler is also extremely fast, which is nice at build time.
However, my preference is to use something like MsgPack or CBOR with compile-time reflection to do serde directly into types. You can design the types to require minimal allocations and to parse messages within a few dozen nanoseconds. That means doing things like using static char arrays for strings. It wastes a bit of space, but it can be very fast. Also, the space saved by skipping 64-bit pointers can cover a lot of shorter text fields.
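To make the layout idea concrete, here is a rough Go analogue (the commenter's setup is Nim with compile-time reflection, so this is only an illustration, not their code): fixed-size arrays instead of string/slice fields keep the whole message in one flat block, so a decoder can copy straight into it without per-field allocations.

    package main

    // Quote is a hypothetical market-data message laid out for zero-allocation
    // decoding: no strings, no slices, no pointers, just one flat block of bytes.
    type Quote struct {
        Symbol [16]byte // "static char array" instead of string: wastes a little space
        Price  int64
        Size   int32
        Flags  uint32
    }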
That said, I should wrap or port this UPB to Nim. It'd be a nice alternative if it's really as fast as claimed. Though how it handles allocating would be the clincher.
They support unions now. I haven't had any trouble representing anything including recursive structures.
> The code output was a pain, but targeted a few languages.
You do need a day or so to get used to the code. It's a pointer-based system with a 'flat memory'. Order of operations matters. If you have a parent and a child, you need to write the child first, obtain a pointer to it, and only then create/write the parent containing the pointer to the child. Once you get used to this it goes quickly. The advantage is that you don't have to create an intermediate copy in memory when reading (like protobuf does), and you can read any particular part by traversing pointers without having to load the rest of the data into memory.
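For what it's worth, here is a minimal sketch of that write order using the raw Go builder API (the code flatc generates is essentially a wrapper around these calls; the slot numbers assume a made-up one-field child and one-field parent):

    package main

    import flatbuffers "github.com/google/flatbuffers/go"

    func buildMessage() []byte {
        b := flatbuffers.NewBuilder(1024)

        // Children first: the string, then the child table that references it.
        name := b.CreateString("example")
        b.StartObject(1) // child table with one field
        b.PrependUOffsetTSlot(0, name, 0)
        child := b.EndObject()

        // Only now can the parent be written, storing the offset to the child.
        b.StartObject(1) // parent table with one field
        b.PrependUOffsetTSlot(0, child, 0)
        parent := b.EndObject()

        b.Finish(parent)
        return b.FinishedBytes()
    }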
The services serve both CBOR and other protocols simultaneously.
We used to store animation data in mmaped flatbuffers at a previous gig and it worked really well. Kernel would happily prefetch on access and page out under pressure, we could have 10s of MBs of animation data and only pay a couple hundred kb based on access patterns.
This is kind of confusing, the VM is runtime crafted to parse a single protobuf message type and only this message type? The Second Futamura Projection, I suppose...
Or the VM is designed specifically around generic protobuf messages and it can parse any random message but only if it's a protobuf message?
I've been working on the design of a similar system, but for general binary parsing (think bison/yacc for binary data), and hadn't even considered data over a specialized VM vs. bytecode+data over a general VM. Honestly, since it's designed around 'maximum laziness' (it just parses/verifies and creates metadata over the input, so you only pay for decoding the bytes you actually use) and I/O overhead is way greater than the VM dispatch, trying this out is probably one of those "premature optimization is the root of all evil" cases, but it's intriguing nonetheless.
Calling a Protobuf Parser an "interpreter VM" is a little bit of rhetorical flourish. It comes from the observation that there are some deep structural similarities between the two, which I first observed in an article a few years back: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
> It may seem odd to compare interpreter loops to protobuf parsers, but the nature of the protobuf wire format makes them more similar than you might expect. The protobuf wire format is a series of tag/value pairs, where the tag contains a field number and wire type. This tag acts similarly to an interpreter opcode: it tells us what operation we need to perform to parse this field’s data. Like interpreter opcodes, protobuf field numbers can come in any order, so we have to be prepared to dispatch to any part of the code at any time.
This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.
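As a concrete (and heavily simplified) Go sketch of that shape, for a hypothetical two-field message `int32 id = 1; string name = 2;`: a real parser also has to skip unknown fields and handle groups and malformed input, but the while-around-switch skeleton is the same.

    package main

    import (
        "encoding/binary"
        "errors"
    )

    // Msg is the in-memory form of the hypothetical message.
    type Msg struct {
        ID   int32
        Name string
    }

    func parse(buf []byte, m *Msg) error {
        for len(buf) > 0 { // the while() loop
            tag, n := binary.Uvarint(buf) // tag = field_number<<3 | wire_type
            if n <= 0 {
                return errors.New("bad tag")
            }
            buf = buf[n:]
            switch tag >> 3 { // the switch(): the field number acts like an opcode
            case 1: // id: varint
                v, vn := binary.Uvarint(buf)
                if vn <= 0 {
                    return errors.New("bad varint")
                }
                m.ID = int32(v)
                buf = buf[vn:]
            case 2: // name: length-delimited
                l, ln := binary.Uvarint(buf)
                if ln <= 0 || l > uint64(len(buf)-ln) {
                    return errors.New("bad length")
                }
                end := ln + int(l)
                m.Name = string(buf[ln:end])
                buf = buf[end:]
            default: // a real parser must skip unknown fields by wire type
                return errors.New("unknown field")
            }
        }
        return nil
    }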
The tricky part is that the set of "case" labels for a Protobuf parser is message-specific and defined by the fields in the schema. How do we accommodate that?
The traditional answer was to generate a function per message and use the schema's field numbers as the case labels. You can see an example of that here (in C++): https://github.com/protocolbuffers/protobuf/blob/f763a2a8608...
More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.
The trick then becomes how to make this table-driven dispatch as fast as possible, to simulate what the switch() statement would have done. That question is what I cover at length in the article mentioned above.
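A hedged sketch of what "table-driven" means here (the field layout and names are invented; this is not upb's or hyperpb's actual representation): the schema is compiled into per-message data, and a single generic loop dispatches through that data instead of through generated case labels.

    package main

    import (
        "encoding/binary"
        "errors"
    )

    // fieldEntry is the per-field "table" data. A real implementation packs this
    // much more tightly (expected tag bytes, hasbit index, struct offset,
    // next-field hints); a closure keeps the sketch short.
    type fieldEntry struct {
        wireType uint64
        parse    func(msg any, data []byte) (rest []byte, err error)
    }

    // messageTable is the compiled form of one message's schema.
    type messageTable struct {
        fields map[uint64]fieldEntry // keyed by field number
    }

    // parseWithTable is the single generic parser shared by every message type.
    func parseWithTable(t *messageTable, msg any, buf []byte) error {
        for len(buf) > 0 {
            tag, n := binary.Uvarint(buf)
            if n <= 0 {
                return errors.New("bad tag")
            }
            e, ok := t.fields[tag>>3]
            if !ok || tag&7 != e.wireType {
                return errors.New("unknown field or wire-type mismatch")
            }
            rest, err := e.parse(msg, buf[n:])
            if err != nil {
                return err
            }
            buf = rest
        }
        return nil
    }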
The generated code in C++ used to check for the expected next field before falling back to the switch() (example here: https://github.com/protocolbuffers/protobuf/blob/460e7dd7c47...) but this was removed in 2016 when load tests found that it hurt more than it helped.
One tricky part of making this optimization work is making a good guess about what the next field should be. Miguel's article alludes to this:
> Each field specifies which fields to try next. This allows the compiler to perform field scheduling, by carefully deciding which order to try fields in based both on their declaration order and a rough estimation of their “hotness”, much like branch scheduling happens in a program compiler. This avoids almost all of the work of looking up the next field in the common case, because we have already pre-loaded the correct guess.
> I haven’t managed to nail down a good algorithm for this yet, but I am working on a system for implementing a type of “branch prediction” for PGO, that tries to provide better predictions for the next fields to try based on what has been seen before.
One delightful thing about the tail-call parser design is that the CPU's branch predictor effectively takes over the job of guessing. With a tail call parser, the dispatch sequence ends up looking like this:
cmp QWORD PTR [rdi+0x8],rsi # Bounds check
jbe .fallback
movzx r10d,WORD PTR [rsi] # Load two bytes of tag
mov eax,r10d
and eax,0xf8
mov r9,QWORD PTR [rcx+rax*2] # Load table data
xor r9,r10
mov rax,QWORD PTR [rcx+rax*2+0x8] # Load field parser function
jmp rax # Tail call to field parser
That "jmp rax" instruction is an indirect branch that can be predicted by the CPU. The CPU effectively guesses for us! And unlike any kind of guess we might have performed ahead-of-time, the branch predictor will constantly adapt to whatever patterns it is seeing in the data. This is good, because statically guessing ahead-of-time is hard.
You could probably even trace the history of the idea all the way to Van Jacobson’s 30-instruction TCP fastpath[1]. Or to go a bit closer, I’ve found that an interpreter for a stack+accumulator VM (which, compared to the pure stack option, is prone to blowing up the bytecode count and thus dispatch cost with the constant PUSH-accumulator instructions) goes significantly faster if you change the (non-shared) dispatch from
    return impl[*pc](pc, ...);

to

    if (*pc == PUSH) {
        do_push(...); pc++;
    }
    return impl[*pc](pc, ...);
which feels somewhat analogous to the next-field optimization and avoids polluting the indirect branch predictor with the very common PUSH predictions. (It's still slower than not having those PUSHes in the first place.)

[1] https://www.pdl.cmu.edu/mailinglists/ips/mail/msg00133.html
Everything old is new again, I guess—one of the more advertised changes in Microsoft COM as it was maturing (circa 1995) was that you could use data-driven marshalling with “NDR format strings” (bytecode, essentially[1,2]) instead of generating C code. Shortly after there was typelib marshalling (format-compatible but limited), and much later also WinRT’s metadata-driven marshalling (widely advertised but basically completely undocumented).
Fabrice Bellard’s nonfree ASN.1 compiler[3] is also notable for converting schemas into data rather than code, unlike most of its open-source alternatives.
I still can’t help wondering what it is, really, that makes the bytecode-VM approach advantageous. In 1995, the answer seems simple: the inevitable binary bloat was unacceptable for a system that needs to fit into single-digit megabytes of RAM; and of course bytecode as a way to reduce code footprint has plenty of prior art (microcomputer BASICs, SWEET16, Forth, P-code, even as far back as the AGC).
Nowadays, the answer doesn’t seem as straightforward. Sure, the footprint is enormous if you’re Google, but you’re not Google (you’re probably not even Cloudflare), and besides, I hope you’re not Google and can design stuff that’s actually properly adapted to static linking. Sure, the I$ pressure is significant (thinking back to Mike Pall’s explanation[4] why he didn’t find a baseline JIT to be useful), but the bytecode interpreter isn’t going to be a speed demon either.
I don’t get it. I can believe it’s true, but I don’t really feel that I get it.
[1] https://learn.microsoft.com/en-us/windows/win32/rpc/rpc-ndr-...
[2] https://gary-nebbett.blogspot.com/2020/04/rpc-ndr-engine-dce...
I think it's a very illuminating way to describe it. Nicely done with the implementation as well.
This is a very insightful design. Encoding the parser in a table is an old technique, it's what YACC uses. There's a tradeoff between using the CPU's stack+registers versus having your own stack+state in this kind of work, and people have found situations where a small VM is faster because it gets to stay in the instruction cache and benefit from branch prediction, while the less-predictable data stays in the d-cache.
Interesting times we live in...
1. Go FFI is slow
2. Per-proto generated code specialization is slow, because of icache pressure
I know there's more to the optimization story here, but I guess these are the primary motivations for the VM over just better code generation or implementing a parser in non-Go?
But eventually your compiler is good enough that the FFI is now your bottleneck, and you need to do something.
A C function won't know how to behave in Go's runtime environment, so to call a C function Go needs to make itself look more like a C program, call the C function, and then restore its magic state.
Other languages like C++, Rust, and Swift are similar enough to C that they can just call C functions directly. CPython is a C program, so it can too. Golang was brave enough to do fundamental things its own way, which isn't quite C-compatible.
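To make the boundary concrete, here is the smallest possible cgo crossing. Even this no-op call goes through runtime.cgocall (switching to the system stack and doing scheduler bookkeeping) rather than a plain CALL instruction, which is roughly where the tens of nanoseconds quoted elsewhere in the thread come from.

    package main

    /*
    static int add(int a, int b) { return a + b; }
    */
    import "C"

    import "fmt"

    func main() {
        // Each C.add call pays the cgo entry/exit cost discussed above.
        fmt.Println(C.add(2, 3))
    }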
Go (gc) was also a C program originally. It still had the same overhead back then as it does now. The implementation language is immaterial. How things are implemented is what is significant. Go (tinygo), being a different implementation, can call C functions as fast as C can.
> ...so it can too.
In my experience, the C FFI overhead in CPython is significantly higher than Go (gc). How are you managing to avoid it?
Other languages generally fall into one of two camps: those with a C-like stack and thread model and easy FFI (e.g. Ruby, TCL, OCaml), which may have futures/async but not in an invisible/magic way; and those with a radically different threading model at the cost of FFI being slow and painful (e.g. Erlang). JavaScript is kind of special in having a C-like stack but being built around calling async functions from a global event loop, so it's technically the first but feels more like the second.
https://docs.oracle.com/en/java/javase/24/docs/specs/jni/int... mentions JRI.
But it seems like JNI has been replaced by third party solutions multiple times as well.
https://developer.okta.com/blog/2022/04/08/state-of-ffi-java...
Calling C safely is then slow because you have to allocate a larger stack, copy data around and mess with the GC.
It's about the same as most other languages that aren't specifically optimized for C calling. Considerably faster than Python.
Which is funny, as everyone on HN loves to extol the virtues of Python being a "C DSL" and never thinks twice about its overhead, but as soon as the word Go is mentioned, it's like your computer is going to catch fire if you even try.
Emotion-driven development is a bizarre world.
Calling C from Go (or vice versa) often requires switching from Go's lightweight goroutine model to a full OS thread model because:
- Go's scheduler manages goroutines on M:N threads, but C doesn't cooperate with Go's scheduler.
- If C code blocks (e.g., on I/O or mutex), Go must assume the worst and parks the thread, spawning another to keep Go alive.
* Cost: This means entering/exiting cgo is significantly more expensive than a normal Go call. There’s a syscall-like overhead.
... This was only the first issue; it then follows with "Go runtime can't see inside C to know if it is allocating, blocking, spinning, etc.", then "Stack switching", "Thread Affinity and TLS", "Debug/Profiling support overhead", and "Memory Ownership and GC barriers".

All here - https://chatgpt.com/share/688172c3-9fa4-800a-9b8f-e1252b57d0...
Please do not take the numbers below at face value. I still expect an actual reply to my initial comment.
Per-call overhead:
C (baseline) - ~30 ns
Rust (unsafe) - ~30 ns
C# (P/Invoke) - ~30-50 ns
LuaJIT - ~30-50 ns
Go (cgo) - ~40-60 ns
Java (22, FFM) - ~40-70 ns
Java (JNI) - ~300-1000 ns
Perl (XS) - ~500-1000 ns
Python (ctypes) - ~10,000-30,000 ns
Common Lisp (SBCL) - ~500-1500 ns
Seems like Go is still fast enough compared to other garbage-collected languages, so I am not sure it is fair to single out Go.

Language/API | Call Overhead (no-op C) | Notes
Go (cgo) | ~40–60 ns | Stack switch + thread pinning
Java FFM | ~50 ns (downcall) | Similar to JNI, can be ~30 ns with isTrivial()
Java FFM (leaf) | ~30–40 ns | Optimized (isTrivial=true)
JNI | ~50–60 ns | Slightly slower than FFM
Rust (unsafe) | ~5–20 ns | Near-zero overhead
C# (P/Invoke) | ~20–50 ns | Depends on marshaling
Python (cffi) | ~1,000–10,000 ns | Orders of magnitude slower
You can verify this easily with Deno FFI, which is pretty much optimal for JS runtimes. Also, from everything I have seen and read, LuaJIT should have even lower overhead than this.
You really shouldn't be asking ChatGPT questions like this, IMO. These are facts that need to be proven, not just vibes.
As if there is an alternative :)
More seriously, it’s “unsafe” from the perspective of the library calling into C, but usually “safe” for any layer above.
For what it is worth, I asked about LuaJIT after I shared the link, and the numbers are now different for some languages, albeit not by much. Go (cgo) became ~50-100 ns. That said, I still believe it is unfair to single out Go when it does way better than some other GC languages.
"ants_everywhere" is right.
The anti-LLM activists that patrol HN will often downvote comments just for stating facts that disagree with their hatred of LLMs.
Interesting. This makes me wonder how nanopb would benchmark against the parsers shown in the graphs. nanopb's whole schtick is that it's pure C99 and it doesn't generate separate parsing functions for each message. nanopb is what a lot of embedded code uses, due to the small footprint.
Hyperpb: Faster dynamic Protobuf parsing - https://news.ycombinator.com/item?id=44661785
I see a second effect that's probably more regressive: using protobufs (designed for RPC) as a database schema definition language or a disk data structure description language.
I much prefer something like typespec to describe types, and then derive the RPC and disk schema languages from that programmatically.
My first attempt was to use the flatbuffer schema language as a starting point. But being attached to a serialization format can be a net negative. Did not find traction.
Motivation: I cannot stand field numbers in a database schema when other migration mechanisms exist.
I’d make the same argument for gRPC-web to something like WHATWG streams and or WebTransport.
There are a lot of really cool and important learnings in both, but it's also all so tied up in weird tooling and assumptions. Let's rebase on IETF and W3C standards.
I'm not sure the purpose, though. Protobuf is great for its inflexible schema, and CBOR is great for its flexible data representation.
A separate CBOR schema would be a better fit, there's CDDL but it has no traction.
EDIT: Yes https://github.com/bufbuild/hyperpb-go/blob/main/internal/td...
This is very cool. The ASN.1 compiler I [sort of] maintain has an option to generate C code or a "template", and the "template" is just an AST that gets interpreted at run-time. I did not come up with that idea, but the people who did did it because it's faster to do the AST interpretation thing than to generate lots of object code -- the reason is that the templates and the template interpreter are smaller than the object code for the alternative, and cache effects add up.
I suppose the insight is that schema documents are programs, just in a weird language, and compiled programs are fast.
This made me LOL.
Please do one on your analysis and optimization workflow and tooling!
There are also flatbuffers, capnp, and purely C++ serializers (zpp, yas...), but protobufs are definitely convenient!
What is a real issue is marketing performance. As it turns out, everyone loves a story about something going fast, so it pays to create these kinds of optimizations for the sake of the story, which is then used to sell the product to developers who make choices based on what feels good to them.
(Technically protobuf supports JSON encoding, but it's additional work to enable, again diminishing the developer experience.)
It feels like arenas are just an attempt to bring that back.
The benchmarking here is... pointless. Here's my story (of writing a Protobuf parser) to illustrate why. I already wrote about it before, but it keeps happening, so, maybe repeating it is not such a bad thing.
So... the official (coming from Google) Protobuf parser for Python is ridiculously bad. It generates Python code for the parser. Not only is this unnecessary, it's also executed poorly. So, I decided to roll my own. Also, I decided to make it multi-threaded (I wrote it in C, with Python bindings).
Then I tried to benchmark it against the official implementation. I don't trust the quality of the Python language implementation, so that came under the spotlight first. For example, I chose Python's Enum type to represent Protobuf enums. That turned out to be a horrible decision, because instantiating Python Enums is extremely resource-intensive. I had to make a few similar adjustments, searching for the least resource-hungry Python built-in data structures.
And yet I was still far behind on my benchmarks. So, I tried to understand what the Google's C++ code was doing. And then I realized that Google's parser in the "first pass" only extracted the top-level messages, not even trying to parse the hierarchical structure. This structure was only extracted on-demand. So, essentially, I was competing against memcpy()...
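For the curious, here is a hedged reconstruction of that "first pass" idea in Go (not Google's actual code): it only walks tags and lengths, records where each length-delimited field's bytes live, and defers all real decoding, which is why it ends up competing with memcpy().

    package main

    import (
        "encoding/binary"
        "errors"
    )

    // scanTopLevel records where each length-delimited top-level field lives
    // without decoding anything. Submessages, strings, and bytes stay as raw
    // slices until somebody actually asks for them.
    func scanTopLevel(buf []byte) (map[uint64][]byte, error) {
        spans := map[uint64][]byte{}
        for len(buf) > 0 {
            tag, n := binary.Uvarint(buf)
            if n <= 0 {
                return nil, errors.New("bad tag")
            }
            buf = buf[n:]
            switch tag & 7 { // wire type
            case 0: // varint: skip
                _, vn := binary.Uvarint(buf)
                if vn <= 0 {
                    return nil, errors.New("bad varint")
                }
                buf = buf[vn:]
            case 1: // fixed64: skip 8 bytes
                if len(buf) < 8 {
                    return nil, errors.New("truncated fixed64")
                }
                buf = buf[8:]
            case 5: // fixed32: skip 4 bytes
                if len(buf) < 4 {
                    return nil, errors.New("truncated fixed32")
                }
                buf = buf[4:]
            case 2: // length-delimited: remember the span, decode on demand later
                l, ln := binary.Uvarint(buf)
                if ln <= 0 || l > uint64(len(buf)-ln) {
                    return nil, errors.New("bad length")
                }
                end := ln + int(l)
                spans[tag>>3] = buf[ln:end]
                buf = buf[end:]
            default:
                return nil, errors.New("unsupported wire type")
            }
        }
        return spans, nil
    }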
But, here's another catch: most real-world applications will not want the "intermediate" Protobuf structure made up of general-purpose data-structures such as dictionaries or enumerators. They will want them immediately parsed into domain-specific data-structures. So, a parser that offers this kind of functionality might be slower to parse into generic data-structures, but will win in real-world benchmarks.
Of course, the percentage of useful payload is also important. Typically, applications underutilize the messages they exchange. But, to what degree is different and completely dependent on application design. Similarly, mapping into domain-specific objects will depend on the sort of information the application wants to exchange. Protobuf is bad for column / tabular data, but somewhat better for object graph kind of data. Protobuf is bad for large messages with duplicated content, but is somewhat better for short messages with little duplication.
All of this, and much more, allows for a lot of variability in benchmark design, and would allow me, or anyone writing a parser to skew the numbers in whatever direction I want... to amuse the general public.