But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.
The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.
I spent WAYYYYYYYY too much time exploring this...
It might be slightly more instructions than some other serial VL (variable-length) integer codec choices, but overall I don't think it's more difficult.
The very efficient SIMD VL codecs tend to stripe (separate) the control and data bits, so they're in a different design space anyway.
ULEB128 works in SIMD because there's only one dependent bit per byte, so you can speculatively decode and then correct later cheaply. Bijou requires you to check the first byte and then branch based on the value using all 8 bits in the decision matrix (to handle branches 0-247, 248, 249, 250, 251, 252, 253, 254, 255). This absolutely DESTROYS any parallelization opportunities.
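To make the "one dependent bit per byte" point concrete, here is a minimal portable sketch (plain 64-bit word tricks rather than real SIMD, and the function name is made up for illustration). It finds every ULEB128 value boundary in an 8-byte window with a single AND, without looking at any neighbouring byte:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts how many ULEB128 values END inside the
 * 8-byte window at p. A byte ends a value when its continuation bit
 * (0x80) is clear, so all 8 bytes can be tested independently. */
static int uleb128_count_ends8(const uint8_t *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);                     /* unaligned-safe load */

    /* 0x80 stays set in each byte lane whose continuation bit was clear. */
    uint64_t ends = ~w & 0x8080808080808080ULL;

    int n = 0;
    while (ends) {                               /* portable popcount */
        ends &= ends - 1;
        n++;
    }
    return n;
}
```

A prefix-tag format has no equivalent of that mask, because whether a byte is a tag or payload depends on every tag before it.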
Not to mention that non-canonical sized ints (3, 5, 6, 7) have abysmal performance compared to unaligned 2, 4, and 8 byte reads on modern processors.
Even though decoding the lengths must be serial (since there's no unambiguous way to differentiate a tag and data byte), it's still doable within the wider SIMD registers, so there's some theoretical efficiency gain to be had (depending on the shape of the data).
On a general note, the continuation bit and prefix byte forms are equivalent: you just broadcast the prefix byte and compare against an increasing vector to convert it to a mask. Yeah, there's probably more fiddly SIMD if there are multiple prefixes in the register, but it's doable (it's just not byte-parallel; e.g. you unroll the serial decode loop 8 times, or whatever your maximum output byte width is, and mask out).
Simplified:
// Just maps a byte to its position in the register
__m128i idx = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
// Broadcast the prefix
__m128i nn = _mm_set1_epi8((char)prefix_byte);
// Get applicable locations: prefix_byte contains the length, if byte_pos < len, the corresponding byte will be set
__m128i m = _mm_cmpgt_epi8(nn, idx);
// If you *really* want a high-bit mask:
m = _mm_and_si128(m, _mm_set1_epi8((char)0x80));

Interleaved Bijou has no such signal (tag and payload bytes both span 0x00–0xFF), so finding the boundaries is a dependent per-value walk with no opportunities for parallelism.
The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!
The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for a 32-bit LEB128), so that when the references are patched during linking, no code is moved around. Any integer can be converted to a non-canonical 5-byte form that is "wasteful", but it's a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way up to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you about 500 values that fit in 2 bytes or fewer.
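A rough sketch of that padded form, assuming 32-bit values and unsigned LEB128. The function names are invented for illustration and are not taken from any particular toolchain:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes v as exactly 5 ULEB128 bytes, setting the
 * continuation bit on the first four even when the remaining bits are
 * zero. The slot can later be overwritten with any other 32-bit value
 * without changing its size. */
static void uleb128_write_padded5(uint8_t out[5], uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t)((v & 0x7F) | 0x80);
        v >>= 7;
    }
    out[4] = (uint8_t)(v & 0x7F);   /* last 4 bits, continuation clear */
}

/* An ordinary ULEB128 reader accepts the padded form unchanged.
 * Sketch only: it trusts the input and does no bounds checking. */
static uint32_t uleb128_read_u32(const uint8_t *p, size_t *len)
{
    uint32_t v = 0;
    unsigned shift = 0;
    size_t i = 0;
    for (;;) {
        uint8_t b = p[i++];
        v |= (uint32_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0 || i == 5)
            break;
        shift += 7;
    }
    *len = i;
    return v;
}
```

Writing a placeholder 0 at compile time and patching in the real address at link time touches the same 5 bytes both times, which is the whole point.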
https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
Does this encoding have a name?
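For anyone who wants to poke at it, here is a small decoder sketch for the scheme described above. The 7-bit group order (least significant first, like LEB128) and the 9-byte cap are my assumptions, since the two examples given don't pin them down, and the function name is made up:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: each byte carries 7 value bits, and every
 * continuation bit also shifts the base past everything a shorter
 * encoding could express, so each number has exactly one encoding.
 * Returns bytes consumed, or 0 if the input is truncated or too long. */
static size_t bijective_varint_decode(const uint8_t *p, size_t avail,
                                      uint64_t *out)
{
    uint64_t v = 0;
    for (size_t i = 0; i < avail && i < 9; i++) {
        uint8_t b = p[i];
        v += (uint64_t)(b & 0x7F) << (7 * i);
        if ((b & 0x80) == 0) {
            *out = v;
            return i + 1;
        }
        /* Continuation: move the base up to the next window. */
        v += (uint64_t)1 << (7 * (i + 1));
    }
    return 0;
}
```

With this, 0x80 0x00 decodes to 128 and 0x80 0x80 0x00 decodes to 16512, matching the two examples.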
[0]: https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki
I'm not saying SQLite's varint implementation is ideal for every application. It's just one of the most widely used implementations, if not the most (I'd bet it is, by a large margin). It seemed like a missed opportunity not to compare it with the implementation they landed on.
EDIT: Just wanted to add, thanks for sharing that link. Interesting!
An adversarial packet can claim to have a 255-tagged integer but not actually have any follow-up bytes, tricking the payload parser into an incorrect offset and reading straight off into whatever memory follows.
It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.
In a contrived example of a pbuf {length:int, payload:byte[1]}
LEB128 can trick you into reading the payload as part of the length, but then hopefully trigger a code check against invalid buffer read. (or one byte outside the struct if the payload is also malicious)
Bijou64 can trick you into reading 7 bytes of other memory before any buffer size validation is done.
It's then not uncommon to log with a helpful; "buffer with length: 26624894573377(7 bytes of stolen data) is invalid", or just crash.
It's to the point that Bijou64_decode should perhaps take "end_address" or "max_read" to catch this kind of attack.
(If you don't validate a malicious pbuf, you're in for a bad time regardless of integer format, but these int formats add their own way to trigger a buffer overrun despite a proper check.)
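Something along these lines is what I have in mind. This is NOT the real Bijou64 decoder: the layout is only what can be inferred from this thread (first byte 0-247 is the value, tags 248-255 announce 1-8 payload bytes), the little-endian payload order is a guess, the canonical value offset is left out, and the function name is made up. The only point is where the bounds check sits:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds-checked decoder for a prefix-tagged integer.
 * Assumed layout (inferred, not from the spec): tag < 248 is the value
 * itself, tag 248 + k is followed by k + 1 payload bytes.
 * Returns bytes consumed, or 0 if the value would run past `end`. */
static size_t prefixed_uint_decode(const uint8_t *p, const uint8_t *end,
                                   uint64_t *out)
{
    if (p >= end)
        return 0;

    uint8_t tag = p[0];
    if (tag < 248) {
        *out = tag;
        return 1;
    }

    size_t n = (size_t)(tag - 248) + 1;      /* 1..8 payload bytes */
    if ((size_t)(end - p) - 1 < n)
        return 0;                            /* refuse BEFORE reading payload */

    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[1 + i] << (8 * i);  /* assumed little-endian */
    *out = v;
    return 1 + n;
}
```

With the {length, payload: byte[1]} example, a 255 tag followed by a single byte returns 0 instead of pulling 7 bytes from past the buffer.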
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead of null termination.
> ...adversarial input, which is rarely in the test suite.
This made me scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I thought that for anything security-related (like the author's project) testing against adversarial input would be a prominent part.
They might have a different definition of adversarial than you.
> My tests for quite pedestrian APIs often contain adversarial input of obvious shapes.
This doesn't seem like what I would call adversarial.
This seems like standard negative testing or boundary value analysis - which I would be shocked if they didn't do.
The first is what they describe here: as an attack. It's like why would anyone ever overflow a buffer with shellcode.
The second is that they are implementing a spec that requires appending a varint length-prefixed field to a buffer but don't really care about the space optimization, don't know the field's length when they start appending it, and don't want to put the field into a second, temporary buffer or slide it down into place. https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08f47081... vs say my own code which does the slide: https://github.com/scottlamb/retina/blob/6972ac4261ce7bf5b58...
It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.
If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.
But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).
If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.
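A minimal sketch of the reserve-and-backfill approach, assuming unsigned LEB128 lengths and a 4-byte reserved prefix (so payloads under 2^28 bytes). The names and the fixed width are illustrative choices, not taken from FFmpeg or retina:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LEN_PREFIX_BYTES = 4 };   /* assumed width: lengths below 2^28 */

/* Fills the reserved prefix with len as an overlong ULEB128: unused
 * high groups become 0x80 "no-op" bytes, the last byte ends the value. */
static int patch_len_prefix(uint8_t *dst, size_t len)
{
    if (len >= 0x10000000u)
        return -1;                           /* does not fit in 28 bits */
    for (int i = 0; i < LEN_PREFIX_BYTES; i++) {
        uint8_t b = (uint8_t)(len & 0x7F);
        len >>= 7;
        if (i < LEN_PREFIX_BYTES - 1)
            b |= 0x80;
        dst[i] = b;
    }
    return 0;
}

/* Appends one length-prefixed field at offset pos; returns the new
 * offset, or 0 if it does not fit. The memcpy stands in for an encoder
 * that only learns the payload size once it has finished writing. */
static size_t append_field(uint8_t *buf, size_t cap, size_t pos,
                           const uint8_t *payload, size_t n)
{
    if (pos > cap || cap - pos < LEN_PREFIX_BYTES ||
        cap - pos - LEN_PREFIX_BYTES < n)
        return 0;
    size_t start = pos + LEN_PREFIX_BYTES;   /* skip the reserved prefix */
    memcpy(buf + start, payload, n);
    if (patch_len_prefix(buf + pos, n) != 0) /* come back and fill it in */
        return 0;
    return start + n;
}
```

A format that forbids overlong encodings rules this out, and you are back to buffering the payload or sliding it into place.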
I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.
Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.
Edit: to clarify, I was talking about the author's argument being strange, not yours.
Edit: a properly written decoder is a lot more than 10 lines if you properly deal with integer overflow and both signed and unsigned ints.
And say you have it as part of some other data. If you want to be able to hash it by the raw memory bytes, having many different ways to represent a number becomes a problem.
If you don't do this properly, you end up with things like:
- SAML XSW attack due to XML signature wrapping
- ASN.1 BER/DER signature forgery
- Bitcoin transaction malleability attacks
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.