bzip2's BWT approach benefits from having no heuristics to encode, no lookup tables for offsets, no Huffman table metadata beyond the one coding table. The article's estimate of ~1.5KB for a single-table bzip2 decoder is plausible; I've seen similar results with stripped-down Huffman coders.
The point about zopfli is underappreciated too. People compare gzip vs bzip2 speed without noticing that 'fast gzip' and 'optimal gzip' are very different things. At comparable ratio targets, bzip2 is actually competitive.
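The fast-vs-optimal gap is easy to see even without zopfli. A quick sketch using Python's stdlib zlib (the same DEFLATE family gzip uses), with made-up repetitive sample data:

```python
import zlib

# Hypothetical repetitive payload standing in for real data.
data = b"the quick brown fox jumps over the lazy dog " * 2000

fast = zlib.compress(data, level=1)   # the "fast gzip" end of the spectrum
best = zlib.compress(data, level=9)   # the "optimal gzip" end of the spectrum

print(len(fast), len(best))  # level 9 should be at least as small as level 1
```

Tools like zopfli push well past level 9 by spending far more CPU on the same bitstream format, which is exactly why "gzip speed" comparisons need to say which gzip they mean.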
My impression is that this article has a lot of technical insight into how bzip2 compares to gzip, but it fails to actually account for the real cause of bzip2's diminished popularity relative to the non-gzip alternatives that it admits have become the more popular choices in recent years.
https://insanity.industries/post/pareto-optimal-compression/
Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.
Awesome! Please let me know if there is anything I can do to help
Does gmail use a special codec for storing emails?
Yes, I do. Zstd is my preferred solution nowadays. But gzip is not going anywhere as a fallback because there is a surprisingly high number of computers without a working libzstd.
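That fallback pattern is only a few lines. A minimal sketch assuming the third-party python-zstandard bindings, dropping back to stdlib gzip on machines without them:

```python
import gzip

def compress(data: bytes) -> bytes:
    """Prefer zstd, but fall back to gzip on machines without the bindings."""
    try:
        import zstandard  # third-party; not installed everywhere, which is the point
    except ImportError:
        return gzip.compress(data)
    return zstandard.ZstdCompressor().compress(data)
```

The output self-identifies via magic bytes (gzip starts with `1f 8b`, zstd with `28 b5 2f fd`), so the decompressing side can pick the right codec without out-of-band signaling.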
TIL. So that's why gzip has a file header! But tar.gz compresses even better; that's probably why it hasn't caught on.
That being said, speed is important for compression, so for systems like webservers it's an easy sell. That's a very strong point for gzip (along with smarter implementations in programs).
Long comment to just say: 'I have no idea what I'm writing about'
These compression algorithms don't have anything to do with filesystem structure. Anyway, the reason you can't cat together parts of a bzip2 file but can with zstd (and gzip) is that zstd does everything in frames, and each frame can be decompressed independently (so you can seek to and decompress individual parts). Bzip2 doesn't do that.
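The same concatenation property holds for gzip members (RFC 1952 allows multiple members per file), which is easy to demonstrate with stdlib Python:

```python
import gzip

# Two independently compressed members...
part_a = gzip.compress(b"hello ")
part_b = gzip.compress(b"world")

# ...cat'd together still form one valid gzip stream.
combined = part_a + part_b
print(gzip.decompress(combined))  # b'hello world'
```

This is also why `cat a.gz b.gz > ab.gz` produces a file that gunzips to the concatenated payloads.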
So like, another place bzip2 sucks is working with large archives: you need to read through the entire archive before you can decompress a given part, so without parity data a single error is far more likely to cause loss of the whole archive. Really, don't use it unless you have a very specific use case and know the tradeoffs. For the average person it was great back when we'd spend the time compressing to save the time sending over dialup.
https://github.com/facebook/zstd?tab=readme-ov-file#benchmar...
In my own testing of compressing internal generic json blobs, I found brotli a clear winner when comparing space and time.
If I want higher compatibility and fast speeds, I'd probably just reach for gzip.
zstd is good for many use cases, too, perhaps even most...but I think just telling everyone to always use it isn't necessarily the best advice.
It's slower and compresses worse than zstd. gzip should only be reached for as a compatibility option; that's the only place it wins: it's everywhere.
EDIT: If you must use it, use the modern implementation, https://www.zlib.net/pigz/
Compressed size and decompression speed are its main limitations.