Recently, I refactored both the Go and Python versions to adopt Caffeine’s adaptive algorithm for improved hit ratio performance. But now that Otter v2 has switched to an adaptive W-TinyLFU approach and is more closely aligned with Caffeine’s implementation, I’m considering focusing more on the Python version.
This feels like a good time to do so: the Python community is actively working toward free-threading, and once the GIL is no longer a bottleneck, larger machines and multi-threaded workloads will become more viable. At that point, a high-performance, free-threading-compatible caching library for Python will be important.
Another major advantage of on-heap caches that wasn't mentioned is their portability: for us that matters because they can compile to WebAssembly.
Ideally we don’t want to round-trip to a network endpoint, but we run multiple instances of our monolith and we want a shared cache tier for efficiency.
Any architecture/library recommendations?
1. How much data do you have and how many entries? If you have lots of data with very small records, you might need an off-heap cache solution. The only ready-made implementation I know of is Olric [1].
2. If you can use an on-heap cache, you might want to look at groupcache [2]. It's not "blazingly-fast", but it's battle-tested. Potential drawbacks include LRU eviction and lack of generics (meaning extra GC pressure from using `interface{}` for keys/values). It's also barely maintained, though you can find active forks on GitHub. A rough usage sketch follows the links below.
3. You could implement your own solution, though I doubt you'd want to go that route. Architecturally, segcache [3] looks interesting.
[1]: https://github.com/olric-data/olric
[2]: https://github.com/golang/groupcache
[3]: https://www.usenix.org/conference/nsdi21/presentation/yang-j...
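To give a feel for item 2, here's a rough sketch of what groupcache usage looks like. The group name, size limit, and loader are made up, and the getter signature has shifted slightly across versions and forks, so treat this as illustrative rather than definitive:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/golang/groupcache"
)

// fetchFromSource stands in for whatever slow lookup the cache protects.
func fetchFromSource(key string) ([]byte, error) {
	return []byte("value-for-" + key), nil
}

func main() {
	// 64 MB per-process group; peers would normally be registered via an
	// HTTPPool to get the distributed cache-filling behavior.
	group := groupcache.NewGroup("things", 64<<20, groupcache.GetterFunc(
		func(ctx context.Context, key string, dest groupcache.Sink) error {
			data, err := fetchFromSource(key)
			if err != nil {
				return err
			}
			// Values travel through Sinks as bytes/strings, not typed values.
			return dest.SetBytes(data)
		}))

	var data []byte
	if err := group.Get(context.Background(), "some-key",
		groupcache.AllocatingByteSliceSink(&data)); err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(data))
}
```

Keys are strings and values go through byte-oriented sinks, which is part of where the extra copying/GC overhead comes from compared to a typed, generic cache.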
Or if you just want it all to be in-memory, perhaps use some other non-distributed caching library and do the replication via NATS? I'm sure there are lots of gotchas with something like that, but Marmot is an example of doing SQLite replication via NATS JetStream.
edit: actually, you can set JetStream KV to use in-memory rather than file storage. So it could do the job of Olric, or of rolling your own distributed KV via NATS. https://docs.nats.io/nats-concepts/jetstream/streams#storage...
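For reference, a minimal sketch of a memory-backed KV bucket with the nats.go client (the bucket name, TTL, and keys are made up):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Memory-backed bucket: no file persistence, entries expire after the TTL.
	kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
		Bucket:  "cache",
		Storage: nats.MemoryStorage,
		TTL:     5 * time.Minute,
	})
	if err != nil {
		log.Fatal(err)
	}

	if _, err := kv.Put("user:42", []byte("cached-value")); err != nil {
		log.Fatal(err)
	}
	entry, err := kv.Get("user:42")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(entry.Value()))
}
```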
1) Shared memory - use a cache/key-value lib which allows you to swap the backend to some shmem implementation
2) File-system based - managing concurrent writes is the challenge here, maybe best to use something battle tested (sqlite was mentioned in a sibling)
3) Local sockets - not strictly "no network", but at least no inter-node communication. Start valkey/redis and talk to it via local socket?
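A minimal sketch of option 3 with go-redis, assuming the server was started with a Unix socket enabled (the socket path is made up):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Talk to a local valkey/redis over a Unix domain socket instead of TCP.
	rdb := redis.NewClient(&redis.Options{
		Network: "unix",
		Addr:    "/var/run/valkey/valkey.sock",
	})
	ctx := context.Background()

	if err := rdb.Set(ctx, "greeting", "hello", time.Minute).Err(); err != nil {
		log.Fatal(err)
	}
	val, err := rdb.Get(ctx, "greeting").Result()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(val)
}
```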
Would be interested in the actual use case though: if the monolith is written in anything even slightly modern, the language/runtime should give you primitives to parallelize over cores without worrying about something like this at all. And when it comes to horizontal scaling with multiple nodes, there is no avoiding networking anyway.
It's an easy-to-understand system built on a well battle-tested library, and getting rid of the cache is easy: delete the file.
EDIT: I will say for most use cases, the database cache is probably plenty. Don't add power until you really need it.
If so, that’s not how I’d normally think of a distributed cache. When I think of a distributed cache, I’m thinking of multiple instances, likely (but not necessarily) running on multiple nodes. So, I’m having a bit of a disconnect…
1. Even using sync.RWMutex and specialized policies won't really help you outperform a well-implemented BP-Wrapper in terms of latency/throughput.
2. I've never seen cases where W-TinyLFU loses more than 2-3% hit rate compared to simpler eviction policies. But most simple policies are vulnerable to attacks and can drop your hit rate by dozens of percentage points under workload variations. Even ignoring adversarial workloads, you'd still need to guess which specific policy gives you those extra few percentage points. I question the very premise of this approach.
3. When it comes to loading and refreshing, writing a correct implementation is non-trivial. After implementing it, I'm not sure the cache could still be called "simple". And at the very least, refreshing can reduce end-to-end latency by orders of magnitude.
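To illustrate the loading point: even the "simple" part, de-duplicating concurrent loads for the same key, usually ends up reaching for something like golang.org/x/sync/singleflight, and that's before expiration and refresh-ahead enter the picture. A minimal sketch, assuming a plain map-backed cache (names are illustrative):

```go
package cache

import (
	"sync"

	"golang.org/x/sync/singleflight"
)

// LoadingCache deduplicates concurrent loads for the same key so that only
// one loader call hits the backing store (cache stampede protection). It
// deliberately ignores eviction, TTLs, and refresh-ahead -- which is exactly
// where the real complexity starts.
type LoadingCache struct {
	mu     sync.RWMutex
	data   map[string][]byte
	group  singleflight.Group
	loader func(key string) ([]byte, error)
}

func NewLoadingCache(loader func(string) ([]byte, error)) *LoadingCache {
	return &LoadingCache{data: make(map[string][]byte), loader: loader}
}

func (c *LoadingCache) Get(key string) ([]byte, error) {
	c.mu.RLock()
	if v, ok := c.data[key]; ok {
		c.mu.RUnlock()
		return v, nil
	}
	c.mu.RUnlock()

	// All concurrent callers for the same key share a single loader call.
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		val, err := c.loader(key)
		if err != nil {
			return nil, err
		}
		c.mu.Lock()
		c.data[key] = val
		c.mu.Unlock()
		return val, nil
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
```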
Out of curiosity, has any benchmarking been done to compare Otter etc vs things like (localhost/Unix socket) Redis?
A common pattern, for example, is using on-heap caches together with off-heap caches/dedicated cache servers (like Redis) in an L1/L2/Lx model. Here, the on-heap cache serves as the service's first line of defense, protecting slower, larger caches/databases from excessive load.
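A minimal sketch of that read-through L1/L2 pattern, assuming go-redis for the L2 and a mutex-protected map standing in for the on-heap L1 (the field names, TTL, and load callback are illustrative):

```go
package cache

import (
	"context"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

// TieredCache: L1 is a small in-process map (stand-in for an on-heap cache
// such as otter or ccache), L2 is Redis/Valkey, and load is the slow source
// of truth (e.g. the database).
type TieredCache struct {
	mu   sync.RWMutex
	l1   map[string][]byte
	l2   *redis.Client
	ttl  time.Duration
	load func(ctx context.Context, key string) ([]byte, error)
}

func (c *TieredCache) Get(ctx context.Context, key string) ([]byte, error) {
	// L1: in-process, no serialization or syscalls.
	c.mu.RLock()
	if v, ok := c.l1[key]; ok {
		c.mu.RUnlock()
		return v, nil
	}
	c.mu.RUnlock()

	// L2: shared across instances, still far cheaper than the database.
	if v, err := c.l2.Get(ctx, key).Bytes(); err == nil {
		c.setL1(key, v)
		return v, nil
	}

	// Miss in both tiers: load from the source and populate both caches.
	v, err := c.load(ctx, key)
	if err != nil {
		return nil, err
	}
	c.l2.Set(ctx, key, v, c.ttl)
	c.setL1(key, v)
	return v, nil
}

func (c *TieredCache) setL1(key string, v []byte) {
	c.mu.Lock()
	c.l1[key] = v
	c.mu.Unlock()
}
```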
And, to clarify, I was only thinking about localhost/Unix sockets to mostly eliminate network latency. Anything external to the server would obviously be incomparably slower.
I also suppose that it would be perhaps even more interesting/useful to compare the speed of these in-memory caches to Golang caches/kv stores that have disk persistence, and perhaps even also things like sqlite. Obviously the type of "disk" would matter significantly, with nvme being closer in perf (and probably sufficient for most applications).
Anyway, it was just a thought.
How do you compare to ccache? It's my go-to cache library. The need is mostly on high-traffic endpoints, so LRU it is.
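For context, the usage I mean looks roughly like this with ccache v3 (the size, TTL, and keys are made up):

```go
package main

import (
	"fmt"
	"time"

	"github.com/karlseguin/ccache/v3"
)

func main() {
	// Bounded LRU-style cache for hot endpoint responses with a short TTL.
	cache := ccache.New(ccache.Configure[[]byte]().MaxSize(100_000))

	cache.Set("GET /popular", []byte("cached response body"), 30*time.Second)

	if item := cache.Get("GET /popular"); item != nil && !item.Expired() {
		fmt.Println(string(item.Value()))
	}
}
```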
Note that otter's simulator results were repeatedly compared against both W-TinyLFU's (Caffeine) and S3-FIFO's (libCacheSim) simulators, showing nearly identical results with differences within hundredths of a percent. Otter's docs include benchmarks for throughput [1], memory consumption [2], and hit ratio [3].
[1]: https://maypok86.github.io/otter/performance/throughput/
[2]: https://maypok86.github.io/otter/performance/memory-consumpt...
[3]: https://maypok86.github.io/otter/performance/hit-ratio/
I don't quite understand the specific rationale for replacing segmented LRU with S3-FIFO. If I remember correctly, even the original authors stated it doesn't provide significant benefits [1].
Regarding TinyUFO - are you using lock-free queues? Has the algorithmic complexity of TinyUFO changed? (In the base version, S3-FIFO is O(n).) How easy is it to add new features? With lock-free queues, even implementing a decent expiration policy becomes a major challenge.