This was a client-side framework, in the OP's parlance. What's missing in the OP is the insight that the server-side load balancer can also fail -- what load balances the load balancers? We performed registration based on health checks from a sidecar, and we also did client-side checks, which we called connectivity checks. Multiple client instances can disagree about the state of the world because network partitions really can produce different views of the world for different clients -- again, an inescapable reality.
Finally, you still need circuit breakers as well. Health checks are generally pretty broad, and when a single endpoint in a service starts exhibiting high latency, you don't want to bring down the entire client service with all its capacity stuck making requests to that one endpoint. This specific example is probably more relevant to the old days of thread and process pools than to modern evented/async frameworks, but the broader point still applies.
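The per-endpoint circuit breaker I have in mind is roughly this (a minimal sketch, not any particular library's API; thresholds and the class name are made up):

```python
import time

class CircuitBreaker:
    """Minimal per-endpoint circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # cooldown before probing again (s)
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a probe through once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The point is that this state lives per endpoint in each client, so one slow endpoint sheds traffic without the health-check system having to notice anything.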
The article focuses on detection speed but misses the equally important problem of recovery speed. Backends that come back after a failure often get thundering-herded by all the clients that notice the recovery simultaneously. Connection ramping (slowly increasing traffic to a recovered backend) is just as important as fast detection.
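One simple way to do the ramp is a linear slow-start weight, something like the following sketch (function name, window, and floor are all assumptions, not from the article):

```python
import time

def ramped_weight(recovered_at, full_weight=100, warmup=60.0, floor=0.1, now=None):
    """Linearly ramp a recovered backend's load-balancing weight from a small
    floor up to full weight over `warmup` seconds, so the whole client fleet
    doesn't pile onto it the instant it passes its first health check."""
    now = time.monotonic() if now is None else now
    elapsed = now - recovered_at
    if elapsed >= warmup:
        return full_weight
    frac = max(floor, elapsed / warmup)
    return int(full_weight * frac)
```

A backend that recovered 30s into a 60s warmup would get half its normal share of traffic; Envoy's slow-start mode does something similar in spirit.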
* for client-side load balancing, it's entirely possible to move active healthchecking into a dedicated service and have its results vended along with discovery. In fact, more managed server-side load balancers are also moving healthchecking out of band so they can scale the forwarding plane independently of probes.
* for server-side load balancing, it's entirely possible to shard forwarders to avoid SPOFs, typically by creating isolated increments and then using shuffle sharding by caller/callee to minimize overlap between workloads. I think Alibaba's canalmesh whitepaper covers such an approach.
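The shuffle-sharding idea in the second bullet can be sketched in a few lines: hash the caller identity into a seed and deterministically sample a small subset of forwarders, so different callers land on mostly non-overlapping shards (names and shard size here are illustrative assumptions):

```python
import hashlib
import random

def shuffle_shard(caller_id, forwarders, shard_size=4):
    """Deterministically assign each caller a small pseudo-random subset of
    forwarders. Overlap between any two callers' subsets is small, so a
    poison workload from one caller can only take out its own shard."""
    seed = int(hashlib.sha256(caller_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    # Sort first so the result is stable regardless of input ordering.
    return rng.sample(sorted(forwarders), k=min(shard_size, len(forwarders)))
```

With, say, 4 forwarders chosen from a fleet of 100, the chance that two callers share their entire shard is vanishingly small, which is the whole point of the technique.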
As for scale, I think for almost everybody it's completely overblown to go with a p2p model. A reasonable estimate for a centralized proxy fleet is about 1% of infrastructure costs. If you want to save that, you need a team that can build/maintain your centralized proxy's capabilities in all the languages/frameworks your company uses, and you likely need to build the proxy anyway for the long tail. Whereas you can fund a much smaller team to focus on e2e ownership of your forwarding plane.
Add on top that you need a safe deployment strategy for updating the critical logic in all of these combinations, and continuous deployment to ensure your fixes roll out to the fleet in a timely fashion. This is itself a hard scaling problem.
The connection being active doesn't tell you that the server is healthy (it could hang, for instance, and you wouldn't know until the connection times out or a health check fails). Either way, you still have to send health checks, and either way you can't know between health checks that the server hasn't failed. Ultimately this has to work for every failure mode where the server can't respond to requests, and in any given state, you don't know what capabilities the server has.
The article explores how client-side and server-side load balancing differ in failure detection speed, consistency, and operational complexity.
I’d love input from people who’ve operated service meshes, Envoy/HAProxy setups, or large distributed fleets — particularly around edge cases and scaling tradeoffs.
Also, in HAProxy (that's the one I know), server-side health checks can run at millisecond intervals. I can't remember the minimum -- I think it's 100ms -- so theoretically you could fail a server within 200-300ms, instead of the 15 seconds in your post.
You need to be careful here, though, because the server might just be a little sluggish. If it's doing something like garbage collection, your responses might take a couple hundred milliseconds temporarily. A blip of latency could take your server out of rotation. That increases load on your other servers and could cause a cascading failure.
If you don't need sub-second reactions to failures, don't worry too much about it.
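For reference, the knobs above map to an HAProxy backend roughly like this (addresses and the health endpoint are made up; I haven't verified the minimum `inter` value):

```
backend app
    option httpchk GET /healthz
    # inter 100: probe every 100ms; fall 3: mark down after 3 consecutive
    # failures (~300ms); rise 2: require 2 successes before re-adding, which
    # also keeps a server that had one slow GC-pause response from flapping.
    default-server inter 100 fall 3 rise 2
    server app1 10.0.0.1:8080 check
    server app2 10.0.0.2:8080 check
```

Tuning `fall` up is the usual mitigation for the GC-blip problem: a single slow response doesn't eject the server, only a sustained run of them does.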
<meta name="viewport" content="width=device-width, initial-scale=1" />
For us who need to zoom in on mobile devices.