Fresh Hacker News | Production engineering when trading billions of dollars a day [video]

▲Production engineering when trading billions of dollars a day [video](youtube.com)

126 points by abstrus 1 day ago | 5 comments

▲jedberg 12 hours ago

This is a good talk. Really gets into the details of how things differ from the classical SaaS or consumer product.

I've been doing reliability for most of my career, and have always been able to hide behind, "We're not a bank, if we lose a few requests it doesn't matter". They can't do that. :)

One advantage that they have is that the market closes, so they can do maintenance that takes the whole system down, but when you're running a global consumer product, it's a lot harder to do that without pushback.

So for most of us, our stress is around zero downtime maintenance, and theirs is around never dropping a request when the system is live.

▲alemwjsl 21 minutes ago

Not sure what the practical difference is (24/7 vs ~10/5) except for the convenience when planning data migrations if you have regularly planned downtime.

For most code changes being turned off at night isn't much of an advantage, as the new code will need to go live at some point and that point is where the risk is. For systems on 24/7 you simply need a copy of your production environment to test on, a.k.a. staging.

The main thing about 24/7 is needing follow-sun SRE and/or out of hours oncall.

▲cyberpunk 11 hours ago

Yeah, I work on systems with reliability requirements like this at a large bank.

There are multiple layers of controls and manual interventions and things, which while absolutely painful, slow, expensive and shitstorm-conjuring -- are ultimately the final authority on some failures.

For example, in payments -- every single settlement or clearing anomaly is looked at by a real human, and rectified/rebooked manually.

So, yeah, the stakes can be really high when you have a couple billion in memory on your server, but -- it's just a system.

And it will fail, and we plan for it to do so.

▲gricardo99 12 hours ago

there’s a move now towards 24/7 trading. I guess we’ll see how the rigors of the trading environment mesh with zero down time. I’m sure the rollout will be slow and steady.

▲jedberg 11 hours ago

I've seen that. I suspect the exchanges will never go for it for this exact reason -- they need downtime for maintenance. But if it does go through, it will be a fun challenge to get 100% uptime!

I've always said that with infinite money we could get 100% uptime, but no one has infinite money. Trading firms are about as close as I can imagine to infinite money though.

▲skippyboxedhero 10 hours ago

How do you think on-chain exchanges do it? Hyperliquid has 16 employees, not engineers...total employees. It is possible, it isn't going to be possible for many of the legacy exchanges.

I work with a major one and, being honest, from day one it was obvious they were incompetent. They employ a huge number of engineers and are unable to deliver basic features at any reasonable pace. Not even remotely close to it either (as in: you ask them to do something, they say yes, execs say yes, you get a deadline, date comes...deployment difficulties, environment not working, run around goes on and on forever).

I remember the CEO got on a call with us at the start and was patting himself on the back saying they had no downtime...because they were able to do maintenance when markets shut (and I have heard very bad things about how that goes). But it is a 24/7 world now, our service is up 24/7 and, of course, this led to massive issues in time due to the very different expectations around delivery/quality. Our execs were impressed, our engineers said this was a bad sign. And, ofc, it transpired that they were total amateurs (to be clear, this is one of the biggest exchanges in the world) and were unable to deliver.

To come back to my original statement: there is a company of 16 people total that is, from the point of view of customers, delivering features faster. It is difficult to overstate how insane that is.

▲jedberg 10 hours ago

From what I've seen, on-chain transaction times are measured in seconds and minutes, not milliseconds. It's a lot easier when you have time to wait to process a queue.

▲skippyboxedhero 10 hours ago

Fastest ones are processing a block every 10ms.

It depends what you mean by easy. Even if you are using a slow chain, you still have to compete for finite block space, you still have to work out how to do risk/matching fast, etc.

With chains built for exchange use, operating them is easier; that is why they don't require thousands of engineers. But the actual technical capability of the system is significantly in excess of tradfi exchanges. For example, the risk function is real-time on-chain as opposed to EoD settlement. This significantly changes the possible feature set. Once you have built it, it is very easy...the question is why big exchanges rely so heavily on EoD processes? The answer is: they are bad at engineering.

▲foobar10000 9 hours ago

The EOD reconciliation (and corresponding inability to settle a position in milliseconds) is a feature - it allows "obvious erroneous trade" roll-back mechanisms, etc.

Very few people want the financial system to be a contractual suicide pact - they want it to be predictable, but when the unpredictable happens - they want the retail and institutional investor to be protected (the HFT players can go beat each other up - no one will really cry about them). And unpredictable can be anything from a power event taking out multiple exchanges in the NJ triangle (Hurricane Sandy) to a cyber-attack (has not happened yet) to a flash-crash driven by algorithms from multiple HFT firms driving each other nuts (happened at least once).

So, it is not EOD processes as such, but the ability to pause, assess the entire system holistically, and then correct it before it blows up the portfolios of everyone holding a 401k. So even though the exchanges _could_ go to 24/7 trading, I'd be surprised if we just went away from cyclical 24-hr based windows of settlement.

▲woah 7 hours ago

> they want the retail and institutional investor to be protected

That costs money, indirectly

▲foobar10000 4 hours ago

But it does allow these investors to participate in the markets without losing their shirts - and the lack of such liquidity would impact the market more so than the cost of the risk mitigation - which as you completely correctly noted is not free - both in first and second order terms.

▲infecto 9 hours ago

Hyperliquid while big in crypto is still small compared to mainstream financial markets.

I don’t think you have made a case for anything yet.

▲itsthecourier 9 hours ago

so let me get this straight Skippy, you're saying you got better performance and reliability than the LMAX disruptor with Multicast that runs inside many big exchanges?

I have a really hard time believing something decentralized will surpass the physical limitations of speed of light and low level assembler from C++ optimizations without any GC

also the fact that hyperliquid sequencing of orders is opaque and not opensource, and there is indeed latency in the consensus, I cannot believe yet there is p99 stability in completed transactions

▲cgio 7 hours ago

The first 90% of features takes 10% of the time to deliver. You are comparing capital infrastructure markets with deep regulatory obligations and multiple stateful interfaces (OUCH/FIX) to retail focused matching engines with a very slim stateless protocol surface (REST).

▲amluto 11 hours ago

An amusing, moderately expensive solution that might actually work would be to have a weekday system and a weekend system. Think of it as a spare D/R system that you intentionally swap twice a week :)

If done right, it would be a complete separate system. Separate IP addresses and all.

▲nippoo 11 hours ago

That's effectively time-based request sharding which seems sensible but you'd still have to reconcile trades and any open positions (etc) across the time boundary where one system stops accepting requests and the other one starts. And keep the databases synchronous (ie have some system to make sure they're in sync at the changeover time) - or have a few minutes/hours of downtime between weekends and weekdays while you copy the whole production database from one system to another. The devil is in the details!
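To make the changeover problem concrete, here is a minimal Python sketch of time-based routing plus a position check at the boundary. Everything in it is an assumption for illustration: the names `pick_system`, `is_handover` and `open_position_diff` are hypothetical, the boundary is taken to be midnight UTC on Saturday and Monday, and positions are modelled as a plain symbol-to-quantity dict.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple


def pick_system(ts: datetime) -> str:
    """Route a timezone-aware timestamp to the 'weekday' or 'weekend' system."""
    ts = ts.astimezone(timezone.utc)
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    return "weekend" if ts.weekday() >= 5 else "weekday"


def is_handover(prev_ts: datetime, ts: datetime) -> bool:
    """True when two consecutive requests land on different systems."""
    return pick_system(prev_ts) != pick_system(ts)


def open_position_diff(
    outgoing: Dict[str, int], incoming: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return every symbol whose open position differs between the two systems.

    An empty result is the precondition for letting the incoming system
    start accepting requests.
    """
    symbols = set(outgoing) | set(incoming)
    return {
        s: (outgoing.get(s, 0), incoming.get(s, 0))
        for s in symbols
        if outgoing.get(s, 0) != incoming.get(s, 0)
    }
```

The routing is the trivial part. The check in `open_position_diff` is where the real work would be, since it has to run against a quiesced outgoing system, which is exactly the short downtime window mentioned above.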

▲justinclift 9 hours ago

Heh, maybe they'll develop a sudden interest in the old Vax VMS clustering approach? ;)

▲gigatexal 11 hours ago

I hated my time as an SRE. But … can’t it be done with some combination of canaries and blue green deployments and extensive testing? Where when things look good you just swap all the traffic to the good stuff keeping the rollback hot etc etc?

▲jedberg 11 hours ago

That's how we got 99.99% at Netflix. And it cost a lot of money. But a canary implies that something may go wrong and you have to roll back. The canary is still production traffic, so some transactions would fail, which isn't allowed for this kind of workload.

I imagine you'd have to use shadow execution, where you roll out a full second copy, run every transaction through both, and compare the results. And then, only after a certain time, switch traffic to the new infra and tear down the old.

But you would need a ton of extra hardware (more than double) and a lot of ways to keep data in sync. And of course if you put an LLM or other non-deterministic system in there, that's a whole other can of worms.
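A rough Python sketch of that shadow execution idea, under heavy assumptions: both systems are modelled as pure functions from a request dict to a response dict, and `shadow_run`, `primary` and `candidate` are hypothetical names, not anyone's real API. Only the primary's answer is ever returned to the caller.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Handler = Callable[[Dict], Dict]


def shadow_run(
    requests: Iterable[Dict], primary: Handler, candidate: Handler
) -> Tuple[List[Dict], List[Tuple[Dict, Dict, Dict]]]:
    """Serve every request from `primary` and mirror it to `candidate`.

    Returns the live responses plus a list of (request, live, shadow)
    triples for every request where the two systems disagreed.
    """
    responses: List[Dict] = []
    mismatches: List[Tuple[Dict, Dict, Dict]] = []
    for req in requests:
        # Each side gets its own copy so neither can mutate the other's input.
        live = primary(dict(req))
        try:
            shadow = candidate(dict(req))
        except Exception as exc:
            # A crash in the candidate must never affect the live path.
            shadow = {"error": repr(exc)}
        if shadow != live:
            mismatches.append((req, live, shadow))
        responses.append(live)
    return responses, mismatches
```

You would only cut traffic over once `mismatches` has stayed empty for long enough. The sketch also shows why non-determinism hurts: the comparison `shadow != live` only means something if both sides are deterministic given the same input.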

Like I said, a fun problem to solve. :)

▲gigatexal 11 hours ago

Folks that keep the lights on 24/7 aka SREs are super heroes that wear capes. Thank you for your service.

I couldn’t do it. I like infra and all but it’s just not my cup of tea. Def true that in a trading pov the trade must be executed. It must settle. It must work. Or capital flight will be huge.

▲cgio 6 hours ago

There are different kinds of updates that influence options and feasibility. Keeping in mind that deep in the heart of an exchange is a single threaded process, the sequencer. Therefore, you have three layers, external facing protocols, sequencer/matching engines, and internal interfaces. Internal interfaces are the easiest for b/g. External protocols, any change worth its weight changes the protocol and therefore requires participants to change their codebases too. Versioning protocols is an option, but still the integration with consumers is much more transparent and usually you have them test on pre-prod environments, occasionally also requiring attestation and conformance testing (regulated markets). Sequencer and matching engine are at the core. You could do parallel runs but not b/g. Theoretically you could abstract the matching engine and keep a barebones sequencer immutable, but this will have performance implications. So yes, you can do things, but not in a completely transparent way, unless you introduce an “upgrade jitter” to give you a window for transparent upgrades. It’s an interesting domain, I think people will just accept occasional downtimes as a better option than constant jitter cost.
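A toy Python illustration of why the sequencer sits at the core and why parallel runs are possible, with all names (`Sequencer`, `replay`, `apply_fill`) made up for the sketch. Real sequencers are nothing like a Python queue; the point is only the shape: many gateways submit, one thread assigns a gap-free order, and anything downstream that is deterministic can be replayed from the same log and compared.

```python
import queue
import threading
from typing import Any, Callable, Dict, List, Tuple


class Sequencer:
    """Stamps inbound messages with a gap-free, strictly increasing sequence number."""

    def __init__(self) -> None:
        self._inbound: "queue.Queue[Any]" = queue.Queue()
        self.log: List[Tuple[int, Any]] = []
        self._next_seq = 1

    def submit(self, msg: Any) -> None:
        # Safe to call from any gateway thread; queue.Queue handles the locking.
        self._inbound.put(msg)

    def drain(self) -> None:
        # Must be called from exactly one thread: that single consumer is
        # what turns concurrent input into one total order.
        while True:
            try:
                msg = self._inbound.get_nowait()
            except queue.Empty:
                return
            self.log.append((self._next_seq, msg))
            self._next_seq += 1


def replay(log: List[Tuple[int, Any]], apply: Callable[[Dict, Any], None]) -> Dict:
    """Feed the sequenced log to a deterministic engine and return its final state."""
    state: Dict = {}
    for _seq, msg in log:
        apply(state, msg)
    return state


def apply_fill(state: Dict[str, int], msg: Tuple[str, int]) -> None:
    """Stand-in for a matching engine: just accumulates quantity per symbol."""
    symbol, qty = msg
    state[symbol] = state.get(symbol, 0) + qty
```

A parallel run is then two `replay` calls over the same `log` with the old and new engine, compared at the end. What this does not give you is blue/green for the sequencer itself, since there is only one place where order is decided.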

▲bostik 1 hour ago

Sports betting exchanges have been doing that for a very long time. There is never a good time to take the system down for maintenance - event settlements happen every few minutes, and live games with in-play betting are going on somewhere in the world at any given time.

Makes things damn hard indeed, because you have to truly learn asynchronicity, CQRS and complex live migrations. (Incidentally, engineers who have worked on such systems tend to be over-represented in extreme HA businesses.)

▲cgio 7 hours ago

Only US. Other markets barely have liquidity during daytime and get most liquidity in opening and closing auctions. Maintenance periods are actually a complication. A few more state transitions for the system, but barely used for maintenance. The only value is for upgrades, which would still be scheduled with the market down and systems up, as participants also need to transition codebases for breaking changes, a test weekend or more is required etc. These systems are extremely resilient. You most often get an incident not because the system is down but because the latency profile has changed by a few ms.

▲willtemperley 5 hours ago

Not really US only. LMAX is 24/7 and is a UK company, famous on HN for open sourcing their ring buffer.

Crypto trading has been 24/7 since it began.

▲cgio 5 hours ago

Ok, I will make it mostly US. Maybe a couple more markets, London and Tokyo? Futures may be a bit broader adoption too. Vast majority won’t move to 24/7. Crypto is a different game for the time being at least. It has its own challenges but also escapes quite a few of traditional exchange complexities.

▲TacticalCoder 11 hours ago

> there’s a move now towards 24/7 trading.

Isn't the plan more like 23/5 like is already the case for several markets?

I can't see the standard sessions moving from 9:30am/4pm weekdays to 24/7. I take it they'd still leave at least one hour off for technical reasons.

If I'm not mistaken it's the reason several markets are 23/5 and not 24/5: that one hour of downtime is basically for servers/maintenance right? (maybe someone can chime in)

P.S: I take it technically there's 24/7 trading already, seeing that cryptocurrency exchanges are open 24/7 (I'm not sure, but I think that's the case), but I don't think those do anywhere near the volume of, say, options trading on equities during standard sessions (40 Gbit/s with peak over 70 Gbit/s for the full options feed).

▲dmurray 11 hours ago

The 23/5 is not so much for maintenance as to have a defined window for changes to the market to happen.

Every so often a new stock is listed or a stock ticker is changed or a stock is split, etc. There are smaller changes every single day, like to the settlement date of your trade.

It's very convenient to be able to restart all your systems at 5pm, have them all load the updated reference data, and start them again in time for 6pm (or 7pm, or 4am tomorrow...). Even if you trade stocks and options and currencies and futures all over the world, a quirk of the calendar means they're basically all closed between 4 and 5pm Chicago time.

Of course it's possible in principle to build systems where all this is dynamic and you can seamlessly trade with the old configuration at 4:59:59.999 and start trading the new one a millisecond later. But literally everyone has built systems that don't work on this, that rely on being able to chunk the continuous passage of time into discrete days. It would be painful to rearchitect them all now.
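For what the "dynamic" version might look like, here is a small Python sketch of effective-dated reference data, where every lookup takes the trade timestamp instead of relying on a daily reload. The `EffectiveDated` class and the `OLD`/`NEW` ticker values are invented for the example; this is an assumption about one possible design, not how any particular firm does it.

```python
import bisect
from datetime import datetime
from typing import Any, List


class EffectiveDated:
    """A value that changes at known instants; lookups are by timestamp."""

    def __init__(self) -> None:
        self._times: List[datetime] = []   # kept sorted ascending
        self._values: List[Any] = []       # _values[i] applies from _times[i]

    def set(self, effective_from: datetime, value: Any) -> None:
        """Register `value` as valid from `effective_from` (inclusive) onwards."""
        i = bisect.bisect_right(self._times, effective_from)
        self._times.insert(i, effective_from)
        self._values.insert(i, value)

    def at(self, ts: datetime) -> Any:
        """Return the value in force at `ts`."""
        i = bisect.bisect_right(self._times, ts)
        if i == 0:
            raise LookupError("no value effective at %s" % ts.isoformat())
        return self._values[i - 1]
```

With this, a trade stamped 16:59:59.999 and one stamped 17:00:00.000 resolve against different configurations without a restart. The painful part is the one already described: every system that touches the data has to carry a timestamp into every lookup instead of assuming "today".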

▲sikozu 9 hours ago

I love talks like this so much. Trading is something I hardly ever think about, as I live/work in a bubble of sorts, and it was a fascinating listen.

I have heard similar talks from Shopify and such back in the day, about their own product, but always love listening to more.

▲derwiki 6 hours ago

Which Shopify talks?

▲xyst 8 hours ago

Standard SRE operations. Didn’t find anything notable.

The clickbait title of "billions of dollars a day" is nothing to praise.

▲subscribed 6 hours ago

LOL, no. That's bisecting the patches to find the spurious latency jitter in the critical path, that's carefully planning apps to fit the specific NUMA design, being on a first-name basis with the engineers at the NIC vendor, etc, etc.

It's fun, because one lost or late packet is an issue immediately red in the monitoring.

I've been SRE too and the most it brought to the table is a concept of error budget.
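For anyone unfamiliar with the term, the error budget is just the downtime an availability target leaves you. A quick Python sketch, assuming a 30-day window (the function name and the window length are my choices for illustration):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability target `slo` (e.g. 0.9999)."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60
```

At 99.99% that comes to about 4.3 minutes per 30 days, which is a budget you can spend on risky deploys. A workload where a single late packet is an incident effectively has no budget to spend.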

I can only agree that "billions of dollars" in trades is not much.

▲laidoffamazon 9 hours ago

I'll never understand how these cognitive elites live. They're just a completely different kind of human than the rest of us.

▲LPisGood 8 hours ago

What do you mean by “cognitive elites?”

I’ve met some exceptional people: top researchers from top universities from several fields, super well paid engineers working on products you probably use, some of the best hackers an advanced persistent threat actor could ask for; they’re just people.

I think if you get a collection of competent, thoughtful people together they would come up with similar solutions to the problems discussed in this talk.

▲seb1204 6 hours ago

I am sure they are, my fascination is how they manage to get so much more out of the same 24 hours than me. Sometimes I would just like to know how others manage all the worldly churn that seems to suck my time, starting from cleaning the kitchen, toilets and the home, food prep and washing up, being active, mowing the lawn, reading a good book, doing taxes, bringing the vehicle to service, fixing that phone for grandma, etc. So how do they do it? Is it character, personality, upbringing...? What helped put them on this trajectory? So in short I'm curious about their story.

▲linkregister 1 hour ago

I assume that Jane Street employees likely use house cleaners, food delivery, laundry services, etc. I wouldn't be the least bit surprised if part of the employee onboarding includes a list of common convenience services like these. Some employers pay for or subsidize such services; I don't know if Jane Street does.

Personal sacrifice is often expected. Fixing the phone for grandma would simply be neglected for many employees in these positions.

▲hansvm 1 hour ago

> so much more out of the same 24 hours

Do they? Are they doing more work, or is the set of things they've chosen to be good at more stereotypically impressive than the set of things you've chosen to be good at?

> all the worldly churn

Surely all of that adds up to <1hr/day (assuming exercise does double-duty with other intellectual activities and general unwinding)? You could work a pretty intense schedule and still have plenty of time for personal development with that level of overhead, so long as you actually stuck with it and got everything done.

> food prep and washing

I'd be curious to hear more about what this looks like for you. I might have ideas.

▲fatata123 3 hours ago

Get a dishwasher. Don’t rely on a car. There are plenty of life-changing efficiency hacks that can free up hours of time a day.

▲rvz 26 minutes ago

Nope. Jane Street engineers just love what they do; and do it for fun in their own time and professionally and they love solving puzzles.

They just think for themselves to make money and not what their manager tells them to do unless either the manager or trader believes they will lose money and the game.

The only relevant difference to you is that you must have zero morals or ethics in this game and it is actually a high stress environment behind the videos and engineering propaganda which is the slender difference between beating or losing to other competitors for their clients.

So really stop worshipping them and fuelling their egos as that is what they (Jane Street) want you to do.