I know because I stumbled onto the same page by following links from the blog of the author of another post that made the front page yesterday (https://news.ycombinator.com/item?id=45589156), liked the TernFS concept, submitted it, and got redirected to https://news.ycombinator.com/item?id=45290245
If it is decisively better than Lustre, I am happy to make the switch at my sector at Argonne National Lab, where we currently keep about 0.7 PB of image data and eventually expect to hold 3-5 PB once we switch all three of our beamlines over to Dectris X-ray detectors.
Contrary to what the non-computer scientists insist, we only need about 20 Gb/s of throughput in either direction, so robustness and simplicity are really our only concerns.
Something like this [1] gets you 44 disks in 4U. You can probably fit 9 of those, plus a server with enough HBAs to interface with them, in a 42U rack. 9 x 44 x 20 TB = not quite 8 PB; adjust for redundancy and/or larger drives (rough arithmetic below). If you go with SAS drives, you can have two servers connected to the drives, with failover. Or you can set up two of these racks in different locations and mirror the data (somehow).
[1] https://www.supermicro.com/en/products/chassis/4U/847/SC847E... (as an illustration, sas jbods aka disk shelves are widely available from server vendors)
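For the capacity math, a quick back-of-the-envelope sketch in Python. The drive size, shelf count, and the RAID-Z2 layout used for the redundancy adjustment are assumptions for illustration, not a recommendation:

    # Back-of-the-envelope capacity check for the rack described above.
    SHELVES = 9            # 9 x 4U JBODs leaves room in a 42U rack for the server(s)
    DISKS_PER_SHELF = 44
    DRIVE_TB = 20

    raw_tb = SHELVES * DISKS_PER_SHELF * DRIVE_TB
    print(f"raw: {raw_tb / 1000:.2f} PB")          # 7.92 PB, i.e. "not quite 8"

    # Rough usable figure under assumed 10-wide RAID-Z2 vdevs (8 data + 2 parity);
    # ignores spares, metadata overhead, and the usual ~80% fill guideline.
    usable_tb = raw_tb * 8 / 10
    print(f"usable (assumed RAID-Z2 8+2): {usable_tb / 1000:.2f} PB")   # ~6.3 PB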
However, you are right. Your bandwidth needs don't really require Lustre.
I'm not joking, and I didn't ask this as a way to name-drop my experience and credentials (common 'round this neck o' the woods). I honestly don't know what the much more competent organizations are doing and would really like to find out.
https://docs.ceph.com/en/quincy/cephfs/index.html
Still not completely decoupled from host roles, but seems to work for some folks. =3
It would make for one heck of a FreeBSD development project grant, considering how superb their ZFS support and their networking stack each are on their own.
P.S. Glad someone pointed this out tactfully. A lot of people would have pounced on the chance to mock the poor commenter who just didn't know what he didn't know. The culture around software development falsely equates being opinionated with being knowledgeable, so hopefully we get a lot more people chipping away at the stigma of not knowing and of saying "I don't know".
I think the key to making it horizontally scalable is to allow each writable dataset to be managed by a single node at a time. Writes would go to blocks reserved for use by a particular node, but at least some of those blocks would be on remote drives via NVMe-oF or similar. All writes would be treated as sync writes so another node could take over losslessly via ZIL replay.
Read-only datasets (via property or snapshot, including clone origins) could be read directly from any node. Repair of blocks would be handled by the specific node responsible for that dataset.
A primary node would be responsible for managing the association between nodes and datasets, including balancing load and handling failover. It would probably also be responsible for metadata changes (datasets, properties, nodes, devs, etc., not POSIX fs metadata) and the coordination required across nodes.
I don’t feel like I have a good handle on how TXG syncs would happen, but I don’t think that is insurmountable.
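To make the shape of that idea concrete, here's a minimal toy sketch of the single-owner-per-dataset, sync-log-then-replay pattern described above. Everything in it is invented for illustration (the Coordinator, OwnerNode, SharedPool, and IntentLog names, the in-memory "TXG", the whole API); it is not ZFS code, just the ownership/failover flow:

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    Block = Tuple[str, int]   # (dataset, offset)

    @dataclass
    class SharedPool:
        """Committed blocks on storage every node can reach (e.g. over NVMe-oF)."""
        blocks: Dict[Block, bytes] = field(default_factory=dict)

        def read(self, dataset: str, offset: int) -> bytes:
            # Committed / read-only data needs no owner: any node may call this.
            return self.blocks[(dataset, offset)]

    @dataclass
    class IntentLog:
        """Per-dataset synchronous write log; must be durable before a write acks."""
        records: List[Tuple[Block, bytes]] = field(default_factory=list)

    @dataclass
    class OwnerNode:
        name: str
        pool: SharedPool
        dirty: Dict[Block, bytes] = field(default_factory=dict)   # open "TXG" contents

        def sync_write(self, log: IntentLog, dataset: str, offset: int, data: bytes) -> None:
            log.records.append(((dataset, offset), data))   # durable first (sync semantics)
            self.dirty[(dataset, offset)] = data            # then buffer in the open TXG

        def txg_sync(self, log: IntentLog) -> None:
            # Flush the open TXG to the shared pool; the log can then be discarded.
            self.pool.blocks.update(self.dirty)
            self.dirty.clear()
            log.records.clear()

        def replay(self, log: IntentLog) -> None:
            # Lossless takeover: rebuild writes that never made it to a TXG sync.
            for block, data in log.records:
                self.dirty[block] = data

    class Coordinator:
        """The 'primary node': tracks which node owns each writable dataset."""

        def __init__(self, nodes: List[OwnerNode]):
            self.nodes = {n.name: n for n in nodes}
            self.owner_of: Dict[str, str] = {}
            self.logs: Dict[str, IntentLog] = {}

        def assign(self, dataset: str, node: str) -> None:
            self.owner_of[dataset] = node
            self.logs.setdefault(dataset, IntentLog())

        def write(self, dataset: str, offset: int, data: bytes) -> None:
            owner = self.nodes[self.owner_of[dataset]]
            owner.sync_write(self.logs[dataset], dataset, offset, data)

        def fail_over(self, dataset: str, new_owner: str) -> None:
            self.owner_of[dataset] = new_owner
            self.nodes[new_owner].replay(self.logs[dataset])

    if __name__ == "__main__":
        pool = SharedPool()
        a, b = OwnerNode("node-a", pool), OwnerNode("node-b", pool)
        coord = Coordinator([a, b])
        coord.assign("tank/images", "node-a")
        coord.write("tank/images", 0, b"frame-0001")   # logged but not yet TXG-synced
        coord.fail_over("tank/images", "node-b")       # pretend node-a died here
        b.txg_sync(coord.logs["tank/images"])          # new owner commits the replayed write
        assert pool.read("tank/images", 0) == b"frame-0001"

The hand-waving is exactly where you'd expect: the real questions are how the intent log is made durable somewhere a takeover node can see it, and how TXG syncs are coordinated, which is the part I admit I don't have a handle on.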
All I know is that the semantics of RDMA (absent any experience actually writing RDMA code) deceive me into thinking there's some chance I could try it and not end up regretting the time spent on a proof of concept.