However, you should consult the book "C: Interfaces and Implementations" by Dave Hanson, which has a library called "Arena" that is very similar to your "Pool"s, and it shows a few more tricks to make the code safer and easier to read (Chapters 5+6, 6 in particular).
D. Hanson's book (written in the 'literate programming' style invented by Donald Knuth) also demonstrates having debugging and production implementations of the same C API to catch memory errors, and his book is from 1997:
> Copyright © 1997 by David R. Hanson
(He used to be a SE professor at Princeton before getting hired by Microsoft, I believe.)

My bucket list includes an item to write a novel, efficient implementation of malloc/free. I have taken several stabs at this in the past but have never been happy with the results. I have no real need for a custom allocator, but it is something that I think would be very cool to have under my belt. This was inspired by K&R.
The allocator in Hanson's chapter 6 is an arena allocator, not a "pool allocator" in the sense that 8dcc means, which is to say, a constant-time fixed-size-block allocator. It seems that you had the same reading-comprehension problem that I did.
The difference is that Hanson's arena allocator can make allocations of any size (his Arena_alloc takes an nbytes argument) and cannot deallocate individual allocations, only the entire arena (his Arena_free takes only an arena pointer as a parameter). A fixed-size-block allocator like the one we are discussing here can only make allocations of a fixed size, but can recycle them individually, not only all at once.
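To make the contrast concrete, here is a minimal sketch of the fixed-size-block kind of pool being discussed (my own illustrative code, not Hanson's API and not 8dcc's): every allocation is exactly one chunk, and chunks can be recycled individually in O(1) via an intrusive free list.

```c
#include <assert.h>
#include <stddef.h>

enum { CHUNK_SZ = 64, NCHUNKS = 128 };

/* While a chunk is free, its first bytes hold the free-list link;
 * while it is allocated, the whole union belongs to the user. */
typedef union chunk {
    union chunk *next;          /* valid only while the chunk is free */
    char payload[CHUNK_SZ];
} Chunk;

static Chunk pool[NCHUNKS];
static Chunk *free_list;

void pool_init(void) {
    for (size_t i = 0; i + 1 < NCHUNKS; i++)
        pool[i].next = &pool[i + 1];
    pool[NCHUNKS - 1].next = NULL;
    free_list = &pool[0];
}

void *pool_alloc(void) {        /* O(1): pop the free list */
    Chunk *c = free_list;
    if (c) free_list = c->next;
    return c;
}

void pool_free(void *p) {       /* O(1): push back onto the free list */
    Chunk *c = p;
    c->next = free_list;
    free_list = c;
}
```

Note there is no nbytes parameter anywhere: the fixed size is what buys the constant-time alloc and free.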
(If Hanson's name sounds familiar, it might be because you've read the lcc book.)
It may be worth noting that Donald Knuth also invented the font used in 8dcc's article.
Both programs and journal papers are mostly written the way they are because of tradition, not because it's the best way we've found to write them. Two examples:
Nobody would intentionally design a programming language to make it impossible to include the kind of diagrams that illustrate this article, but most programming languages are in fact designed in such a way.
Buckminster-Fuller-style "ventilated prose" turns out to be more readable than traditional paragraph formatting in psychology experiments, which is why we use it universally in our programs. But you can't publish a paper formatted that way unless you call it a poem.
I like that there is a more or less exhaustive blog post about it which happens to look nice as well.
https://developers.redhat.com/articles/2022/03/23/use-valgri...
https://github.com/google/sanitizers/wiki/addresssanitizer
I believe that to integrate a custom allocator with ASAN, you need to look at the memory "poisoning" mechanisms:
https://github.com/google/sanitizers/wiki/AddressSanitizerMa...
It's not only that you want special behavior for the garbage collector to assist with debugging; there is also this: if you want to use Valgrind to debug anything that does not have to do with the garbage collector, you want all the false positives it generates out of your way.
Some of the scanning that a conservative garbage collector has to do looks like access errors to Valgrind. By using the Valgrind API, you can grant yourself access to that memory, and then lock it again when done.
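A sketch of how that can look in C. The macros are Valgrind's real client requests from <valgrind/memcheck.h>; the no-op fallbacks (an assumption of mine, but a common pattern) let the same code build when Valgrind isn't installed.

```c
#include <stddef.h>

/* Keep pool chunks inaccessible (to Memcheck) while they sit on the free
 * list, and open them up again on allocation. */
#ifdef USE_VALGRIND
#include <valgrind/memcheck.h>
#else
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len)  ((void)(addr), (void)(len))
#define VALGRIND_MAKE_MEM_UNDEFINED(addr, len) ((void)(addr), (void)(len))
#endif

static char chunk[64];   /* stand-in for one pool chunk */

void on_free(void *p, size_t len) {
    /* User code must not touch a freed chunk. A conservative GC scan
     * would briefly VALGRIND_MAKE_MEM_DEFINED the region here, scan it,
     * then lock it again with NOACCESS when done. */
    VALGRIND_MAKE_MEM_NOACCESS(p, len);
}

void *on_alloc(void *p, size_t len) {
    VALGRIND_MAKE_MEM_UNDEFINED(p, len);  /* fresh allocation: undefined */
    return p;
}
```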
That aspect doesn't necessarily apply to the pool allocator described in this article; it should not be triggering any false positives.
> I integrated the TXR Lisp garbage collector with valgrind.
That's funny, because when I said "for other projects I am working on", I specifically meant a Lisp interpreter which uses a memory pool and garbage collection (I am still working on implementing this last part, anyway). I am very happy about the fact that I can continue using valgrind in this project.
[1] https://github.com/google/sanitizers/wiki/AddressSanitizerMa...
So, for example, if you are willing to limit yourself to 64 KiNodes of equal-sized objects, a pool like https://github.com/c-blake/bst/blob/main/POOL.c can use an unsigned short 2-byte pointer. This small-address-space specialization gives a full 4x space saving over modern 64-bit pointers.
You could even use 8x less with 1-byte pointers if 256 nodes is enough or use bit fields or equivalents for even more fine-grained address space size selection. Admittedly, linked structures are usually less interesting for very small address space sizes, but this is sometimes useful. People may complain about arbitrary limits that relate to this kind of optimization, but the linked part could be "small by design" without compromising overall scale { such as for structure within a B-tree node or other multi-level indexing setups }.
In any event, I think it is useful to highlight pointer size arbitrariness to junior engineers. It is often learned once & immediately forgotten behind an onslaught of general purpose-ness (& pointer type systems/syntax). Relocation ideas also relate. E.g., if all the pointers are implicitly indices into a (small?) array, and all the code needs a "base address" of that array to get a VMem address, then you can relocate the whole bundle of objects pretty easily changing only the base address.
¹ https://floooh.github.io/2018/06/17/handles-vs-pointers.html
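A minimal sketch of the small-index-pointer idea discussed above (illustrative names of my own, not POOL.c's): links are 2-byte indices relative to a base array, so relocating the whole bundle of objects only changes the base address.

```c
#include <stdint.h>
#include <string.h>

/* "Pointers" are indices into a node array; 0xFFFF plays the role of NULL. */
typedef uint16_t NodeRef;
#define NIL ((NodeRef)0xFFFF)

typedef struct {
    int32_t value;
    NodeRef next;      /* 2 bytes instead of an 8-byte machine pointer */
} Node;

/* Walking a list needs the base address plus a starting index. */
int32_t list_sum(const Node *base, NodeRef head) {
    int32_t s = 0;
    for (NodeRef i = head; i != NIL; i = base[i].next)
        s += base[i].value;
    return s;
}

/* Relocation demo: copy the whole bundle somewhere else; the links stay
 * valid because they are relative to whatever base address we pass in. */
int32_t demo_relocated(void) {
    Node a[3] = { {1, 1}, {2, 2}, {4, NIL} };
    Node b[3];
    memcpy(b, a, sizeof a);
    return list_sum(b, 0);
}
```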
Do notice that it means runtime bounds checking. It also mentions the problem of a handle referencing an object that's no longer there, or that has changed, and that the probability of this happening rises with time. This pattern works great in single-thread/single-process code but starts decaying hard once you have preemptive concurrency. There you really cannot safely do much of anything without some sort of locking or copy-on-write semantics (which is also the case with a lot of other systems; I'm just pointing out that this isn't solved magically by handles, though the additional metadata can help detect the use-after-free and handle-now-refers-to-a-different-object problems).
So the age-old objection that bounds checking slows you down but is necessary to detect problems at runtime applies here. You won't get branch-free code this way compared to straight pointer references, but on the other hand you get a lot closer to the concepts of owners and borrowers in straight C, which is cool.
Anyway, you probably knew all that & I don't mean to suggest otherwise, but it made sense to write since HN loves PL talk.
Paper: LLFree: Scalable and Optionally-Persistent Page-Frame Allocation https://www.usenix.org/system/files/atc23-wrenger.pdf
Video presentation: https://www.youtube.com/watch?v=yvd3D5VOHc8
Implementation and benchmarks are well documented at the repos:
Rust repo https://github.com/luhsra/llfree-rs C repo https://github.com/luhsra/llfree-c
- Added a link to this HN post.
- Renamed some variables and types in the code for readability.
- Mention (in a footnote) that we could allocate the 'Pool' structure and the chunk arrays with a single call, or even return the 'Pool' on the stack. Suggested by 'liontwist' and 'unwind'.
- Mention (in a footnote) that we don't always need to store the full address when building the linked list of free chunks. Suggested by 'cb321'.
- Added valgrind support (also to my 'libpool' project). Suggested by 'kazinator'.
This can really add up. For example, around 20 years ago I did a `print sizeof` in gdb on some g++ STL map node with a data payload of about 4 bytes, and it was some obscene size, roughly an order of magnitude bigger than necessary (like 80 bytes instead of, say, 8 with 2-byte pointers). While main memory is cheap, CPU cache memory is not, and for "linky" structures it often pays to be more cache-friendly. It's pretty easy to imagine a 10X smaller structure fitting into an L2 CPU cache where the larger one would not even fit in L3, netting you a big latency boost (though the more root-ward levels of the tree likely stay in cache throughout a hot loop).
EDIT: Anyway, you don't have to believe me. Others did an x32 ABI for x86_64 mostly to have "merely 2X" smaller pointers: https://en.wikipedia.org/wiki/X32_ABI
So i ended up with this : https://github.com/jinie/memmgr
[1]: http://dmitrysoshnikov.com/compilers/writing-a-pool-allocato...
[2]: http://dmitrysoshnikov.com/compilers/writing-a-memory-alloca...
This kind of allocator is often used to suballocate GPU memory in game and graphics applications.
I'm using a variant of this algorithm with added support for shrinkable and aligned allocations and flexible bin sizing.
You can also extend this idea to two dimensions to create texture atlases, which is possible in O(1) for power of two sized allocations.
Original: https://github.com/sebbbi/OffsetAllocator Rust port: https://crates.io/crates/offset-allocator
Also sometimes useful to provide another call to promote an automatic item to a more permanent one so that it is excluded from being freed like this.
(on windows, you'd need to handle this with VirtualAlloc. osx and linux let you overcommit with malloc, and handle it with a copy-on-write page fault handler)
A good reference on strict aliasing: https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...
malloc() returns an address. The user code writes a value, and then reads a value of the same type. No problem.
Later, malloc() returns the same address. The user code writes a value of another type then reads a value of that same type. Again, no problem. There's no aliasing going on here.
The act of writing a value of a different type tells the compiler that the lifetime of the previous object has ended. There's no special magic required.
Strict aliasing is about situations where we write a value of one type, and then attempt to read a value of a different type from the same location. If you want to do that, then you have to be extremely cautious. But memory allocators don't do that, so it's not an issue.
(Obviously just talking about C here, or POD in C++ terms. If you are dealing with C++ objects with destructors, then you have an extra layer of complexity to deal with, but again that's nothing to do with aliasing.)
afaik only memcpy has that magic property, so I think parent is almost correct.
void *p = malloc(n);
*(int *)p = 42; // ok, *p is now an int.
//*(float *)p = 3.14f; // I think this is not allowed, p points to an int object, regular stores do not change effective type
float x = 3.14f;
memcpy(p, &x, sizeof(float)); // but this is fine, *p now has effective type float
So, in the new pool_new:

pool->chunk_arr[i].next = &pool->chunk_arr[i + 1];

This sets the effective type of the chunk block to 'Chunk'. Later, in pool_alloc:
Chunk* result = pool->free_chunk;
...
return result;
result has effective type 'Chunk'. In user code:
int *x = pool_alloc();
*x = 42; // aliasing violation: *x has effective type 'Chunk' but we access it as an int
User code would need to look like this:

int *x = pool_alloc();
memcpy(x, &(int){0}, sizeof(int)); // establish the new effective type as 'int'
// now we can do
*x = 42;
C should issue a defect report and get rid of that nonsense from the standard.
This enables security analysis like valgrind/ASan and secure hardware like MTE/CHERI so it's very important and you can't get rid of it.
However, it's not possible to implement malloc() in C because malloc() is defined as returning new "memory objects" and there is no C operation which creates "memory objects" except malloc() itself. So it only works as long as you can't see into the implementation, or if the compiler gives you special forgiveness somehow.
C++ has such an operation called "placement new", so you want something like that.
It gets complicated when you have virtual memory and an OS involved but even then you can override the system malloc with a simple implementation that allocates from a large static array.
(The difference here is that system malloc() works with valgrind, -fbounds-safety, theoretical secure hardware with bounds checking etc., and this one doesn't.)
This doesn't make any sense. How do you know its effective type if you don't have access to the definition of `pool_alloc()'?
It's really quite ridiculous that compiler implementers have managed to overzealously nitpick the C standard so that you can't implement a memory allocator in C.
This is good because it's also what gives you valgrind and CHERI. Take one away and you can't have the other.
(Because if you define all undefined behavior, then programs will rely on it and you can't assume it's an error anymore.)
I think strictly speaking in C this is only true for anonymous memory, i.e. memory you got from some allocator-like function. So if your custom allocator gets its memory from malloc (or sbrk, or mmap), everything is fine, but if, for example, you are allocating from a static char array, it is formally UB.
In C++ you can use placement new to change the type of anything.
C++ relatively has fixes to allow allocators to work, it requires calls to std::launder.
If you look at the 'pool_free' code, you can see that it receives a used block from the user as 'ptr', then casts it to a 'Chunk' pointer and writes into the Chunk member a value of type 'Chunk pointer'. I had to change that line to memcpy when I last did this, around 15 years ago, in C++ with GCC 4.something. In short, if you are writing your own allocators, you need either to disable strict aliasing like Linux does, or to do the 'memcpy' dance when you access memory as 'internal allocator structures' right after it was used as a user type. When it happened to me, a write was reordered: I observed the code that writes into the Chunk's 'next' executing first, and the write of zeros into the user type second.
It is trivial to co-allocate them which removes the risk of memory fragmentation and is just generally a good idea if you're chasing performance.
The answer might be "for clarity/simplicity" which I guess is fine in an informative article, but it could at least be highlighted in the text which I didn't see.
A C11 implementation could go one step further and use _Alignas(max_align_t) to keep the pool array aligned with no manual effort. The double allocation does this implicitly.
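A sketch of that point, assuming C11: the pool's backing storage gets the strictest alignment, and a small helper (my own, for illustration) rounds awkward chunk sizes like 9 up to an aligned size so later chunks stay aligned too.

```c
#include <stddef.h>
#include <stdint.h>

/* Align the pool's backing array for any object type, with no manual
 * effort at run time. */
static _Alignas(max_align_t) unsigned char pool_mem[4096];

/* Round a requested chunk size up to the next multiple of the strictest
 * alignment, so chunk N * size is aligned for every N. */
size_t round_chunk(size_t sz) {
    size_t a = _Alignof(max_align_t);
    return (sz + a - 1) / a * a;
}
```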
This issue is sidestepped in TFA by using a union, which ensures appropriate alignment for all its members, but in the library there's nothing to prevent me from asking for a chunk size of, say, 9 bytes, which would store pointers at misaligned addresses when creating the free list.
Admittedly, sometimes for good reason - it locks in the ABI.
The dynamic constructor API can allocate the structure exactly sized without the padding.
It’s a performance oriented custom memory allocator.
An alternative is to make the only member `char opaque[16]` (IIRC some locale-related thing in glibc does this, also I think libpng went through a transition of some sort related to this?), but calculating the size can be difficult outside of trivial cases since you can't use sizeof/offsetof.
- thread safety (not too hard to add on top of your linked list)
- variable-sized allocations, ideally something more akin to malloc; could be done wastefully by rounding up to the nearest chunk size
- (as mentioned in the article) variable chunk sizes
- zeroing of memory before handing it back to users
- double-free detection, or even better, full asan support
I think it would be improved further if it began with the calling interface callers use to invoke the allocator, because it's easier to understand the implementation of some service such as pool allocation when you begin by knowing what service, exactly, is required. I began with the assumption that the author was describing an arena allocator under another name, rather than a fixed-size block allocator, both because the APR's influential arena allocator is called "apr_pool_t" and because one of the main headings in the table of contents (a greatly appreciated feature, like all the typography of the article) was "Closing the pool". It took me quite a while to figure out that I was mistaken. It's plausible that I only had this problem because I am an unusually stupid, ignorant, or pigheaded person (some would say all three), but, if not, many other people will have the same difficulty I had of taking several minutes to figure out what the topic of the article is. (I could have saved myself by clicking either of the first two links in the Introduction, but I didn't because I thought I already knew what it was.)
The current top-voted comment on this thread is by someone who had the same reading comprehension problem I did, but never realized their error: https://news.ycombinator.com/item?id=42643951 and nobody had pointed out their mistake until I did just now. So I think we can say with some confidence that many HN readers will have the same difficulty.
I think that the article would also be improved by a few other small additions:
- It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.
- Probably when one benchmarks the implementation given, one will find that it is only slightly faster than jemalloc or even gnumalloc, because they both dispatch small allocations to a fixed-block allocator similar to this one, and the dispatch isn't very expensive. This can be fixed by inlining the allocation fast path and the deallocation function, which will then compile down to a few instructions each. A simple demonstration of how to do this is in my arena allocator kmregion in http://canonical.org/~kragen/sw/dev3/kmregion.h and http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The current implementation of 8dcc libpool is guaranteed to be unable to do this unless you're using link-time optimization or a so-called "unity" or "amalgamation" build process.
- It's worth distinguishing between average-case performance (in which this allocator is slightly better than malloc) and worst-case performance (in which case malloc, generally speaking, isn't even in the running).
- It may be worth mentioning that often such pool allocators are used in conjunction with initializing the objects in the pool to a valid state when the pool is created; that way you don't have to reinitialize objects to a valid state every time you deallocate and reallocate them, which is usually more work than malloc does to find the right free list. This does require not overwriting the user data with your freelist pointer, so it's a space/time tradeoff.
- Maybe you should mention alignment, especially now that even on amd64 GCC has taken to using new instructions that require alignment.
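On the pool-of-preinitialized-objects point above, a hedged sketch of the space/time tradeoff: keeping the free list as a separate index stack, instead of overwriting user data with a freelist pointer, lets object state survive free/alloc cycles (illustrative code, not the article's).

```c
#include <stddef.h>

enum { N = 8 };

typedef struct { int refcount; int data; } Obj;

static Obj    objs[N];
static size_t stack[N];     /* indices of free objects (the extra space) */
static size_t top;          /* number of free objects */

void pool2_init(void) {
    for (size_t i = 0; i < N; i++) {
        objs[i].refcount = 0;    /* objects start in a valid state */
        objs[i].data = 0;
        stack[top++] = i;
    }
}

Obj *pool2_alloc(void) {
    return top ? &objs[stack[--top]] : NULL;
}

void pool2_free(Obj *o) {
    /* o's contents are left intact: no freelist pointer scribbled on it */
    stack[top++] = (size_t)(o - objs);
}
```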
Even as it is, though, it's highly educational, visually pleasant, and very understandable (once I got over my initial misconception of having a totally wrong idea of what the article was about).
Yes, this is true. Furthermore, the diagram information for draw.io is included in the SVG images themselves, so one could edit them there.
> It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.
You are right, I was referring to "malloc-like" allocators (i.e. for variable-size blocks). It would be a good idea to benchmark them.
Given the depth of that rabbit hole, as writing advice, I might suggest instead motivating via "more space efficient" and refer to exactly fitting fixed-size objects where more general purpose allocators will (usually)¹ waste space to the next power of two. Measuring less dynamic space is also easier (very dynamic space like %cache used can be tricky, though). Controlling scope and pedagogy in general is also tricky, but I think this rhetorical advice is common practice and might also double as a way to address your initial misread of the topic. EDIT: he can always do refinements / more analysis in follow up posts/something if this library has any legs for him.
[1] https://github.com/c-blake/bu/blob/main/doc/tim.md
[2] https://github.com/c-blake/bu/blob/main/doc/edplot.md
¹ Sure, fancy any-size allocators can tack on a histogram of maybe somewhat rounded lg sizes (to economize histo space) and spawn fixed-size object sub-arenas if there are enough of them, say >= enough to fill a VMem page or 2, but this is rare in my experience and also complicates the dispatch to subarenas that you mention above to slightly fancier search. This kind of topic rapidly spirals into allocator-complete discussion and even systems-complete - e.g. a filesystem is largely an allocator with good old MS-DOS FAT being close to a "file is a pool of disk blocks" model.
Benchmarking is a full-fledged scientific experiment. You have a hypothesis that a particular change that you can make, such as using a different allocator, will have a particular effect, such as making your program run in less time. So you make just that one single change and measure the results, while carefully controlling all other conditions that could also affect your program's run time. You have to do the test more than once in order to find out whether you've succeeded in controlling them. If you do a good enough job controlling the conditions of the experiment, it becomes reproducible and therefore falsifiable. Programming usually isn't science, but benchmarking is. A great thing about it is that each run of the experiment can take a fraction of a second, so you can do the whole scientific method from beginning to end in a few minutes and get a reproducible, falsifiable result.
Benchmarking can get arbitrarily sophisticated, but the most important thing is to dodge the bullets that can turn benchmarks into nonsense, rather than produce very sophisticated graphs or get very high precision.
The simplest thing on Linux or macOS is to compile a command-line executable you can run in a terminal and run
time ./testprogram
at least three times to get an idea of how long the wallclock time is and how variable it is. (The user and system times shown are not very reliable anymore.) Presumably there is also some way to do the same thing on Microsoft Windows. If you're on a laptop, check to see if it makes a difference whether you're plugged in. Also, check whether the result from

time ./testprogram; time ./testprogram

is consistent; sometimes lag in CPU speed scaling can produce unpredictable results. Linux is not very consistent, so you should expect variations in the range of 1–10% from run to run when your program takes on the order of a second to run.
If you get consistently shorter wallclock times after recompiling the program with a different allocator and the same compiler options (probably make sure optimization isn't completely turned off), you can probably attribute that difference to the different allocator. It's a good idea to switch back to the original allocator and verify that you can reproduce your original results to make sure you didn't accidentally change something else at the same time (maybe your laptop went into low-power mode because the battery is getting low, or maybe you changed the number of iterations in a loop and forgot you changed it, or maybe Firefox loaded a YouTube video in the background and is making the machine swap, that kind of thing).
It's useful to ensure that you aren't swapping and that your CPU load other than the benchmark is minimal. `top` (or, better, `htop`) and `vmstat 5` (or, better, `dstat 5`) can be helpful here. Unless your program is heavily multithreaded, the CPU load is not as crucial as it used to be now that we all have multicore processors.
It's helpful to check the particular version of the program you're benchmarking into Git, including the Makefile or whatever specifies the compiler options. That way when you write down your results you can refer to the specific Git commit hash. Also, record your compiler version, your operating system version, and your CPU. Ideally someone who isn't you should be able to read your notes on the benchmark, run it again, and get the same results, if they have access to the same hardware.
The actual benchmark program itself can be as simple as allocating and deallocating in a loop, ideally inside another loop to run the whole benchmark multiple times; but on processors with cache (which includes every PC since the early 90s) the working set size can be very important, and you may get significantly different results for allocating 20 kilobytes 64 bytes at a time and 20 gigabytes 64 bytes at a time. If you pass in the number of iterations for one of the loops on the command line, instead of hardcoding it in the program code, it's easy to start with a small iteration count and work your way up a factor of 10 at a time until the program takes a few seconds to run. In this case, also be sure to record the specific command line you were measuring the performance of in your notes; knowing that a program took 2.205–2.216 seconds to run is of no value if you don't know if it was running a thousand iterations or a million.
It's a good idea to actually compute something that you output using the memory you're allocating and deallocating, just in case GCC's optimizer decides that your benchmark loop has no effect and therefore can be safely removed from your program. (In http://canonical.org/~kragen/sw/dev3/kmregion_example.c I computed the sum of the data value stored in each node of the 24th of all the 5000-node linked lists I allocated and deallocated. That's because, when I read the disassembly of my test code with `objdump -d kmregion_example`, I found out that GCC had removed most of the code inside the test loop because it could prove that nobody ever read the struct fields I was initializing. Reading disassembly is neither necessary nor sufficient for dodging all bullets that invalidate benchmarks.)
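A minimal harness along those lines (illustrative code; swap malloc/free for the allocator under test): it returns a checksum that depends on every allocation, so the optimizer cannot delete the loop, and reports elapsed CPU time through an out-parameter.

```c
#include <stdlib.h>
#include <time.h>

/* Allocate, touch, and free `iters` blocks; return the checksum and
 * store the elapsed CPU seconds in *secs. */
long bench(long iters, double *secs) {
    long sum = 0;
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++) {
        int *p = malloc(sizeof *p);   /* allocator under test goes here */
        if (!p) break;                /* out of memory: stop early */
        *p = (int)i;
        sum += *p;                    /* observable side effect */
        free(p);
    }
    *secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return sum;
}
```

Per the advice above, take the iteration count from the command line rather than hardcoding it, and print both the checksum and the time.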
Varying the number of iterations in both the innermost and outermost loop should produce roughly proportional increases and decreases in wallclock time. (A helpful technique is to subtract the time for 1000 iterations from the time for 1001000 iterations to get the time for the last million iterations.) If it doesn't, you're not measuring what you think you're measuring. For example, especially with JIT compilers, you may be measuring startup overhead. That bit me this week with http://canonical.org/~kragen/sw/dev3/hadamardsin.js, which I erroneously reported was slower to run than the LuaJIT version. It turns out that it's slightly faster to run, just much slower to compile.
Running the benchmark on more than one computer is helpful because sometimes changes that speed things up on one system slow them down on another. I remember a recursive backtracking search program for logic circuit optimization I wrote in C long ago which used longjmp() to terminate the search when it found a solution; it ran reasonably fast on Linux but very slowly on my friend's Macintosh because MacOS implemented setjmp() as sigsetjmp(), transforming it from a very cheap operation into a very expensive one.
That's all you need to know to reliably do optimizations. In fact, that's probably more than you need to know. Everything below this point is extra.
— ⁂ —
Allocators are especially tricky to benchmark because often, by changing where data is stored in memory, they profoundly affect the speed of the rest of your program, in ways that vary from program to program. More memory fragmentation, for example, can increase how many TLB misses your program runs into. I don't know what to do about that except try to benchmark some real programs as well as simple nested loops.
In cases where you're trying to measure an effect that's roughly the size of your measurement precision, statistical tests can be useful. https://github.com/c-blake/bu/blob/main/doc/edplot.md linked by cb321 above discusses using the 2-sample Kolmogorov-Smirnov test, which is not nearly as terrifying as it sounds. But most effects will be either much larger than your measurement error, in which case the result is obvious enough that you don't need the statistics, or much smaller, in which case all they'll tell you is that you can't be sure about the direction of the effect, much less its magnitude. In medicine, where your experiments take years and people die in them, using sophisticated statistics to make the most of your precious data is very important. In performance benchmarking, where your experiments take milliseconds, it's usually better to improve your measurement precision instead. Sometimes just running the same benchmark more times is enough, but there are more powerful techniques available.
A way to get higher precision than `time` can give you is to run the program under `valgrind` (on Linux). Even raw valgrind will tell you exactly how many instructions it ran, and in most cases that number is the same from run to run, so instead of a measurement with ±3% precision like `time` usually gives you when you've controlled everything properly, you get a measurement typically with ±0.00000001% precision. This is fantastic, far beyond most things you can do in a physics, electronics, or chemistry lab. It means that even with just two runs you can tell if your optimization affected that number by even 0.0000001%, or, more interestingly, 0.1%. And it isn't affected by things like which CPU you run it on or whether your laptop is unplugged, and it only slows the program down about 10×.
Unfortunately, it measures the wrong thing, because a program that runs more instructions can still be faster. `valgrind --tool=cachegrind` tries to simulate your CPU's cache hierarchy to give you a better idea of how long the program will actually take; the `kcachegrind` GUI can give you a "cycle estimation" number based on the numbers it spits out. But it can still be unrelated to how fast the program actually runs for a variety of reasons; system calls are the biggest one. `kcachegrind` is also a "profiler": it shows you how much time (simulated instructions, simulated cache misses, cycle estimation, whatever) is used by each subroutine in your program.
You might wonder why an optimization that speeds your program up by 0.1% matters. The answer is that if you do 106 such optimizations you have sped your program up by 10%. SQLite has improved its performance by more than a factor of 3 over a 10-year period by using this approach: https://www.sqlite.org/cpu.html
There's a more basic profiler called gprof built into GCC; building your program with `-pg` will add profiling instrumentation in to it, so that it writes profiling information into a file called `gmon.out`. After running a program built with gprof, you can run `gprof testprogram` to analyze `gmon.out`; it will tell you how much time each subroutine took (including and excluding its transitive callees) and how many times it was called.
Linux also has a built-in profiler called `perf_events` https://perfwiki.github.io/main/ which uses hardware performance counters to get numbers like cachegrind's, but using the real hardware instead of simulation, and including a lot more detail; it can also profile the time your program spends in system calls. I've only used it in very basic ways, so I don't know very much about it. Intel also has a proprietary thing called VTune which can do things perf can't and runs on OSes that aren't Linux, but doesn't support AMD's processors.
— ⁂ —
I hope this is of some value to you! I'd be happy to answer any other questions. Hopefully correctly.
> The actual benchmark program itself can be as simple as allocating and deallocating in a loop [...]
What do you mean, exactly? Deallocating immediately after allocating? I feel like there would be a big difference between that and allocating multiple blocks before freeing them.
While that is true (and I've said it myself), one reason I like the design of a little "workload interpreter shell" as in https://github.com/c-blake/bst/blob/main/test/bst-interact.c (or maybe better https://github.com/c-blake/adix/blob/master/tests/btshell.ni... which has a preprocessor) is that you can use them to run "isolated tests" as a kind of "half step" between a pure hot loop dumb thing and a full-on application with all its attendant noise and self-interaction/self-disturbance.
So, for example, in this allocation problem setting you could write a similar 30-line "shell/interpreter" that interprets a binary stream of "commands", i.e. allocation requests. The origin of said binary stream could be a recording/log from some actual app you have doing something actually useful. Then you could replay that one recording in the "shell" under different configurations, as @kragen mentions, to study the allocator performance (both space & speed) in isolation. At that point you could start to say pretty meaningful things, from replaying logs many times, about this or that tradeoff on this or that workload.
EDIT: Of course, at that point you start to get into debates about "how representative" different workloads are, but at least you could potentially share your log with someone who had a different computer and have them run benchmarks, too, and at least the workload is not synthetic but relates to some actual possible program. Moving from synthetic benchmarks to log-based ones was a big thing in OS research on filesystems in the 1970s & 80s and it helped the "generalizability of results" (I think, anyway..I'm sure many still do only synthetic benchmarking due to its convenience).
EDIT2: I just thought of a cute name for the log replay half-step between synthetic microbenchmarks and a full application - "The millibenchmark". ;-) Maybe someone else has thought of this before, though.
Recording traces of allocation requests from actual applications for allocator-benchmarking purposes revolutionized allocator performance research in the late 90s, if I recall correctly, in addition to the earlier filesystem research you mentioned. Benchmarks based on those traces were the bedrock of Zorn's case that generational GCs were faster in practice than malloc().
But I think (though possibly this is motivated reasoning) that benchmarking some workload is good enough for many optimizations. A fixed-size-block allocator is already pretty restricted in what workloads you can run on it, after all, and its performance characteristics are a lot simpler than something like first-fit. Showing that an optimization makes some workload faster is infinitely better than not knowing if it makes any workload faster. Showing that it generalizes across many workloads is better, of course, but it's not nearly the same step in utility.
You could probably set up a https://github.com/c-blake/nio file for it (or something similar) if you wanted to be systematic. A problem with the "text-newline" part of the "unix philosophy" is that it incurs a lot of unneeded binary->ascii->binary cycles, which are actually more programming work as well as more CPU work, and it's really fairly orthogonal to the rest of the Unix ideas.
EDIT reply to parent EDIT
> benchmarking some workload is good enough
Fair enough, except that "good enough" is sort of "famous last words" if anyone is reviewing your work. :-) :-) Some will complain you only did Intel not AMD or not ARM or not xyz or Apple Silicon or whatever other zillion variables. Something is always better than nothing, though. It should be thought of more like a Bayesian thing - "no surprise for untested situations" or maybe "confidence in proportion to evidence/experience".
FWIW, I thought your big advice to 8dcc was pretty decent.
Like, yeah you have to do that, but I thought the whole point of writing a custom allocator was to not use the system allocator beyond statically allocating a huge block of memory upfront and then managing that yourself, is it not?
That's the point of using GPL
For the `LinkedPtr` list, the bad case is only hit during the destruction of the pool, and then only if you allocate a lot of memory. And given the overhead of deallocation I'm not sure the array approach would measurably help.
I don't see anywhere a binary tree search would be useful here, since no loops are used for lookup (on allocation, chunks are pushed in order, but when freed they are pushed back in arbitrary order; this also means there is no double-free protection).
More sophisticated strategies bucket allocations by size; this has fixed overhead. You can also use balanced trees for more memory efficiency, but that is slower.
For small allocations (e.g. 8 bytes) that are too small to hold a free-list pointer, allocators will allocate a larger block and track which slots are in use with bitsets.
The initial linking together of all free space (also cache friendly, btw) is often called "threading the free list", although this terminology is less standard than "free list" itself.
The only thing that I can imagine being faster is a bump allocator, but that precludes free.