If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better. E.g., a human searching a physical paper dictionary can zoom in on the right bunch of pages faster than pure idealized binary search; it's a separate matter that it's hard for humans to continue binary search till the very end, and we might default to scanning linearly for the last few iterations (cognitive convenience / affordances of human wetware / etc.).
In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm. Often, we could very well construct a suitable cost function and use gradient descent or its accelerated cousins.
More generally, the best bet for solving a problem more efficiently is always to use more information about the specific problem you want to solve, instead of pulling up the solution for an overly abstract representation. That can offer scalable, orders-of-magnitude speedups, compared to the constant-factor speedups you get from just using hardware better.
https://github.com/protocolbuffers/protobuf/blob/44025909eb7...
1. Check for dense list O(1)
2. Check upper bound
3. Constant trip count binary search
The constant trip count is great for the branch predictor, and the core loop is pretty tightly optimized for the target hardware, avoiding multiplies. Every attempt to get more clever made the loop worse and did not pay for itself. It's hard because it's an array-of-structs format with a size of 12, and mostly pretty small N.
I know people train themselves into grokking this and reading and emitting this way, but it sounds like writing "bork bork bork bork" runes to me.
I'm glad Rust feels more like Ruby and Python and that method and field names are legible.
My eyes just glaze over:
UPB_API_INLINE
const struct upb_MiniTableField* upb_MiniTable_FindFieldByNumber(
const struct upb_MiniTable* m, uint32_t number) {
const uint32_t i = number - 1; // 0 wraps to UINT32_MAX
// Ideal case: index into dense fields
if (i < m->UPB_PRIVATE(dense_below)) {
UPB_ASSERT(m->UPB_ONLYBITS(fields)[i].UPB_ONLYBITS(number) == number);
return &m->UPB_ONLYBITS(fields)[i];
}
// Early exit if the field number is out of range.
uint32_t hi = m->UPB_ONLYBITS(field_count);
uint32_t lo = m->UPB_PRIVATE(dense_below);
UPB_ASSERT(hi >= lo);
uint32_t search_len = hi - lo;
if (search_len == 0 ||
number > m->UPB_ONLYBITS(fields)[hi - 1].UPB_ONLYBITS(number)) {
return NULL;
}
// Slow case: binary search
const struct upb_MiniTableField* candidate;
#ifndef NDEBUG
candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
m, lo, search_len, number);
UPB_ASSERT(candidate ==
UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number));
#elif UPB_ARM64_ASM
candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
m, lo, search_len, number);
#else
candidate = UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number);
#endif
return candidate->UPB_ONLYBITS(number) == number ? candidate : NULL;
}

Honestly I don't see much difference between
upb_MiniTable_FindFieldByNumber
and upb::MiniTable::FindFieldByNumber

npm -g i package-name
Like why would you teach people to do this? I understand people needed to save precious bytes in the sixties so we have cat and ls but saving 192 bytes or whatever with shorter variable names is not a worthwhile tradeoff anymore.

I did not bookmark it and about twice a year I go searching for it again. Some say he's still searching to this day.
That's why b-trees are the standard in databases. The data could be anything, and its characteristics could massively change at any time, as you suddenly import a whole bunch of new rows at once.
And while you can certainly design algorithms around e.g. gradient descent to try to accelerate lookup, b-trees are already incredibly fast, and have lots of other benefits like predictable worst-case performance and I/O requirements, supporting range scans, ordered traversal, prefix conditions, etc.
So yes, you can certainly design lookup algorithms that are more efficient for particular data distributions, but they will also often lack other important properties. And b-trees are already so fast, improvements are often negligible -- like even if another algorithm produces a closer initial guess, it may be slower to locate the final item, or it may be faster on average but have horrible worst-case performance that makes it unusable.
Even with a paper dictionary, I've always used pretty much a binary search beyond the first initial guess, which only saves you a couple of hops. And actually, once I get to the right handful of pages I'm probably more linear than I should be, and I'd probably be faster if I tried to do a rigorous binary search, but I have to balance that with how long it takes to flip pages.
It is both obvious and profound, the more information you already have, the more information you already have.
[1]: https://dl.acm.org/doi/epdf/10.1145/1411203.1411220
Also, don't you learn something about the data while performing the binary search? Like, if you are constantly below the estimate, you could guess that the distribution is biased toward large values and adjust your guess based on this prediction.
Yes, absolutely!
I forgot to share this general perspective above, and it's too late to edit, so I'll add it here...
Since binary search assumes only monotonicity, splitting your interval into two equal parts extracts one bit of information per step, and any other choice would extract less information on average. One bit of information per step is how you end up needing log(n) steps to find the answer.
To accelerate your search, you basically need to extract those log(n) bits as fast as you can. You can think of that as leveraging both the prior, and everything you learn along the way -- to adaptively design each step to be the optimal experiment to extract maximum amount of information. And adaptive local models of your search space (gradient / hessian / etc) allow you to extract many more bits of information from each query / experiment, provided the function you are inverting has some local structure.
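In symbols (my restatement of the same point, not from the thread): identifying one position among $n$ candidates requires about $\log_2 n$ bits, and each comparison yields at most one bit, so

$$\text{steps} \;\ge\; \lceil \log_2 n \rceil,$$

and the even split is the only query whose expected yield is the full bit under a uniform prior.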
PS: That is why we leverage these ideas to "search" for the optimum, among a space of solutions.
If we would guess that there is a bias in the distribution based on recently seen elements, the guess is at least as likely to be wrong as it is to be right. And if we guess incorrectly, in the worst case, the algorithm degrades to a linear scan.
Unless we have prior knowledge. For example: if there is a particular distribution, or if we know we're dealing with integers without any repetition (i.e. each element is strictly greater than the previous one), etc.
This is true for abstract and random data. I don't think it's true for real world data.
For example, python's sort function "knows nothing" about the data you're passing in. But, it does look for some shortcuts and these end up saving time, on average.
You have another piece of information: you don't only know whether the element was before or after the compared element. You also know the delta between what you looked at and what you're looking for. And you also have the delta from the previous item you looked at.
Actually deciding what to do with that information without incurring a bunch more cache misses in the process may be tricky.
Depending on the problem space such assumptions can prove right enough to be worth using despite sometimes being wrong. Of course if you've got the compute to throw at it (and the problem is large) take the Contact approach: why do one when you can do two in parallel for twice the price (cycles)?
Thinking about it--yeah, if you can anticipate anything like a random distribution it's a few extra instructions to reduce the number of values looked up. In the old days that would have been very unlikely to be a good deal, but with so many algorithms dominated by the cache (I've seen more than one case where a clearly less efficient algorithm that reduced memory reads turned out better) I suspect there's a lot of such things that don't go the way we learned them in the stone age.
You can do better if the list is stable by reusing information.
But gathering that information during searches is going to require great complexity to leverage, as searches are an irregular information gathering scheme.
So create RAM for speedup optimizations up front.
1) Create a table that maps the first 8 bits to upper and lower indexes in the list. Then binary search over the last 8 bits. That reduces the search time in half.
2) Go all the way, and create an array of 32,768 indexes, with all 1's for misses. Either way, search returns O(1).
Stable lists allow for sliding parametric trade offs between RAM-lookup vs. binary search. From full lookup, to full binary.
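A sketch of option 1 for 16-bit values (layout and names are my own, not the parent's): precompute, for each possible high byte, the range of positions holding it, then binary search only within that narrowed range. Option 2 is the same idea taken to the limit: index by the whole value and skip the search entirely.

#include <stdint.h>

struct byte_index {
  uint32_t lo[257];  /* lo[b]..lo[b+1] is the range of values whose high byte is b */
};

static void build(struct byte_index *ix, const uint16_t *a, uint32_t n) {
  uint32_t i = 0;
  for (int b = 0; b <= 256; b++) {
    while (i < n && (a[i] >> 8) < (uint32_t)b) i++;
    ix->lo[b] = i;  /* first position whose high byte is >= b */
  }
}

static int contains(const struct byte_index *ix, const uint16_t *a, uint16_t x) {
  uint32_t lo = ix->lo[x >> 8], hi = ix->lo[(x >> 8) + 1];  /* narrowed range */
  while (lo < hi) {  /* plain binary search over at most 256 elements */
    uint32_t mid = lo + (hi - lo) / 2;
    if (a[mid] < x) lo = mid + 1;
    else if (a[mid] > x) hi = mid;
    else return 1;
  }
  return 0;
}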
Never thought about it this way. Brilliant!
IOW your prior on the data distribution lets you skip the first 4-5 binary chops.
Do you mean using a better estimator for the median value? Or something else?
You don't even need priors. See interpolation search, where knowing the position and value of two elements in a sorted list already allows the search to make an educated guess about where the element it's searching for is, by interpolating between those elements.
That's a prior about the distribution, if a relatively weak one (in some sense, at least).
It's an interpolation search. You interpolate the values you evaluated by whatever method you'd like. No one forces you to do linear interpolation. You can very easily fit a quadratic polynomial with the last 3 points, for example.
Interpolation search seems to have a convergence rate of log log n. That's pretty efficient.
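For concreteness, a minimal interpolation search (my own illustration, not from the thread): each probe lands where the key would sit if the values were evenly spread between the current endpoints.

#include <stddef.h>
#include <stdint.h>

static long interpolation_search(const uint32_t *a, size_t n, uint32_t key) {
  if (n == 0) return -1;
  size_t lo = 0, hi = n - 1;
  while (lo <= hi && key >= a[lo] && key <= a[hi]) {
    size_t mid = lo;
    if (a[hi] != a[lo])  /* guess the position by linear interpolation */
      mid = lo + (size_t)(((double)key - a[lo]) / ((double)a[hi] - a[lo]) * (hi - lo));
    if (a[mid] == key) return (long)mid;
    if (a[mid] < key) lo = mid + 1;
    else hi = mid - 1;  /* mid > lo in this branch, so no underflow */
  }
  return -1;  /* not present */
}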
Essentially, this means that all loops are already unrolled from the processor's point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is grabbing data from memory (or from cache in the "warm cache" examples), this means that the real game is how to get the processor to issue the requests for the data you will eventually need as far in advance as possible, so you don't have to wait as long for it to arrive.
The difference for quad search (or anything higher than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction), you prefetch all the possible cases but with less depth. This way you are guaranteed to have successfully issued the prefetch you will eventually need, and are spending slightly less of your bandwidth budget on data that will never be used in the actual execution path.
As others are pointing out, "number of comparisons" is an almost useless metric when comparing search algorithms if your goal is predicting real-world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.
For instance, if you have 8 elements, 01234567, and you're looking for 1, with binary, you'd fetch 4, 2, and then 1. With quaternary, you'd fetch 2, 4, 6, then 1. Obviously, if you only have 8 elements, you'd just delegate to the SIMD instruction, but if this was a much larger array, you'd be doing more work.
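One way to act on the prefetching idea above (a sketch of the general idea, not the article's code; __builtin_prefetch is the GCC/Clang builtin): issue prefetches for the midpoints of both possible next halves before the branch resolves, so whichever way the comparison goes, the load you end up needing is already in flight.

#include <stddef.h>
#include <stdint.h>

static size_t prefetching_search(const uint32_t *a, size_t n, uint32_t key) {
  size_t lo = 0, len = n;  /* candidate range is [lo, lo+len) */
  while (len > 1) {
    size_t half = len / 2;
    /* prefetch the midpoints of both halves we might descend into next */
    __builtin_prefetch(&a[lo + half / 2]);
    __builtin_prefetch(&a[lo + half + (len - half) / 2]);
    if (a[lo + half] <= key) { lo += half; len -= half; }
    else                     { len = half; }
  }
  return lo;  /* caller checks whether a[lo] == key */
}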
I guess on a modern processor, eliminating the data dependency is worth it because the processor's branch prediction and speculation only follows effectively a single path.
Would be interesting to see this at a machine cycle level on a real processor to understand exactly what is happening.
Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.
[1] https://lalitm.com/post/exponential-search/ [2] https://en.wikipedia.org/wiki/Exponential_search
HEAD /users/1 -> 200 OK
HEAD /users/2 -> 200 OK
HEAD /users/4 -> 200 OK
...
HEAD /users/2048 -> 200 OK
HEAD /users/4096 -> 404 Not Found
And then a binary search between 2048 and 4096 to find the most recent user (and incidentally, the number of users). Great info to have if you're researching competing SaaS companies.

For 4 byte keys and 4 byte child pointers (or indexes into an array) your inner nodes would have 7 keys, 8 child pointers and 1 next pointer, completely filling a 64 byte cache line, and your tree depth for 1 million entries would go down from ~20 to ~7, the top few levels of which are likely to remain cache resident.
With some thought, it's possible to use SIMD on B-tree nodes to speed up the search within the node, but it's all very data dependent.
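As a sketch of that node layout (my own struct, sized per the numbers above, not taken from any particular database):

#include <stdint.h>

struct inner_node {
  uint32_t keys[7];      /* separator keys, sorted */
  uint32_t children[8];  /* indexes of child nodes in a node array */
  uint32_t next;         /* index of the right sibling */
};
/* 16 * 4 bytes = 64 bytes, i.e. exactly one cache line */
_Static_assert(sizeof(struct inner_node) == 64, "one cache line per node");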
It seems to me like there should be a sort order that stores the items as a fully-dense, left-shifted binary tree from top to bottom (e.g. like the implicit heap in an in-place heap sort, but a binary search tree instead of a heap). Is there a name for this? Does it show any performance wins in practice?
See also https://arxiv.org/abs/1509.05053
In general I love this article, it took what I’ve often wondered about and did a perfect job exploring with useful ablation studies.
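The layout being asked about is usually called the Eytzinger (BFS) layout, and it is one of the layouts the linked paper studies. A minimal lower-bound sketch over a 1-indexed Eytzinger array (my own illustration, using the well-known bit trick to recover the answer):

#include <stddef.h>
#include <stdint.h>

/* a[1..n] holds the data in Eytzinger (BFS) order: node k has children 2k and
 * 2k+1; a[0] is unused. Returns the Eytzinger index of the first element >= key,
 * or 0 if every element is smaller. */
static size_t eytzinger_lower_bound(const uint32_t *a, size_t n, uint32_t key) {
  size_t k = 1;
  while (k <= n)
    k = 2 * k + (a[k] < key);        /* go left if a[k] >= key, else right */
  k >>= __builtin_ctzll(~k) + 1;     /* undo the trailing right turns */
  return k;
}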
The SIMD part is just in the last step, where it uses SIMD to search the last 16 elements.
The Quad part is that it checks 3 points to create 4 paths, but also it's searching for the right block, not just the right key.
The details are a bit interesting. The author chooses to use the last element in each block for the quad search. I'm curious how the algorithm would change if you used the first element in each block instead, or even an arbitrary element.
The outline would be:
a) use a gather to grab multiple elements from 16 evenly spaced locations
b) compare these in parallel using a SIMD instruction
c) focus in on the correct block
d) if the block is small revert to linear search, else repeat the gather/compare cycle
Even though the gather instruction is reading from non-contiguous memory and reading more than you normally would need to with a binary search, enabling a multiway compare and collapsing 4 levels of binary search should be a win on large tables.
You also may not be able to do a full 16-way comparison for all data types. Searching for float64 will limit you to 8 way comparisons (with AVX-512) but int32, float32 will allow 16 way comparisons.
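A sketch of that outline under AVX-512F (my own untested illustration, not the article's code): each round gathers 16 evenly spaced probes, compares them all against the key at once, and keeps only the block between the last probe below the key and the first probe at or above it.

#include <immintrin.h>
#include <stdint.h>

static int64_t lower_bound_16way(const int32_t *a, int64_t n, int32_t key) {
  int64_t lo = 0, len = n;
  while (len > 64) {
    int64_t step = len / 17;  /* 16 probes cut [lo, lo+len) into 17 parts */
    int32_t idx[16];
    for (int i = 0; i < 16; i++) idx[i] = (int32_t)(lo + (int64_t)(i + 1) * step);
    __m512i vidx = _mm512_loadu_si512((const void *)idx);
    __m512i probes = _mm512_i32gather_epi32(vidx, a, 4);  /* probes[i] = a[idx[i]] */
    __mmask16 lt = _mm512_cmplt_epi32_mask(probes, _mm512_set1_epi32(key));
    int k = _mm_popcnt_u32(lt);  /* probes < key form a prefix, so popcount = k */
    lo += (int64_t)k * step;
    len = (k < 16) ? step + 1 : len - 16 * step;  /* remaining candidate range */
  }
  while (len > 0 && a[lo] < key) { lo++; len--; }  /* small tail: linear scan */
  return lo;  /* index of the first element >= key (may be n) */
}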
I bet you a binary search is in fact faster than any gather based methodology.
As soon as you move away from either (or both) of these assumptions then there are likely to be many tweaks you can make to get better performance.
What the classical algorithms do offer is a very good starting point for developing a more optimal/efficient solution once you know more about the specific shape of data or quirks/features of a specific CPU.
When you start to get at the pointy end of optimising things then you generally end up looking at how the data is stored and accessed in memory, and whether any changes you can make to improve this don't hurt things further down the line. In a job many many years ago I remember someone who spent way too long optimising a specific part of some code only to find that the overall application ran slower as the optimisations meant that a lot more information needed later on had been evicted from the cache.
(This is probably just another way of stating Rob Pike's 5th rule of programming which was itself a restatement of something by Fred Brooks in _The Mythical Man Month_. Ref: https://www.cs.unc.edu/~stotts/COMP590-059-f24/robsrules.htm...)
So instead of just comparing the middle value, we'd compare the one at the 1/3 point, and if that turns out to be too low then we compare the value at the 2/3 point.
Unfortunately although we cut the search space to 2/3 of what it was for binary search at each step (1/3 vs 1/2), we do 3/2 as many comparisons at each step (one comparison 50% of the time, two comparisons the other 50%), so it averages out to equivalence.
EDIT: See zamadatix reply, it's actually 5/3 as many comparisons because 2/3 of the time you have to do 2.
- First third: 1 comparison
- Second third: 2 comparisons
- Third third: 2 comparisons
(1+2+2)/3 = 5/3 average comparisons. I think the gap starts here at assuming it's 50% of the time because it feels like "either you do 1 comparison or 2" but it's really 33% of the time because "there is 1/3 chance it's in the 1st comparison and 2/3 chance it'll be 2 comparisons".
This lets us show ternary is worse in total average comparisons, just barely: 5/3*Log_3[n] = 1.052... * Log_2[n].
In other words, you end up with fewer levels but doing more comparisons (on average) to get to the end. This is true for all searches of this type (w/ a few general assumptions like the values being searched for are evenly distributed and the cost of the operations is idealized - which is where the main article comes in) where the number of splits is > 2.
Not for the ternary version of the binary search algorithm, because what you had is just a skewed binary search, not an actual ternary search. Because comparisons are binary by nature, any search algorithm involving comparisons is a type of binary search, and any choice other than the middle element is less efficient in terms of algorithmic complexity, though in some conditions, it may be better on real hardware. For an actual ternary search, you need a 3-way comparison as an elementary operation.
Where it gets interesting is when you consider "radix efficiency" [1], for which the best choice is 3, the natural number closest to e. And it is relevant to tree search, that is, a ternary tree may be better than a binary tree.
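For reference, the "radix economy" calculation behind that (a standard result, my sketch): an $r$-ary split needs about $\log_r n$ levels, and if a full $r$-way distinction costs $r$ units per level, the total cost is

$$C(r) \;=\; r \log_r n \;=\; \frac{r}{\ln r}\,\ln n,$$

which is minimized over real $r$ at $r = e$; among integers, $3/\ln 3 \approx 2.73$ beats $2/\ln 2 \approx 2.89$, which is why 3 comes out best.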
True, but is there some particular reason that you want to minimize the number of comparisons rather than have a faster run time? Daniel doesn't overly emphasize it, but as he mentions in the article: "The net result might generate a few more instructions but the number of instructions is likely not the limiting factor."
The main thing this article shows is that (at least sometimes, on some processors) a quad search is faster than a binary search _despite_ the fact that it performs theoretically unnecessary comparisons. While some computer scientists might scoff, I'd bet heavily that an optimized ternary search could also frequently outperform.
Obviously real-world performance depends on other things as well.
All other people live in the real world, and care about real-world performance, and modern computer scientists know that.
And some of the algorithms, as described, still end up doing pairwise comparisons in all-but-optimal cases.
(Bucket sort requires items that end up in the same bucket to be sorted. This doesn't happen automatically via the algorithm as stated. Radix sort requires the items at each "level" to be sorted. Neither algorithm specifies how this should be done without pairwise comparisons.)
Counting Sort does work without pairwise comparisons, but is only efficient for small ranges of values, and if that's the case then it's obvious you don't need to apply a traditional sort if the number of elements greatly outnumbers the number of possible values.
Also, the algorithms still require some form of comparisons, just not pairwise comparisons.
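For concreteness, counting sort in its simplest form (my own sketch): the work is tallying values rather than comparing pairs, which is exactly why it only pays off when the range of possible values is small.

#include <stddef.h>
#include <stdint.h>

static void counting_sort_u8(uint8_t *a, size_t n) {
  size_t count[256] = {0};
  for (size_t i = 0; i < n; i++) count[a[i]]++;  /* tally each value */
  size_t out = 0;
  for (int v = 0; v < 256; v++)  /* emit each value as many times as it was counted */
    for (size_t c = 0; c < count[v]; c++) a[out++] = (uint8_t)v;
}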
> All other people live in the real world, and care about real-world performance, and modern computer scientists know that.
Yes, completely agree with that, but traditional "Comp Sci" is built on small building blocks of counting "comparisons" or "memory accesses". It's not designed to analyse prospective performance given modern processors with L1/L2/L3 caches, branch prediction, SIMD instructions, etc.
For years--maybe still?--analyzing its running time was a staple of the first or second problem set in a college-level "Introduction to Algorithms" course.
The pivot in quicksort isn't located anywhere fixed. You can do quicksort by always choosing the pivot to be the first element in the (sub)list before you do the swapping. If it happens that the pivot you choose always belongs 1/3 of the way through the sorted sublist... that won't even affect the asymptotic running time; it will still be O(n log n).
Stoogesort is nothing like that. It's worse than quadratic.
In quicksort, though, once you've done your thing with the pivot, you break the list into two non-overlapping sublists. If it happened that the pivot broke the list into 1/3 and 2/3, you'd then recur with one sublist of length n/3 and one of length 2n/3.
Stoogesort has a different recursion pattern: you recur on the first 2/3, then you recur on the second 2/3, then you recur on the first 2/3 again.
The processing step isn't similar to quicksort either - in quicksort, you get a sublist, choose an arbitrary pivot, and then process the list such that every element of the list is correctly positioned relative to the pivot, which takes O(n) time. In stoogesort, you process the list by swapping the beginning with the end if those two elements are out of order, which takes O(1) time.
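For reference, stooge sort itself (the standard formulation), which makes the overlapping two-thirds recursion explicit:

static void stooge_sort(int *a, int i, int j) {  /* sorts a[i..j] inclusive */
  if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
  if (j - i + 1 > 2) {
    int t = (j - i + 1) / 3;
    stooge_sort(a, i, j - t);  /* first 2/3 */
    stooge_sort(a, i + t, j);  /* last 2/3 */
    stooge_sort(a, i, j - t);  /* first 2/3 again */
  }
}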
a: [1,3,5,7,8,9,10,15]
x: 8 (query value)
For this array, we would compare a[0], a[3], a[7] (left/mid/right) by subtracting x=8, and we would get d=[-7, -1, 7].
Now, normally, with binary search, because 8 > mid, we would go to (mid+right)/2, BUT we already have some extra information: we see that x is closer to a[3] (diff of 1) than a[7] (diff of 7), so instead of going to the middle between mid and right, we could choose a new "mid" point that's closer to the desired value (maybe as a ratio of d[right]-d[mid]).
so left=mid+1, right stays the same, but new mid is NOT half of left and right, it is (left+right)/2 + ratioOffset
Where ratioOffset makes the mid go closer to left or right, depending on d.
The idea is quite obvious, so I am pretty sure it already exists.
But what if we use SIMD with it? So we know not only which block the number is in, but also which part of the block the number is likely in. Or is this what the article actually says?
Weird that I didn't hear about it before; is it not used that much in practice?
One reason I could see is that binary search is fast enough and easy to implement. Even on largest datasets it's still just a few tens of loop iterations.
>> Instead, the optimal strategy for the player who trails is to make certain bold plays in an attempt catch up.
The reason that's optimal, if you're losing, is that you assume that your opponent, who isn't losing, is going to use binary search. They're going to use binary search because it's the optimal way to find the secret.
Since you're behind, if you also use binary search, both players will progress toward the goal at the same rate, and you'll lose.
Trying to get lucky means that you intentionally play badly in order to get more victories. You're redistributing guesses taken between games in a negative-sum manner - you take more total guesses (because your search strategy is inferior to binary search), but they are unevenly distributed across your games, and in the relatively few games where you perform well above expectation, you can score a victory.
However, in a two player setting, using the strategies presented in the paper, you will beat an adversary that uses binary search in more than 50% of the games played.
Here's another visual demonstration: https://www.youtube.com/watch?v=zmvn4dnq82U
> in a two player setting, using the strategies presented in the paper, you will beat an adversary that uses binary search in more than 50% of the games played.
This is technically true. But 50 percentage points of your "more than 50%" of games played are games where you exclusively use binary search. For the remainder, you're redistributing luck around between potential games in a way that is negative-sum, exactly like I just said.
Although I think I get your point, saying 'You can't beat binary search in Guess Who' is misleading, considering you would yourself probably describe the optimal strategy as 'play binary search when ahead; when behind, don't'.
> Trying to get lucky means that you intentionally play badly in order to get more victories
That's quite an uncommon definition of good and bad.
It's absolutely bizarre. Images communicate meaning. Much better to have no image than to have an image that is completely misleading about the target audience or level of technical sophistication.
That is, if you're only ever going to test for membership in the set. If you need metadata then ... You could store that in a packed array and use a population count of the bit-vector before the lookup bit as index into it. For each word of bits, store the accumulated population count of the words before it to speed up lookup. Modern CPU's are memory-bound so I don't think SIMD would help much over using 64-bit words. For 4096 bits / 64, that would be 64 additional bytes.
Obviously, this isn't changing the big-Oh complexity, but in the "real world", still nice to see a 2-4x speedup.
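A sketch of that layout (types and names are my own assumptions; I used 16-bit running counts since they can reach 4096): membership is one bit test, and the rank needed to index the packed metadata array is one table read plus one popcount.

#include <stdint.h>

struct bitset_rank {
  uint64_t bits[64];    /* 4096 bits of membership */
  uint16_t before[64];  /* number of set bits in all preceding words */
};

static int lookup(const struct bitset_rank *s, unsigned pos) {  /* pos < 4096 */
  uint64_t word = s->bits[pos >> 6];
  uint64_t bit = 1ull << (pos & 63);
  if (!(word & bit)) return -1;  /* not a member */
  /* rank of pos = index into the packed metadata array */
  return s->before[pos >> 6] + __builtin_popcountll(word & (bit - 1));
}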
- https://curiouscoding.nl/slides/p99-text/
- https://curiouscoding.nl/posts/static-search-tree/
- https://news.ycombinator.com/item?id=42562847 (656 points; 232 comments)
So not exactly "n" as in O(n).
Also: only for 16-bit integers.
> So not exactly "n" as in O(n).
For large enough inputs the algorithm with better Big O complexity will eventually win (at least in the worst cases). Yes, sometimes it never happens in practice when the constants are too large. But say 100 * n * log(n) will eventually beat 5 * n^2 for large enough n. Some advanced algorithms use algorithms with worse Big O complexity but smaller constants for small enough sub-problems to improve performance. But that's more of an optimization detail than a completely different algorithm.
> This algorithm just uses modern and current processor architecture artifacts to "improve" it on arrays of up to 4096
Yes, that's my point. It's basically "I made binary search for integers X times faster on some specific CPUs". "Beating binary search" is somewhat misleading, it's more like "microptimizing binary search".
I think the title is not misleading since the Big O notation is only supposed to give a rough estimate of the performance of an algorithm.
(I agree though that binary search is already extremely fast, so making something twice as fast won't move the needle for the vast majority of applications where the speed bottleneck is elsewhere. Even infinite speed, i.e. instant sorted search, would likely not be noticeable for most software.)
Yes, algorithmic complexity is theoretical, it often ignores the real world constants, but they are usually useful when comparing algorithms for larger inputs, unless we are talking about "galactic algorithms" with insanely large constants.
Would there be any value in using simd to check the whole cache line that you fetch for exact matches on the narrowing phase for an early out?
> The popular Roaring Bitmap format uses arrays of 16-bit integers of size ranging from 1 to 4096. We sometimes have to check whether a value is present.
There's no claim that this algorithm is universal and performs equally well for other problems.
In fact, note how the compare operation on the data types involved (16-bit integers) is quite cheap for modern CPUs. A similar problem with strings instead of integers would get no benefit from the author's ideas and would actually fare worse, because the extra comparisons per step would be wasted work.
Often when an interpolated search is wrong the interpolation will tend to nail you against one side or the other of the range-- so the worst case is linear. By allowing only a finite number of failed probes (meaning they only move the same boundary as last time, an optimally working search will on average alternate hi/lo) you can maintain the log guarantee of bisection.
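A sketch of that guard (the details here are my own): interpolate as usual, but after two consecutive probes that moved the same boundary, take one plain bisection step, which preserves the O(log n) worst case.

#include <stddef.h>
#include <stdint.h>

static size_t guarded_interpolation(const uint32_t *a, size_t n, uint32_t key) {
  if (n == 0) return 0;
  size_t lo = 0, hi = n;           /* candidate range [lo, hi) */
  int lo_moves = 0, hi_moves = 0;  /* consecutive moves of each boundary */
  while (hi - lo > 1) {
    size_t mid;
    if (lo_moves >= 2 || hi_moves >= 2) {
      mid = lo + (hi - lo) / 2;    /* bisection step after repeated one-sided probes */
      lo_moves = hi_moves = 0;
    } else if (a[hi - 1] > a[lo]) {
      double f = ((double)key - a[lo]) / ((double)a[hi - 1] - a[lo]);
      if (f < 0.0) f = 0.0;
      if (f > 1.0) f = 1.0;
      mid = lo + 1 + (size_t)(f * (double)(hi - lo - 2));  /* stays in [lo+1, hi-1] */
    } else {
      mid = lo + (hi - lo) / 2;    /* flat range: just bisect */
    }
    if (a[mid] <= key) { lo = mid; lo_moves++; hi_moves = 0; }
    else               { hi = mid; hi_moves++; lo_moves = 0; }
  }
  return lo;  /* caller verifies a[lo] == key */
}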
[...]
The binary search checks one value at a time. However, recent processors can load and check more than one value at once. They have excellent memory-level parallelism. This suggests that instead of a binary search, we might want to try a quaternary search..."
First of all, brilliant observations! (Overall, a great article too!)
Yes, today's processors indeed have a parallelism which was not even conceived of at the time the original Mathematicians, then-to-be Computer Scientists, conceived of Binary Search...
Now I myself wonder if these ideas might be extended to GPU's, that is, if the massively parallel execution capability of GPU's could be extended to search for data like Binary Search does, and what such an appropriately parallelized algorithm/data structure would look like... keep in mind, if we consider an updateable data structure, then that means that parts of it may need to be appropriately locked at the same time that multiple searches and updates are occurring simultaneously... so what data structure/algorithm would be the most efficient for a massively parallel scenario like that?
Anyway, great article and brilliant observations!
40x Faster Binary Search - This talk will first expose the lie that binary search takes O(lg n) time — it very much does not! Instead, we will see that binary search has only constant overhead compared to an oracle. Then, we will exploit everything that modern CPUs have to offer (SIMD, ILP, prefetching, efficient caching) in order to gain 40x increased throughput over the Rust standard library implementation.
#include <stdbool.h>
#include <stdint.h>
#include <arm_neon.h>
/*
 * Author: aam@fastmail.fm
 *
 * Apple M4 Max (P-core) variant of simd_quad which uses a key spline
 * to great effect (blog post summary incoming!)
 */
bool simd_quad_m4(const uint16_t *carr, int32_t cardinality, uint16_t pos) {
  enum { gap = 64 };
if (cardinality < gap) {
if (cardinality >= 32) {
// 32 <= n < 64: NEON-compare the first 32 as a single x4 load,
// sweep the remainder.
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8x4_t v = vld1q_u16_x4(carr);
uint16x8_t hit = vorrq_u16(
vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
if (vmaxvq_u16(hit) != 0) return true;
for (int32_t j = 32; j < cardinality; j++) {
uint16_t x = carr[j];
if (x >= pos) return x == pos;
}
return false;
}
if (cardinality >= 16) {
// 16 <= n < 32: paired x2 load + sweep tail.
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8x2_t v = vld1q_u16_x2(carr);
uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
vceqq_u16(v.val[1], needle));
if (vmaxvq_u16(hit) != 0) return true;
for (int32_t j = 16; j < cardinality; j++) {
uint16_t x = carr[j];
if (x >= pos) return x == pos;
}
return false;
}
if (cardinality >= 8) {
// 8 <= n < 16: single 128-bit compare + sweep tail.
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8_t v = vld1q_u16(carr);
if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
for (int32_t j = 8; j < cardinality; j++) {
uint16_t x = carr[j];
if (x >= pos) return x == pos;
}
return false;
}
for (int32_t j = 0; j < cardinality; j++) {
uint16_t v = carr[j];
if (v >= pos) return v == pos;
}
return false;
}
int32_t num_blocks = cardinality / gap;
int32_t base = 0;
int32_t n = num_blocks;
while (n > 3) {
int32_t quarter = n >> 2;
int32_t k1 = carr[(base + quarter + 1) * gap - 1];
int32_t k2 = carr[(base + 2 * quarter + 1) * gap - 1];
int32_t k3 = carr[(base + 3 * quarter + 1) * gap - 1];
int32_t c1 = (k1 < pos);
int32_t c2 = (k2 < pos);
int32_t c3 = (k3 < pos);
base += (c1 + c2 + c3) * quarter;
n -= 3 * quarter;
}
while (n > 1) {
int32_t half = n >> 1;
base = (carr[(base + half + 1) * gap - 1] < pos) ? base + half : base;
n -= half;
}
int32_t lo = (carr[(base + 1) * gap - 1] < pos) ? base + 1 : base;
if (lo < num_blocks) {
const uint16_t *blk = carr + lo * gap;
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8x4_t a = vld1q_u16_x4(blk);
uint16x8x4_t b = vld1q_u16_x4(blk + 32);
uint16x8_t h0 = vorrq_u16(
vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
uint16x8_t h1 = vorrq_u16(
vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
}
for (int32_t j = num_blocks * gap; j < cardinality; j++) {
uint16_t v = carr[j];
if (v >= pos) return v == pos;
}
return false;
}

/*
 * Spine variant, M4 edition.
 *
 * pack the interpolation probe keys into a dense contiguous region so the
 * cold-cache pointer chase streams through consecutive cache lines:
 *
 *   n=4096 -> 64 spine keys -> 128 B = 1 M4 cache line
 *   n=2048 -> 32 spine keys ->  64 B = half a line
 *   n=1024 -> 16 spine keys ->  32 B
 *
 * The entire interpolation phase for a max-sized Roaring container now
 * lives in one cache line. The final SIMD block check still loads from
 * carr.
 *
 * The num_blocks <= 3 fallback:
 * with very few blocks the carr-based probes accidentally prime the final
 * block's lines, which the spine path disrupts.
 */
bool simd_quad_m4_spine(const uint16_t *carr, const uint16_t *spine,
                        int32_t cardinality, uint16_t pos) {
  enum { gap = 64 };
if (cardinality < gap) {
// Same fast paths as simd_quad_m4 -- spine is irrelevant here.
if (cardinality >= 32) {
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8x4_t v = vld1q_u16_x4(carr);
uint16x8_t hit = vorrq_u16(
vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
if (vmaxvq_u16(hit) != 0) return true;
for (int32_t j = 32; j < cardinality; j++) {
uint16_t x = carr[j];
if (x >= pos) return x == pos;
}
return false;
}
if (cardinality >= 16) {
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8x2_t v = vld1q_u16_x2(carr);
uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
vceqq_u16(v.val[1], needle));
if (vmaxvq_u16(hit) != 0) return true;
for (int32_t j = 16; j < cardinality; j++) {
uint16_t x = carr[j];
if (x >= pos) return x == pos;
}
return false;
}
if (cardinality >= 8) {
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8_t v = vld1q_u16(carr);
if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
for (int32_t j = 8; j < cardinality; j++) {
uint16_t x = carr[j];
if (x >= pos) return x == pos;
}
return false;
}
for (int32_t j = 0; j < cardinality; j++) {
uint16_t v = carr[j];
if (v >= pos) return v == pos;
}
return false;
}
int32_t num_blocks = cardinality / gap;
if (num_blocks <= 3) {
return simd_quad_m4(carr, cardinality, pos);
}
int32_t base = 0;
int32_t n = num_blocks;
// Pull the whole spine into L1 up front. For n in [256, 4096] this is
// 1 line (128 B); for smaller n it is a partial line. Cheap on cold.
__builtin_prefetch(spine);
while (n > 3) {
int32_t quarter = n >> 2;
int32_t k1 = spine[base + quarter];
int32_t k2 = spine[base + 2 * quarter];
int32_t k3 = spine[base + 3 * quarter];
int32_t c1 = (k1 < pos);
int32_t c2 = (k2 < pos);
int32_t c3 = (k3 < pos);
base += (c1 + c2 + c3) * quarter;
n -= 3 * quarter;
}
while (n > 1) {
int32_t half = n >> 1;
base = (spine[base + half] < pos) ? base + half : base;
n -= half;
}
int32_t lo = (spine[base] < pos) ? base + 1 : base;
if (lo < num_blocks) {
const uint16_t *blk = carr + lo * gap;
uint16x8_t needle = vdupq_n_u16(pos);
uint16x8x4_t a = vld1q_u16_x4(blk);
uint16x8x4_t b = vld1q_u16_x4(blk + 32);
uint16x8_t h0 = vorrq_u16(
vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
uint16x8_t h1 = vorrq_u16(
vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
}
for (int32_t j = num_blocks * gap; j < cardinality; j++) {
uint16_t v = carr[j];
if (v >= pos) return v == pos;
}
return false;
}

// Build the spine for a given carr. Caller allocates cardinality/64 u16s.
void simd_quad_m4_build_spine(const uint16_t *carr, int32_t cardinality,
                              uint16_t *spine) {
  enum { gap = 64 };
  int32_t num_blocks = cardinality / gap;
  for (int32_t i = 0; i < num_blocks; i++) {
    spine[i] = carr[(i + 1) * gap - 1];
  }
}
[0] A nonsense thing to ask people to implement in an interview
You see this in rust, where they replaced the hash tables many years ago, the channel a couple of years ago, and most recently the sort implementations for both stable and unstable sort. I expect other languages / runtimes do similar things over time as well as CPUs change and new approaches are discovered.
It is looking like C++ 26 will get compile time reflection, which would make things like this even more feasible.