That XOR Trick (2020)(florian.github.io)
97 points by hundredwatt 2 days ago | 14 comments
praptak 2 hours ago
Fun fact: the xor swap fails when the variables are aliases. This was the trick used in one of the underhanded code competitions.

Basically xor swapping a[i] with a[j] triggered the evil logic when i was equal to j.

vaylian 45 minutes ago
It would set a[i] to zero instead of swapping two values, right?
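Right. A minimal sketch of that failure mode (function and variable names are mine, not from the competition entry):

    def xor_swap(a, i, j):
        a[i] ^= a[j]
        a[j] ^= a[i]
        a[i] ^= a[j]

    a = [5, 7]
    xor_swap(a, 0, 1)  # fine: a == [7, 5]
    xor_swap(a, 0, 0)  # aliased: the first step computes a[0] ^ a[0] == 0,
                       # and the remaining steps XOR with 0, so a[0] stays 0
    print(a)           # [0, 5]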
analog31 6 hours ago
The first thing that occurred to me is that if a number is missing from a list, the sum of that list will fall short. But I like XORs.
anitil 6 hours ago
It really tickles my brain in a lovely way that it avoids all overflow risk as well
repiret 4 hours ago
There is no overflow risk. The trick works on any Abelian group. N-bit values form an Abelian group under XOR, where 0 is the identity and every element is its own inverse. But N-bit values also form an Abelian group under addition with overflow, where 0 is the identity and two's complement is the inverse.

If you’re working on an architecture where a single multiplication and a bit shift is cheaper than N XORs, and where xor, add, and sub are all the same cost, then you can get a performance win by computing the sum as N(N+1)/2; and you don’t need a blog post to understand why it works.
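For concreteness, a sketch of that sum-based variant (my names; Python ints can't overflow, and in C the wrap-around addition group makes it work anyway):

    def missing_by_sum(A, n):
        # expected sum of 1..n via the closed form,
        # minus the sum of the n-1 values actually present
        return n * (n + 1) // 2 - sum(A)

    missing_by_sum([1, 2, 4, 5], 5)  # -> 3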

lblume 1 minute ago
You can also calculate the XOR-accumulation of all values between 1 and n in O(1) using a lookup table like this:

    [n, 1, n+1, 0][n%4]
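A quick brute-force check of that identity, for the skeptical:

    from functools import reduce
    from operator import xor

    # verify [n, 1, n+1, 0][n % 4] == 1 ^ 2 ^ ... ^ n for small n
    for n in range(1, 1000):
        assert [n, 1, n + 1, 0][n % 4] == reduce(xor, range(1, n + 1))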
analog31 5 hours ago
True, I hadn't thought of that. I'm spoiled by Python. ;-)
akovaski 3 hours ago
The partitioning algorithm to find two missing/duplicate numbers is clever, I wouldn't have thought of that. It should also work if you have a list with 1 missing and 1 duplicate, yeah? You'd probably have to do an extra step to actually find out which number is missing and which is a duplicate after you find the two numbers.

> If more than two elements are missing (or duplicated), then analyzing the individual bits fails because there are several combinations possible for both 0 and 1 as results. The problem then seems to require more complex solutions, which are not based on XOR anymore.

If you consider XOR to be a little more general, I think you can still use something like the partitioning algorithm. That is to say, since XOR on a bit level behaves like XOR_bit(a,b) = (a+b) % 2, you might consider a generalized XOR_bit(a,b,k) = (a+b) % k. With this I think you can decide partitions with up to k missing numbers, but I'm too tired to verify/implement this right now.
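For the plain two-element case, though, the partitioning looks roughly like this (a sketch of the article's idea as I read it, with my names), and it does cover the 1-missing/1-duplicate case:

    def find_two_odd_ones_out(A, n):
        # XOR of A and of 1..n leaves x ^ y, where x and y are the two
        # values appearing an odd number of times overall
        xy = 0
        for v in A:
            xy ^= v
        for v in range(1, n + 1):
            xy ^= v
        bit = xy & -xy  # lowest bit where x and y differ
        x = 0
        for v in A:
            if v & bit:
                x ^= v
        for v in range(1, n + 1):
            if v & bit:
                x ^= v
        return x, x ^ xy

    find_two_odd_ones_out([1, 2, 2, 4], 4)  # -> (3, 2): 3 missing, 2 duplicated

As noted above, one extra membership check (e.g. `3 in A`) is still needed to tell which of the two is the duplicate.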

mrbluecoat 5 hours ago
PTSD for me on this topic due to a week wasted cleaning up PHP malware using XOR for obfuscation and encryption: https://www.godaddy.com/resources/news/php-malware-and-xor-e...
hsfzxjy 5 hours ago
To derive "The XOR trick" I think both *associativity* and *commutativity* are needed.

That is, one should also prove a ^ (b ^ c) = (a ^ b) ^ c. Intuitive, but non-trivial.

ameliaquining 6 hours ago
Ah, my least favorite technical interview question. (I've been asked it, but only after I first read about it online.)
phendrenad2 5 hours ago
Indeed, it kind of feels like asking if someone knows what the number 5318008 means.
motorest 1 hour ago
> Ah, my least favorite technical interview question.

The epitome of turning technical interviews into a trivia contest to make them feel smart. Because isn't that the point of a tech interview?

empiko 58 minutes ago
Is there any other field where they give you random brain teasers in an interview? My friends outside of IT were laughing their heads off when they heard about the usual interview process.
sfn42 46 minutes ago
I've always had reasonable interview questions. Get some data from an API and display it in a table. Make a class that can store car data and get them by plate number. Make a class that calculates tax based on a bracket system.

I haven't even read the article, so I don't know what this is about really, but if an interviewer seriously asked me about some obscure XOR trick I'd laugh at them.

mzs 2 hours ago
I like the 'store prev ^ next' trick for lists that can be walked from the front or from the back.
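A toy sketch of that trick in Python, using list indices as "pointers" since Python has no raw addresses (names mine):

    NIL = -1  # any fixed sentinel works, since (p ^ n) ^ p == n

    def build(values):
        # each node is [value, link] with link = prev_index ^ next_index
        nodes = []
        last = len(values) - 1
        for i, v in enumerate(values):
            prev_i = i - 1 if i > 0 else NIL
            next_i = i + 1 if i < last else NIL
            nodes.append([v, prev_i ^ next_i])
        return nodes

    def walk(nodes, start):
        # identical code walks forward (start=0) or backward (start=len-1)
        prev, cur = NIL, start
        while cur != NIL:
            value, link = nodes[cur]
            yield value
            prev, cur = cur, link ^ prev

    nodes = build([10, 20, 30, 40])
    assert list(walk(nodes, 0)) == [10, 20, 30, 40]
    assert list(walk(nodes, len(nodes) - 1)) == [40, 30, 20, 10]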
burnt-resistor 6 hours ago
In ye olden days, bit manip operations were faster than algebraic operations.

And sometimes even faster than a load immediate, hence XOR AX, AX instead of MOV AX, 0.

GuB-42 5 hours ago
"xor ax, ax" is still in use today. The main advantage is that it is shorter, just 2 bytes instead of 3 for the immediate, the difference is bigger in 32 and 64 bit mode as you have to have all these zeroes in the instruction.

Shorter usually means faster, even if the instruction itself isn't faster.

sparkie 2 hours ago
In long mode, compilers will typically emit `xor eax, eax`, as it only needs 2 bytes: The opcode and modrm byte. `xor ax, ax` takes 3 bytes due to the operand size override prefix (0x66), and `xor rax, rax` takes 3 bytes due to the REX.W prefix. `xor eax, eax` will still clear the full 64-bit register.

Shorter basically means you can fit more in instruction cache, which should in theory improve performance marginally.

Someone 42 minutes ago
Size isn’t everything. You should start by reading the manual for your CPU to see what it advises. The micro-architecture may treat only one of the sequences specially. For modern x64, I think that indeed is the shorter xor sequence, where, internally, the CPU just renames the register to a register that always contains zero, making the instruction independent of any earlier instructions using eax.

IIRC, Intel said a mov was the way to go for some now ancient x86 CPUs, though.

tyfighter 4 hours ago
Modern x86 implementations don't even do the XOR. It just renames the register to "zero".
burnt-resistor 5 hours ago
Barely. x86 is fading. Arm doesn't do this in GCC or Clang.

> Shorter usually means faster

It depends, so spouting generalities doesn't mean anything. Instruction cache line filling vs. cycle reduction vs. reservation station ordering is typically a constrained-optimization problem for the compiler.

userbinator 4 hours ago
> Arm doesn't do this in GCC or Clang.

Because Arm64 has a zero register, Arm32 has small immediates, and all instructions are a uniform length.

XeO3 4 hours ago
Apart from these applications of XOR, a favourite one is using Bitwise AND to find Even/Odd numbers.
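I.e., checking the lowest bit — a tiny sketch:

    def is_odd(n):
        # the least-significant bit is 1 exactly for odd integers
        # (works for negatives too, since Python's & acts on two's complement)
        return n & 1 == 1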
nullc 6 hours ago
Generalizing an 'xor accumulator' to support set differences of more than one element is interesting: https://github.com/bitcoin-core/minisketch
moron4hire 6 hours ago
Why do people hate traditional for loops so much? In a conversation about petty micro optimizations, we end up performing two loops instead of one, all because sticking three operations in one statement is "yucky"?
devjab 26 minutes ago
I think you raise a good question, but Python doesn't have a traditional for loop. To do it in one loop, you'd either have to simulate a traditional for loop with something like range, or you'd have to build a C/Zig/Rust lib and use it with cffi (or whatever Rust's equivalent is called, I forget the name). Or you're going to do it the "pythonic" way and write two loops, probably with a generator. As far as micro-optimisation goes, I'd argue that it depends on what you want: speed or stable memory consumption? The single loop will be faster (for the most part), but the flip side is that there is a limit on how big a data set it can handle.

It's all theoretical though. On real world data sets that aren't small I don't see why you wouldn't hand these tasks off to C/Zig/Rust unless you're only running them once or twice.

delifue 5 hours ago
Its main benefit is avoiding an extra data structure (like a hash map) to find the missing or duplicate element, using O(n) time and O(1) space.
moron4hire 5 hours ago
No, again, that's not my point. The code from the article is O(2n) when it could be O(n). I know we're not supposed to care about constant factors, but I've lived in a world where not micro optimizing the ever loving shit out of my software could potentially make people throw up, so this sort of stuff kind of stands out to me.
repiret 4 hours ago
The code in the article is written in Python, whose only for loop is for-each. It is 2N XOR operations, regardless of whether you use one or two loops.

I probably would have written it with a single loop, using the `enumerate` iterator adapter. But in Python, two loops is almost certainly more efficient.

Dylan16807 4 hours ago
You can loop over the range and then do result ^= i ^ A[i]. If adapters are slow you don't need them here.

Having only one loop gives you a better memory access pattern, because it's 2 XOR operations in between each memory access. Two loops is the same number of instructions, but it spends one loop ignoring memory and then another loop doing rapid-fire memory accesses. In python there's enough overhead that it's unlikely to matter. But in a faster language running on a busy machine that could make a real difference.
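Concretely, one way to write the single loop, adjusting indices so it works when A holds n-1 of the values 1..n as in the article:

    def find_missing(A, n):
        result = 0
        for i, x in enumerate(A, start=1):
            result ^= i ^ x  # fold in one index and one list entry per step
        return result ^ n    # n is the one index with no matching entry

    find_missing([1, 2, 4], 4)  # -> 3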

NoahZuniga 3 hours ago
O(2n) doesn't exist. The whole point of big O is that you ignore such "trivial" things as what factor comes before the n
moron4hire 3 hours ago
Did I not say that?
haskellshill 2 hours ago
>we end up performing two loops instead of one, all because sticking three operations in one statement is "yucky"

You seem to believe that "O(2n)"

  for value in range(1, n + 1):
    result ^= value
  for value in A:
    result ^= value
is slower than "O(n2)"

  for value in range(1, n + 1):
    result ^= value
    result ^= A[value-1]
simply because the latter has one "for loop" less. Am I misunderstanding you, or if not, why would this matter for speed?
a_t48 1 hour ago
Unless both loops get unrolled it's ever so slightly slower due to having to check for the end value twice. Plus potentially a cache miss at the start of the second loop.
codebje 2 hours ago
You did, but it might not be an effective strategy to mention asymptotic complexity to help forward your argument that one linear implementation is faster than another.

Whether it's a win in Python to use one or two loops isn't so clear, as a lot is hidden behind complex opcodes and opaque iterator implementations. Empirical testing might help, but a new interpreter version could change your results.

In any case, if we want to nitpick over performance we should be insisting on a parallel implementation to take advantage of the gobs of cores CPUs now have, but now we're on a micro-optimisation crusade and are ignoring the whole point of the article.

perfmode 5 hours ago
Real-world performance will depend on how much of that N fits in cache, and in which cache it fits (L1, L2, L3). Once loaded, you may not pay much cost to access each value a second time.
MindSpunk 4 hours ago
Doing 2 loops over the data means you have to do a full pass over the data twice. If your N doesn’t fit in L3 then you’re going to load N twice instead of once. Loading twice, even out of L1 is still slower than never loading twice at all.
moron4hire 4 hours ago
Exactly. And there's also the fact that sometimes the data stream we're processing is unbounded and ephemeral. For example, reading values from a physical sensor. It may not match up to this specific example, but the point remains that a single "loop" over a data set might be all you get, so pack as much into that loop as you can.
ToValueFunfetti 4 hours ago
xor(1..n) = switch(n % 4) { case 0: return n; case 1: return 1; case 2: return n + 1; default: return 0; }

So you don't actually need the first loop (at least for the set of integers 1..n example), but bringing that up is probably out of scope for this article.

anitil 6 hours ago
I think it's just an interesting approach to solving particular limited problems. If I needed to solve this I'd end up either using set arithmetic or sorting the list, both of which use more memory and time. Maybe down low in some compiler loop or JVM loop this could be the difference between a sluggish application and a snappy one
moron4hire 5 hours ago
That's not my point. My point is that the exact same code from the original article could be done in a single, traditional for-loop, instead of two for-each loops.
st0le 7 hours ago
Another fun trick I've discovered.

`XOR[0...n] = 0 ^ 1 .... ^ n = [n, 1, n + 1, 0][n % 4]`

nullc 5 hours ago
Tables yuck :P, maybe

XOR[0...x] = (x&1^(x&2)>>1)+x*(~x&1)

bsdz 23 minutes ago
~Is there a simple proof for this type of identity?~

Actually I found something through Gemini based on the table mod 4 idea in the previous post. Thanks.

tialaramex 6 hours ago
Right, or in summary: no, you don't need to do all that extra work up front.
makeset 2 hours ago
Fun fact: you can show that there is another binary operator that performs the same triple assignment swap.
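If the operator meant is XNOR (my guess; the comment doesn't say), the check goes through because the extra negations cancel across the three steps:

    def xnor(x, y):
        # ~ in Python acts on arbitrary-precision two's complement,
        # so no width masking is needed for the identity to hold
        return ~(x ^ y)

    a, b = 5, 9
    a = xnor(a, b)
    b = xnor(a, b)  # b is now the original a
    a = xnor(a, b)  # a is now the original b
    assert (a, b) == (9, 5)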
johnea 7 hours ago
Wow! This is a flashback! Hope you're doing well Andy W!