Faster addition and subtraction on modern CPUs

Do you remember how to do long addition on paper?

 ¹¹ ¹
  6876
+ 3406
------
 10282

Starting from the “ones” position, we add 6 + 6 = 12, write down a 2 and carry a 1. We proceed to the left, one position at a time, until there are no more digits to add.

When implementing addition for large integers (e.g. 264 and above), it’s common to write code that looks quite similar to this algorithm. Interestingly, there’s a straightforward trick that can speed up this process enormously on modern CPUs.

But first, a question: why do we start long addition with the “ones”? Why not start on the left?

The answer, of course, is the carries. We can’t figure out for sure what a given digit of the answer will be until we’ve completed all of the additions to the right of that digit.

Imagine if we tried to add left-to-right instead:

6 + 3 = 9. So the first digit is 9.
8 + 4 = 12. OK, the second digit is 2… but carry a 1, so the first digit was actually 9 + 1 = 10… now carry back that 1…

For mental math, this isn’t too bad (and some people actually prefer it when working with small enough numbers). As an algorithm, however, this approach has some fundamental limitations that become clear when working with larger numbers. Most importantly, because the later parts of the computation rely on information from the earlier parts of the computation, it’s hard to split up and parallelize the work.

What about computers?

Computers don’t work in base 10, of course. Instead, modern desktop and server CPUs expose an interface for operating on (for the most part) 64-bit integers.

; Add the 64-bit value in B to the 64-bit value in A
add A, B
; Note: I'll use letters instead of real register names to keep things simple

As long as our numbers fit within a single 64-bit value, things are easy. But what if we want to add, say, two 256-bit integers, x and y?

The obvious solution would be to break up each 256-bit number into four 64-bit pieces (commonly referred to as “limbs”). Place the highest 64 bits of x into register A, the next 64 bits into register B, and so on for registers C and D. Do the same for y with registers E, F, G, H.

Now we can add x and y by adding the corresponding parts:

; Equivalent to x += y
add A, E
add B, F
add C, G
add D, H

But wait, this might give us the wrong result! If one of the last three additions overflow, then we need to “carry” that extra 1 up to the next 64-bit piece. Oh hey, does that sound familiar?

Fortunately, x86 has a dedicated instruction for this called “add with carry”. adc will automatically check if the previous operation overflowed, adding 1 if needed. Here’s how the proper code would look:

add D, H
adc C, G ; include carry from previous op
adc B, F ; include carry from previous op
adc A, E ; include carry from previous op

Just like with long addition in base 10, we start with the least-significant “digits” (D and H) and work our way up to the most-significant “digits” (A and E), carrying 1s as needed along the way.

But now it’s slow

Interestingly, our fixed code is slower than the original (incorrect) code. Much slower. Why is this?

The first reason is that adc is just slower to execute than a normal add on most popular x86 CPUs. Since adc has a third input (the carry flag), it’s a more complex instruction than add. It’s also used less often than add, so there is less incentive for CPU designers to spend chip area on optimizing adc performance.

The second reason is more interesting. Let’s look at the Intel Haswell microarchitecture as an example.

On a Haswell CPU, a single add instruction takes 1 cycle to execute. However, in ideal conditions, Haswell CPUs can execute up to 4 add instructions in a single cycle. How is this possible? Parallelism. Modern processors look ahead at what instructions are coming up and try to schedule them so that they can be executed in parallel whenever possible. Since Haswell CPUs have 8 execution ports, and 4 of those ports can execute an integer add instruction, a Haswell processor can execute up to 4 add instructions at once.

In our original adding code, all 4 add instructions were independent of one another, so it was straightforward for the processor to run them in parallel. Now, with adc, each instruction depends on an output from the previous instruction. The processor has no choice but to execute the instructions serially, one after the other, instead of in parallel.

The performance difference is even more dramatic if we use SIMD (Single Instruction, Multiple Data) instructions. For example, a single vpaddq (Vector Packed Add Quadword) instruction does four 64-bit adds simultaneously. Combine that with the fact that Haswell processors can execute two vpaddqs per cycle, and you can see that we’re taking a serious performance hit in order to handle carries properly.

Eliminating carries, part 1: on paper

Back to base 10 for a minute. How can we eliminate the need for carries?

Let’s make some changes to how the number system works. First, we’ll extend the range of digits available. Instead of 0-9, we will use 0-9, A-Z, and *:

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*

(Yeah, I needed an extra character to make the numbers work out nicely. Bear with me.)

Although we have 37 digits, we are not using base 37. Numbers will still have “ones”, “tens”, and “hundreds” positions, just like a normal base 10 system. 29 still means 29, and 29 + 1 is still 30. The only difference is that digits happen to be capable of counting past 9: 29 + 1 could also be written as 2A, 1K, or even U.

With this new, more flexible number system, we can add without needing any carries!

  672415
+ 736606
--------
  DA8A1B

This trick won’t work for all numbers in our number system (e.g. 9 + W will need a carry), but it will work if the numbers we are adding are normalized, i.e. all of their digits are 9 or below. In fact, we can add up to four normalized numbers in this notation before any carries are possible:

  999   <-- largest possible normalized 3-digit number
  999
  999
+ 999
-----
  ***   <-- valid 3-digit result, no carries
            (recall that * is the highest digit)

So, with some clever tweaks to the number system, we’ve cheated our way out of some carries. Of course, at some point, we will need to convert from the 37-digit base 10 system back to normal base 10. We can do that by normalizing a number such that each of its digits is between 0 and 9:

  ¹¹ ¹ ¹
   DA8A1B
= 1409021

note:
D = 10 + 3
A = 10 + 0
B = 10 + 1

We normalize a number starting at the right, determining how many “tens” are in each digit, subtracting those “tens”, and carrying them to the next digit. 672415 and 736606 do in fact sum to 1409021, so the system works!

The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end.

Eliminating carries, part 2: computers

Carry propagation was at the heart of the performance problems we encountered earlier. As you’ve probably anticipated by now, we can use this technique to help speed up big number arithmetic!

Previously, we split a 256-bit number into four 64-bit pieces, since x86_64 processors operate on 64-bit integers. One way to understand this is to view the pieces as “digits” in base 264, since each digit has a value between 0 and 264 - 1 (inclusive).

In base 10, we kept the same base, but extended the range of digits that were allowed in order to prevent carries from occurring. Unfortunately, we can’t do that here – a 64-bit integer only has so many possible values, and we can’t change the hardware. Instead, we can get the same effect by reducing the size of the base.

Instead of splitting 256 bits into four base 264 digits, we’ll split 256 bits into five base 251 digits. Each digit can still range from 0 to 264 - 1, but the smaller base gives us the flexibility needed to prevent digits from needing a carry. This technique is generally referred to as “radix 251 representation” in the cryptography literature.

Here’s how it will look when we split 256 bits across five limbs (i.e. digits):

|            [--------------------- 52 bits --------------------]|
|             [-------------------- 51 bits --------------------]|
|             [-------------------- 51 bits --------------------]|
|             [-------------------- 51 bits --------------------]|
|             [-------------------- 51 bits --------------------]|

Each limb has 51 (or 52) bits of the original 256-bit number. The remaining 12 or 13 bits give us the extra “digits” we need for preventing carries. Effectively, the highest bits of each limb are reserved as storage for any carries that occur during the computation.

In our base 10 example, 37 digits allowed us to add up to four normalized numbers before needing to propagate carries. In radix 251 representation, 264 digits allow us to add up to 213 normalized numbers before we need to worry about the high 13 bits overflowing.

Aside: Why 13 bits instead of 12? For our purposes, we’re going to ignore the carries in the most significant limb, allowing numbers to wrap when they overflow past 2256 - 1 (just like how unsigned addition works in C with normal size integer types). As a result, we can assign 52 bits to the most significant limb and ignore the fact that it will run out of room for carries before the other limbs do.

With this new representation, our addition code now looks like:

; Assume x is split across A, B, C, D, E (A = most significant)
; and assume y is split across F, G, H, I, J (F = most significant)
add A, F
add B, G
add C, H
add D, I
add E, J
; Parallel goodness, yay!

Despite the fact that we now need 5 adds instead of 4, addition is much faster due to the lack of carries.

Of course, we also need code to normalize a number by propagating carries.

; Assume x is split across A, B, C, D, E (A = most significant)
; Register T is for temporary storage

mov T, E ; Copy E into T
shr T, 51 ; Shift out everything except the carries
add D, T ; Add carries from E into D
and E, 0x0007FFFFFFFFFFFF ; Zero out the carries in E

mov T, D ; Copy D into T
shr T, 51 ; Shift out everything except the carries
add C, T ; Add carries from D into C
and D, 0x0007FFFFFFFFFFFF ; Zero the carries in D

mov T, C ; Copy C into T
shr T, 51 ; Shift out everything except the carries
add B, T ; Add carries from C into B
and C, 0x0007FFFFFFFFFFFF ; Zero the carries in C

mov T, B ; Copy B into T
shr T, 51 ; Shift out everything except the carries
add A, T ; Add carries from B into A
and B, 0x0007FFFFFFFFFFFF ; Zero the carries in B

and A, 0x000FFFFFFFFFFFFF ; Zero the carries in A

Amazingly, some quick and dirty benchmarks show that radix 251 addition already outperforms radix 264 addition on my Haswell CPU for as few as three additions – and that’s including the cost of converting to and from radix 251 representation. The performance savings scale up appropriately as the number of additions increases.

Subtraction

So far we’ve only looked at addition. It’s straightforward though to extend this technique to subtraction. The main difference between addition and subtraction is that subtraction has negative carries.

Previously, we treated all limbs (and their carries) as unsigned integers. To support subtraction, we can treat limbs as signed integers, allowing individual digits to be either positive or negative. With this change, each limb can store either a positive or negative carry.

A side effect of this is that the most significant bit of each limb is now reserved as a sign bit. This lowers the number of operations we can perform between normalizations from 213 to 212 – a small sacrifice in most cases.

I find this technique rather fascinating because of how counterintuitive it is: by spreading data across more registers and using more operations, performance is actually improved. I hope you found it as interesting as I did!