**Faster addition and subtraction on modern CPUs**

Do you remember how to do long addition on paper?

```
¹¹ ¹
6876
+ 3406
------
10282
```

Starting from the “ones” position, we add 6 + 6 = 12, write down a 2 and carry a 1. We proceed to the left, one position at a time, until there are no more digits to add.

When implementing addition for large integers (e.g. 2^{64} and above), it’s common to write
code that looks quite similar to this algorithm.
Interestingly, there’s a straightforward trick that can speed up this
process enormously on modern CPUs.

But first, a question: why do we start long addition with the “ones”? Why not start on the left?

The answer, of course, is the carries. We can’t figure out for sure what a given digit of the answer will be until we’ve completed all of the additions to the right of that digit.

Imagine if we tried to add left-to-right instead:

6 + 3 = 9. So the first digit is 9.

8 + 4 = 12. OK, the second digit is 2… but carry a 1, so the first digit was actually 9 + 1 = 10… now carry backthat1…

For mental math, this isn’t too bad (and some people actually prefer it when working with small enough numbers). As an algorithm, however, this approach has some fundamental limitations that become clear when working with larger numbers. Most importantly, because the later parts of the computation rely on information from the earlier parts of the computation, it’s hard to split up and parallelize the work.

## What about computers?

Computers don’t work in base 10, of course. Instead, modern desktop and server CPUs expose an interface for operating on (for the most part) 64-bit integers.

As long as our numbers fit within a single 64-bit value, things are easy.
But what if we want to add, say, two 256-bit integers, `x`

and `y`

?

The obvious solution would be to break up each 256-bit number into four 64-bit
pieces (commonly referred to as “limbs”).
Place the highest 64 bits of `x`

into register A,
the next 64 bits into register B,
and so on for registers C and D.
Do the same for `y`

with registers E, F, G, H.

Now we can add `x`

and `y`

by adding the corresponding parts:

But wait, this might give us the wrong result! If one of the last three additions overflow, then we need to “carry” that extra 1 up to the next 64-bit piece. Oh hey, does that sound familiar?

Fortunately, x86 has a dedicated instruction for this called “add with carry”.
`adc`

will automatically check if the previous operation overflowed, adding 1
if needed.
Here’s how the proper code would look:

Just like with long addition in base 10, we start with the least-significant “digits” (D and H) and work our way up to the most-significant “digits” (A and E), carrying 1s as needed along the way.

## But now it’s slow

Interestingly, our fixed code is slower than the original (incorrect) code. Much slower. Why is this?

The first reason is that `adc`

is just slower to execute than a normal `add`

on
most popular x86 CPUs.
Since `adc`

has a third input (the carry flag),
it’s a more complex instruction than `add`

.
It’s also used less often than `add`

,
so there is less incentive for CPU designers to spend chip area on optimizing
`adc`

performance.

The second reason is more interesting. Let’s look at the Intel Haswell microarchitecture as an example.

On a Haswell CPU, a single `add`

instruction takes 1 cycle to execute.
However, in ideal conditions, Haswell CPUs can execute up to 4 `add`

instructions in a single cycle.
How is this possible? Parallelism.
Modern processors look ahead at what instructions are coming up and try to
schedule them so that they can be executed in parallel whenever possible.
Since Haswell CPUs have 8 execution ports, and 4 of those ports can execute an
integer `add`

instruction, a Haswell processor can execute up to 4 `add`

instructions at once.

In our original adding code, all 4 `add`

instructions were independent of one
another, so it was straightforward for the processor to run them in parallel.
**Now, with adc, each instruction depends on an output from the previous
instruction.**
The processor has no choice but to execute the instructions serially, one after
the other, instead of in parallel.

The performance difference is even more dramatic if we use SIMD (Single
Instruction, Multiple Data) instructions.
For example, a single `vpaddq`

(Vector Packed Add Quadword) instruction does
four 64-bit adds simultaneously.
Combine that with the fact that Haswell processors can execute two `vpaddq`

s
per cycle, and you can see that we’re taking a serious performance hit
in order to handle carries properly.

## Eliminating carries, part 1: on paper

Back to base 10 for a minute. How can we eliminate the need for carries?

Let’s make some changes to how the number system works. First, we’ll extend the range of digits available. Instead of 0-9, we will use 0-9, A-Z, and *:

```
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*
```

(Yeah, I needed an extra character to make the numbers work out nicely. Bear with me.)

Although we have 37 digits, we are *not* using base 37.
Numbers will still have “ones”, “tens”, and “hundreds” positions,
just like a normal base 10 system.
29 still means 29, and 29 + 1 is still 30.
The only difference is that
digits happen to be capable of counting past 9:
29 + 1 could also be written as 2A, 1K, or even U.

With this new, more flexible number system, we can add without needing any carries!

```
672415
+ 736606
--------
DA8A1B
```

This trick won’t work for all numbers in our number system (e.g. 9 + W will
need a carry),
but it will work if the numbers we are adding are
*normalized*, i.e. all of their digits are 9 or below.
In fact, we can add up to four normalized numbers in this notation before any
carries are possible:

```
999 <-- largest possible normalized 3-digit number
999
999
+ 999
-----
*** <-- valid 3-digit result, no carries
(recall that * is the highest digit)
```

So, with some clever tweaks to the number system, we’ve cheated our way out of
some carries.
Of course, at some point, we will need to convert from the
37-digit base 10 system back to normal base 10.
We can do that by *normalizing* a number such that each of its digits is
between 0 and 9:

```
¹¹ ¹ ¹
DA8A1B
= 1409021
note:
D = 10 + 3
A = 10 + 0
B = 10 + 1
```

We normalize a number starting at the right, determining how many “tens” are in each digit, subtracting those “tens”, and carrying them to the next digit. 672415 and 736606 do in fact sum to 1409021, so the system works!

The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end.

## Eliminating carries, part 2: computers

Carry propagation was at the heart of the performance problems we encountered earlier. As you’ve probably anticipated by now, we can use this technique to help speed up big number arithmetic!

Previously, we split a 256-bit number into four 64-bit pieces,
since x86_64 processors operate on 64-bit integers.
One way to understand this is to view the pieces as “digits”
in base 2^{64}, since each digit has a value
between 0 and 2^{64} - 1 (inclusive).

In base 10, we kept the same base, but extended the range of digits that were allowed in order to prevent carries from occurring. Unfortunately, we can’t do that here – a 64-bit integer only has so many possible values, and we can’t change the hardware. Instead, we can get the same effect by reducing the size of the base.

Instead of splitting 256 bits into four base 2^{64} digits,
we’ll split 256 bits into five base 2^{51} digits.
Each digit can still range from 0 to 2^{64} - 1,
but the smaller base gives us the flexibility needed to prevent digits from
needing a carry.
This technique is generally referred to as “radix 2^{51}
representation” in the cryptography literature.

Here’s how it will look when we split 256 bits across five limbs (i.e. digits):

```
| [--------------------- 52 bits --------------------]|
| [-------------------- 51 bits --------------------]|
| [-------------------- 51 bits --------------------]|
| [-------------------- 51 bits --------------------]|
| [-------------------- 51 bits --------------------]|
```

Each limb has 51 (or 52) bits of the original 256-bit number. The remaining 12 or 13 bits give us the extra “digits” we need for preventing carries. Effectively, the highest bits of each limb are reserved as storage for any carries that occur during the computation.

In our base 10 example,
37 digits allowed us to add up to four normalized numbers before needing to
propagate carries.
In radix 2^{51} representation,
2^{64} digits allow us to add up to 2^{13} normalized
numbers before we need to worry about the high 13 bits overflowing.

*Aside: Why 13 bits instead of 12?
For our purposes, we’re going to ignore the carries in the most significant limb,
allowing numbers to wrap when they overflow past 2 ^{256} - 1 (just like
how unsigned addition works in C with normal size integer types).
As a result, we can assign 52 bits to the most significant limb and ignore the
fact that it will run out of room for carries before the other limbs do.*

With this new representation, our addition code now looks like:

Despite the fact that we now need 5 `add`

s instead of 4,
addition is much faster due to the lack of carries.

Of course, we also need code to normalize a number by propagating carries.

Amazingly, some quick and dirty benchmarks show that
**radix 2 ^{51} addition already outperforms radix 2^{64}
addition on my Haswell CPU for as few as three additions – and that’s
including the cost of converting to and from
radix 2^{51} representation**.
The performance savings scale up appropriately as the number of additions
increases.

## Subtraction

So far we’ve only looked at addition.
It’s straightforward though to extend this technique to subtraction.
The main difference between addition and subtraction is that subtraction has
*negative* carries.

Previously, we treated all limbs (and their carries) as unsigned
integers.
To support subtraction, we can treat limbs as *signed* integers,
allowing individual digits to be either positive or negative.
With this change, each limb can store either a positive or negative carry.

A side effect of this is that the most significant bit of each limb is now
reserved as a sign bit.
This lowers the number of operations we can perform between normalizations from
2^{13} to 2^{12} – a small sacrifice in most cases.

I find this technique rather fascinating because of how counterintuitive it is: by spreading data across more registers and using more operations, performance is actually improved. I hope you found it as interesting as I did!