Fast inverse square root

June 17, 2020

Let’s consider the following seemingly simple problem. Given a number \(x\), calculate its inverse (reciprocal) square root \(\frac{1}{\sqrt{x}}\). This is a very practical problem: it shows up, for instance, when normalizing vectors for graphics calculations. Definitely not a hard math problem, right? As a computation problem, however, there is some subtlety involved.

Nowadays doing such floating-point operations is inexpensive, especially with compiler optimizations that can emit the dedicated x86 instruction rsqrtss. Back in the late 1990s, however, quickly calculating such an expression, especially if the calculation was needed very often, represented a real issue. At the time, the accepted methods relied on lookup tables constructed for computing inverse square roots.

The source code of the game Quake III takes a very different approach, using a mysterious approximation method. Here is the code (written in C), along with the original colorful comments:

 1  float Q_rsqrt( float number )
 2  {
 3      long i;
 4      float x2, y;
 5      const float threehalfs = 1.5F;
 6
 7      x2 = number * 0.5F;
 8      y  = number;
 9      i  = * ( long * ) &y;                      // evil floating point bit level hacking
10      i  = 0x5f3759df - ( i >> 1 );              // what the fuck?
11      y  = * ( float * ) &i;
12      y  = y * ( threehalfs - ( x2 * y * y ) );  // 1st iteration
13  //  y  = y * ( threehalfs - ( x2 * y * y ) );  // 2nd iteration, this can be removed
14
15      return y;
16  }

What is going on here? The constant 0x5f3759df seems to come from nowhere, there’s a bit shift for some reason, and the value is mysteriously reinterpreted back and forth between float and long. Let’s start to pick this code apart to understand what is happening.
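Before digging in, it is worth convincing ourselves that the function really does approximate the reciprocal square root. Here is a minimal test harness of my own (not from Quake): compile it together with the Q_rsqrt definition above and link with -lm.

#include <math.h>
#include <stdio.h>

float Q_rsqrt( float number );   /* defined above */

int main(void) {
    float inputs[] = { 0.15625f, 1.0f, 2.0f, 10.0f, 12345.0f };
    for (int k = 0; k < 5; k++) {
        float x = inputs[k];
        /* compare the approximation against the straightforward computation */
        printf("x = %10.5f   Q_rsqrt = %.7f   1/sqrtf = %.7f\n",
               x, Q_rsqrt(x), 1.0f / sqrtf(x));
    }
    return 0;
}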

32-Bit Floating-Point Format

A crucial step in this algorithm is line 9 (evil floating point bit level hacking), where we reinterpret the bits of our floating-point number as a long, an integer data type. What exactly does this mean? For every floating-point number \(x\), we can rewrite it as a normalized binary number so that:

\[ x = \pm 2^{e_x} (1 + m_x) \]

where \(e_x\) is an integer and \(m_x \in [0, 1)\). Note the similarity to scientific notation; all that we have done is use base 2 instead. To represent this as a 32-bit binary number, we store shifted and scaled versions of these values:

\[\begin{align} S_x &= \left\{\begin{array}{lr} 0, & \text{for } x > 0 \\ 1, & \text{for } x < 0 \\ \end{array}\right\}\\ E_x &= e_x + B, \quad B = 127\\ M_x &= m_x \times L, \quad L = 2^{23} \end{align}\]

The bias \(B\) makes the stored exponent non-negative, and the scale factor \(L\) turns the fractional part \(m_x\) into a 23-bit integer. As an example, we would represent the floating-point number \(0.15625 = 2^{-3} \times 1.25\) as:

(Figure: the 32-bit layout of 0.15625, split into its sign, exponent, and mantissa fields.)

since we have:

\[\begin{align} S_x &= 0\\ E_x &= -3 + 127 = 124 = \text{0111 1100}_2\\ M_x &= 0.25 \times 2^{23} = \text{010 0000 0000 0000 0000 0000}_2 \end{align}\]

Note that the part following the exponent, \(M_x\), is referred to as the mantissa.

This reinterpretation is exactly what is happening in line 9 with i = * ( long * ) &y;, where we grab this binary representation as an integer. Since the fields are concatenated (and the sign bit is 0 for positive \(x\)), this integer representation is equal to:

\[ I_x = E_x L + M_x \]

since concatenating the fields shifts \(E_x\) left by 23 bits (multiplying it by \(L = 2^{23}\)) and then adds \(M_x\).
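Here is a small sketch of my own showing this decomposition for the example above. It reinterprets the bits with memcpy (which sidesteps the strict-aliasing issue of the pointer cast in the original code) and checks that \(I_x = E_x L + M_x\):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float x = 0.15625f;

    uint32_t Ix;
    memcpy(&Ix, &x, sizeof Ix);          /* the bit pattern of x, read as an integer */

    uint32_t Sx = Ix >> 31;              /* sign bit */
    uint32_t Ex = (Ix >> 23) & 0xff;     /* biased exponent field */
    uint32_t Mx = Ix & 0x7fffff;         /* mantissa field */

    /* for 0.15625 this prints S = 0, E = 124, M = 2097152 */
    printf("S = %u, E = %u, M = %u\n", Sx, Ex, Mx);

    /* I_x = E_x * L + M_x for positive x, with L = 2^23 = 8388608 */
    assert(Ix == Ex * 8388608u + Mx);
    return 0;
}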

A Logarithmic Approximation

First, note that since we are taking square roots, we can assume that our value \(x\) is positive. Since we have our number in a format that is multiplied by a power of two, we can do the following:

\[\begin{align} \log_2(x) &= \log_2(2^{e_x} (1 + m_x))\\ &= \log_2(2^{e_x}) + \log_2(1 + m_x)\\ &= e_x + \log_2(1 + m_x) \end{align}\]

Let’s find a series expansion for the remaining logarithm. Recall the geometric series:

\[ \frac{1}{1+t} = 1 - t + t^2 - t^3 + \dots \]

but we also have that:

\[ \ln(1+x) = \int_0^x \frac{1}{1+t} \, \mathrm{d}t \]

so that integrating term by term gives:

\[\begin{align} \ln(1+x) &= x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \dots\\ &= x + O(x^2) \end{align}\]

Dividing by \(\ln 2\) shows that \(\log_2(1 + x)\) is roughly linear for small \(x\); moreover, \(\log_2(1 + v)\) and \(v\) agree exactly at \(v = 0\) and \(v = 1\), the endpoints of our interval. So we can approximate \(\log_2(1 + m_x) \approx m_x + \sigma\), where \(\sigma\) is a small constant that serves as a correction for the error in between. Together we have the approximation:

\[ \log_2(x) \approx e_x + m_x + \sigma \]

Using this approximation, we can rewrite the 32-bit representation that we found above:

\[\begin{align} I_x &= E_x L + M_x\\ &= (e_x + B)L + (m_x \times L)\\ &= L(e_x + B + m_x)\\ &= L(e_x + m_x + \sigma + B - \sigma)\\ &= L(e_x + m_x + \sigma) + L(B - \sigma)\\ &\approx L \log_2(x) + L(B - \sigma) \end{align}\]

so that

\[ \log_2(x) \approx \frac{I_x}{L} - (B - \sigma) \]
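As a quick sanity check, take the earlier example \(x = 0.15625\) (the arithmetic here is mine, ignoring \(\sigma\) for the moment):

\[\begin{align} \frac{I_x}{L} - B &= (124 + 0.25) - 127 = -2.75\\ \log_2(0.15625) &= -3 + \log_2(1.25) \approx -2.678 \end{align}\]

The crude estimate is already within about \(0.07\) of the true value, and \(\sigma\) is there to absorb part of this remaining gap.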

So getting back to the original problem, let \(y = \frac{1}{\sqrt{x}}\). Then we have:

\[\begin{align} \log_2(y) &= \log_2\left( \frac{1}{\sqrt{x}} \right) \\ &= -\frac{1}{2} \log_2(x)\\ \frac{I_y}{L} - (B - \sigma) & \approx -\frac{1}{2} \left( \frac{I_x}{L} - (B - \sigma) \right)\\ \frac{I_y}{L} & \approx \frac{1}{2}(B - \sigma) + (B - \sigma) - \frac{1}{2} \frac{I_x}{L}\\ I_y & \approx \frac{3}{2} L (B - \sigma) - \frac{1}{2} I_x \end{align}\]

Taking 0x5f3759df \(= \frac{3}{2} L (B - \sigma)\), this is exactly line 10: i = 0x5f3759df - ( i >> 1 );
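Running this in reverse (my own arithmetic) tells us which \(\sigma\) the Quake constant implicitly corresponds to, since 0x5f3759df \(= 1597463007\):

\[ \sigma = B - \frac{2}{3L} \times 1597463007 = 127 - \frac{3194926014}{25165824} \approx 0.045 \]

We return to the question of how to choose this constant well in the last section.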

Newton’s Method

All of the above gave us an approximation for the value \(y = \frac{1}{\sqrt{x}}\). The final step is to refine this initial guess using Newton’s method, which uses derivatives to approximate the roots of an equation. Suppose that we have a function \(f(y)\) and we start with an initial guess \(y_0\) for a root of our equation.

Letting our root be \(r = y_0 + h\) for some small \(h\), we can use the derivative as a linear approximation and have:

\[\begin{align} 0 &= f(y_0 + h) \\ &\approx f(y_0) + h f'(y_0)\\ h &\approx - \frac{f(y_0)}{f'(y_0)}\\ y_0 + h &\approx y_0 - \frac{f(y_0)}{f'(y_0)} \end{align}\]

In the context of finding the reciprocal square root of \(x\) we have the equation:

\[ 0 = f(y) = \frac{1}{y^2} - x \]

Using as our initial guess the value \(y_0\) whose integer representation is \(I_{y_0} = \frac{3}{2} L (B - \sigma) - \frac{1}{2} I_x\), reinterpreted as a floating-point number (line 11), we have the approximation:

\[\begin{align} y_0 + h &\approx y_0 - \frac{f(y_0)}{f'(y_0)}\\ y_0 + h &\approx y_0 - \frac{\frac{1}{y_0^2} - x}{-\frac{2}{y_0^3}}\\ y_0 + h &\approx y_0 + \left(\frac{1}{y_0^2} - x\right)\frac{y_0^3}{2}\\ y_0 + h &\approx y_0 + \frac{y_0}{2} - x\frac{y_0^3}{2}\\ y_0 + h &\approx \frac{3}{2}y_0 - x\frac{y_0^3}{2}\\ y_0 + h &\approx y_0\left(\frac{3}{2} - x\frac{y_0^2}{2}\right) \end{align}\]

which corresponds to line 12: y = y * ( threehalfs - ( x2 * y * y ) );
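To see how much that single iteration buys, here is a small sketch of my own (not from Quake) that prints the raw bit-hack guess, the result after one Newton step, the result after a second step, and the value of 1.0f / sqrtf(x) for comparison:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float x  = 0.15625f;
    float x2 = x * 0.5f;

    uint32_t i;
    float y = x;
    memcpy(&i, &y, sizeof i);           /* bit-level reinterpretation */
    i = 0x5f3759df - (i >> 1);          /* the magic initial guess */
    memcpy(&y, &i, sizeof y);

    printf("initial guess     : %.7f\n", y);
    y = y * (1.5f - x2 * y * y);        /* 1st Newton iteration */
    printf("after 1 iteration : %.7f\n", y);
    y = y * (1.5f - x2 * y * y);        /* 2nd Newton iteration */
    printf("after 2 iterations: %.7f\n", y);
    printf("1/sqrtf(x)        : %.7f\n", 1.0f / sqrtf(x));
    return 0;
}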

Selecting an Optimal Constant \(\sigma\)

Note that at this point we haven’t derived the value of \(\sigma\), which determines the constant used in the function. How should we do so? One interesting observation is that there are only finitely many floats, so we can test them all.

I found this paper, which does exactly that: it tries every positive float as a candidate constant. I highly suggest giving it a read; it is an amazing thesis that gives a clear and complete account of the entire algorithm. I will give a brief summary here, but suggest you simply read straight from the source.

First, the author shows how i = 0x5f3759df - ( i >> 1 ); can be rewritten in terms of the mantissa and exponent of i, with three cases corresponding to odd exponents, small even exponents, and large even exponents. Here small/large refers to whether subtracting the shifted mantissa borrows from the exponent field. The code below tests the author’s derivation against every positive floating-point number:

#include <assert.h>

#define S 190     // exponent field of 0x5f3759df
#define T 3627487 // mantissa field of 0x5f3759df

int G(int w) {
    int E = (w >> 23) & 0xff; // extract E field
    int M = w & 0x7fffff; // extract M field

    int a, b; // the fields of the return float

    if ((E & 1) == 1) { // E Odd
        a = S - 1 - (E>>1);
        b = (1<<23) + T - (1<<22) - (M>>1);
    } else if ((M>>1) <= T) { // E Even M Small
        a = S - (E>>1);
        b = T - (M>>1);
    } else { // E Even M Large
        a = S - 1 - (E>>1);
        b = (1<<23) + T - (M>>1);
    }

    assert(a == (a & 0xff));
    assert(b == (b & 0x7fffff));

    // put new fields back in word
    return (a << 23) | b;
}

void checkDiscrepancies() {
    unsigned int i;
    // test all positive floats; unsigned i avoids signed overflow when i passes INT_MAX
    for (i = 0x00000001u; i < 0x80000000u; i++)
        assert(G(i) == 0x5f3759df - (i >> 1));
}
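A minimal driver of my own, assuming it is appended to the same file as the code above (the loop visits every positive float, so it takes a little while to run):

#include <stdio.h>

int main(void) {
    checkDiscrepancies();               /* asserts the identity for every positive float */
    printf("identity holds for every positive float\n");
    return 0;
}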

Now we generalize this idea to an arbitrary constant. First, the exponent field of the constant is shown to be 190 (the S above), and the floating-point value of the constant minus our bit-shifted number can then be written, up to a power of two, as:

\[ y(x) = \left\{\begin{array}{ll} 1 + t - x/2, & E_0 = 0, \quad x \leq 2t\\ 1 + t/2 - x/4, & E_0 = 0, \quad x > 2t\\ \sqrt{2}\,(3/4 + t/2 - x/4), & E_0 = 1 \end{array}\right\} \]

where \(x\) is the mantissa of our input, \(t\) is the mantissa of the constant for which we are searching, and \(E_0\) is the parity of the input’s exponent field. The author then writes down the relative error as a function of \(t\) and shows that the mantissa of the optimal constant is a root of:

\[ 4t^6 + 36t^5 + 81t^4 - 216t^3 - 972t^2 - 2916t + 1458 = 0 \]

which gives

\[ t = 0.4327448899594431954685215869960103736198 \]

corresponding to the constant 0x5f37642f, different from the one in the Quake III code. However, this ignores the fact that we will be applying Newton’s method afterwards, which performs better when the initial guess slightly underestimates the true answer. The author derives a similar relation, but optimized for the error after one Newton iteration, which corresponds to

\[ t = 0.4324500847901426421787829374967964668614 \]

which gives the constant 0x5f375a86, for a maximum relative error of 0.0017512378. This is more accurate than the original, and the author continues by giving analogous constants for double and quadruple precision.
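As a rough way to reproduce this kind of comparison yourself, here is a sketch of my own that measures the worst relative error after one Newton iteration for a given magic constant. It only samples the positive normal floats rather than scanning all of them, so the numbers it prints are approximate:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* one pass of the algorithm with a configurable magic constant */
static float rsqrt_magic(float number, uint32_t magic) {
    float x2 = number * 0.5f, y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof i);
    i = magic - (i >> 1);
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - x2 * y * y);     /* one Newton iteration */
}

static double max_rel_error(uint32_t magic) {
    double worst = 0.0;
    /* sample the positive normal floats; the stride keeps the runtime short */
    for (uint32_t bits = 0x00800000u; bits < 0x7f800000u; bits += 999u) {
        float x;
        memcpy(&x, &bits, sizeof x);
        double exact = 1.0 / sqrt((double) x);
        double err = fabs(rsqrt_magic(x, magic) - exact) / exact;
        if (err > worst) worst = err;
    }
    return worst;
}

int main(void) {
    printf("0x5f3759df: %.10f\n", max_rel_error(0x5f3759dfu));
    printf("0x5f375a86: %.10f\n", max_rel_error(0x5f375a86u));
    return 0;
}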

Unfortunately, the exact method by which the constant used in Quake III was derived appears to be lost to time, though it seems to have involved some trial and error paired with theory. Some discussion of its authorship can be found here and here.