Floating point operations follow a remarkably simple process. For any numeric arguments, compute the mathematically exact result, then round it to fit the destination format.
It's the fast, economical execution of this perfectly logical process that poses challenges.
Of course, we don't really need unlimited precision in the intermediate result. We need just enough information to provide a normalized (if possible), correctly-rounded result.
This diagram illustrates one form of intermediate result in the 8-bit sample system.
  +-------------------+-------+
± | b • b b b b b b b | G R S |  * 2**k
  +-------------------+-------+
The string of b bits represents the 8 significant bits – the precision – of the destination value. The binary point here is positioned after the leading significant bit, but that's just a convention. The exponent k of the scale factor aligns the binary point to its position in the mathematical value.
Guard and Round are the next two bits beyond the rightmost b. The Sticky bit is the logical OR of all bits to the right of Round in the infinitely precise result – it provides the sense of unbounded precision. The name “sticky bit” came into common usage during the development of IEEE 754.
The sign and the exponent k of the scale factor \( 2^{k} \) follow from the normal rules of arithmetic, with no restrictions on k. During projection of the result to the destination, we'll see the roles these bits play.

Zero has no algebraic sign in mathematics, but if the destination supports a sign of zero, those rules come into play during projection.
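If you'd like to experiment, here is a small Python sketch of the projection's rounding step: round-to-nearest-even driven by the Guard, Round, and Sticky bits. The code and its names (`sig`, `g`, `r`, `s`) are our own illustration, not part of any standard library.

def round_nearest_even(sig, g, r, s):
    """Round an 8-bit significand (binary point after the leading bit)
    to nearest, ties to even, using the Guard, Round, and Sticky bits.
    Returns the rounded significand and an exponent adjustment."""
    # G marks the half-ulp position; R or S set means the discarded
    # tail extends strictly beyond it.
    if g and (r or s or (sig & 1)):   # above halfway, or a halfway tie
        sig += 1                      # with an odd significand: round up
    if sig == 0x100:                  # the round-up carried out of the
        sig >>= 1                     # top bit: renormalize and bump
        return sig, 1                 # the exponent by one
    return sig, 0

# 1.0000000 with G R S = 0 1 1: below halfway, so it rounds down.
assert round_nearest_even(0b10000000, 0, 1, 1) == (0b10000000, 0)
# 1.1111111 with G = 1: a tie with an odd significand rounds up
# and carries out of the significand.
assert round_nearest_even(0b11111111, 1, 0, 0) == (0b10000000, 1)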
add(x, y) and sub(x, y)
You can read about arithmetic on bytes in the Number Hall. Floating point add() and sub() are made complicated by the shifting required to align the binary points of the input values, and then the realignment during projection.
As a first example, let's compute \( 1 + 1 \). Depending on the operation and the signs of the operands, we either add magnitudes or subtract magnitudes. Adding magnitudes may carry out, as it does here, so we're careful not to drop the Sticky bit during the right shift.
  1 • 0 0 0 0 0 0 0
+ 1 • 0 0 0 0 0 0 0
-------------------
  +-------------------+-------+
1 | 0 • 0 0 0 0 0 0 0 | 0 0 0 |  with carry-out
  +-------------------+-------+

  +-------------------+-------+
  | 1 • 0 0 0 0 0 0 0 | 0 0 0 |  * 2
  +-------------------+-------+
Now let's try \( 1 + 2^{-99} \).
That lone 1 bit shifted wide right plays a role later, when the value is rounded.
  1 • 0 0 0 0 0 0 0
+ 0 • 0 0 0 0 0 0 0 0 0 . . . 0 0 0 1
-------------------------------------
  +-------------------+-------+
  | 1 • 0 0 0 0 0 0 0 | 0 0 1 |
  +-------------------+-------+
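Here is an illustrative Python sketch of that alignment shift (our notation, not the exhibit's): whatever falls off the right end is ORed into Sticky rather than lost.

def align(sig, shift):
    """Shift a 10-bit extended significand (8 bits plus Guard and
    Round) right by `shift` places; bits that fall off are ORed
    into the returned Sticky flag."""
    if shift >= 10:
        # The whole operand slides past Round: only Sticky survives.
        return 0, (1 if sig else 0)
    sticky = 1 if sig & ((1 << shift) - 1) else 0
    return sig >> shift, sticky

# Aligning 1.0000000 for the sum 1 + 2**-99: the 99-bit shift leaves
# nothing behind but the Sticky bit, as in the diagram above.
print(align(0b1000000000, 99))   # -> (0, 1)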
Cancellation is the word used to describe the effect of subtracting two nearly equal magnitudes. Consider the difference \( 1 - ( 1 - 2^{-8} ) \) of two adjacent values in our 8-bit number system.
  1 • 0 0 0 0 0 0 0
- 0 • 1 1 1 1 1 1 1 1
---------------------
  +-------------------+-------+
  | 0 • 0 0 0 0 0 0 0 | 1 0 0 |
  +-------------------+-------+
The result requires an 8-bit left shift to be normalized. The left shift during cancellation sometimes magnifies tiny rounding errors, compromising a calculation. So the possible hazards of cancellation must be considered in any analysis of a numerical algorithm.
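The hazard is easy to reproduce in ordinary double precision. In this classic example (ours, not the exhibit's), the naive form loses every significant bit, while an algebraically equivalent form avoids the subtraction entirely.

import math

x = 1e-8
# Mathematically, 1 - cos(x) = x**2/2 - x**4/24 + ... ~ 5.0e-17.
naive = 1.0 - math.cos(x)            # cos(x) rounds to exactly 1.0,
print(naive)                         # so every bit cancels: 0.0
better = 2.0 * math.sin(x / 2) ** 2  # same value, no cancellation
print(better)                        # ~5.0e-17, correct to full precision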
Magnitude subtraction of normalized numbers with the same precision has interesting special cases. Can you see that if the input values are more than one bit out of alignment, then the left shift to renormalize is either zero or one bit?
Answer: Here is the worst-case example of the smallest normalized magnitude minus the largest magnitude shifted right two bits. The result requires just a one-bit left shift before rounding.
  1 • 0 0 0 0 0 0 0
- 0 • 0 1 1 1 1 1 1 1 1
-----------------------
  +-------------------+-------+
  | 0 • 1 0 0 0 0 0 0 | 0 1 0 |
  +-------------------+-------+
Can you also see that when a larger shift is required, the result will require no rounding? All of its rightmost bits are zero.
Answer: When the two magnitudes align, the Guard, Round, and Sticky bits are all zero — there is nothing to round. When the two magnitudes are aligned by a one-bit shift, normalization is required if there is a borrow from the leftmost bit.
  1 • a a a a a a a
- 0 • 1 b b b b b b b
---------------------
  +-------------------+-------+
  | 0 • c c c c c c c | c 0 0 |
  +-------------------+-------+
In this case, the normalization will leave the Guard bit zero, so again the result is exact.
mul(x, y)
Floating point multiplication is a matter of computing the product of the two 8-bit inputs, then gathering the lowest bits into Sticky. The multiplication can be seen as the binary flavor of everyday long multiplication.
An example like \( ( 1 + 2^{-7} ) * ( 1 + 2^{-7} ) \) spares us a blizzard of 1 bits. The exact product is \( 1 + 2^{-6} + 2^{-14} \), where the tiny third term is collected into Sticky.
    1 • 0 0 0 0 0 0 1
  * 1 • 0 0 0 0 0 0 1
  -------------------
  0 0 • 0 0 0 0 0 0 1 0 0 0 0 0 0 1
        ...lines of zeros...
+ 0 1 • 0 0 0 0 0 0 1
-----------------------------------
  0 1 • 0 0 0 0 0 1 0 0 0 0 0 0 0 1

  +-------------------+-------+
  | 0 • 1 0 0 0 0 0 1 | 0 0 1 |  * 2
  +-------------------+-------+
The double-wide product may need to be shifted right one bit to align the binary point after the most significant bit. This introduces an extra factor of \( 2 \), which can be added into the exponent k.
This case fills out the parallelogram with 1 bits, but it is tedious to carry out by hand.
  1 • 1 1 1 1 1 1 1
* 1 • 1 1 1 1 1 1 1
-----------------------------------
1 1 • 1 1 1 1 1 0 0 0 0 0 0 0 0 1

  +-------------------+-------+
  | 1 • 1 1 1 1 1 1 0 | 0 0 1 |  * 2
  +-------------------+-------+
It's easier to see in mathematical form \( ( 2 - 2^{-7} ) * ( 2 - 2^{-7} ) \), where it simplifies to \( 4 - 2^{-5} + 2^{-14} \), and then normalizes to \( (2 - 2^{-6} + 2^{-15}) \times 2 \).
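If you want to experiment, here is a sketch of the whole multiplication step in Python (our own illustration, with invented names). It normalizes the significand directly rather than carrying a leading-zero case as a separate factor of 2, but it reproduces the value of both products worked above.

def mul_sig(x, y):
    """Multiply two normalized 8-bit significands (values x/128 and
    y/128 in [1, 2)). Returns the normalized significand, Guard,
    Round, Sticky, and the exponent adjustment from the post-shift."""
    prod = x * y          # double-wide: 16 bits, point after 2 bits
    if prod & 0x8000:     # product in [2, 4): shift right one bit,
        exp_adj = 1       # folding the factor of 2 into the exponent
    else:                 # product in [1, 2): drop the leading zero
        prod <<= 1
        exp_adj = 0
    sig = prod >> 8                 # 8 significant bits
    g, r = (prod >> 7) & 1, (prod >> 6) & 1
    s = 1 if prod & 0x3F else 0     # all lower bits OR into Sticky
    return sig, g, r, s, exp_adj

# (1 + 2**-7) * (1 + 2**-7): significand 1.0000010, Sticky set.
print(mul_sig(0b10000001, 0b10000001))   # -> (130, 0, 0, 1, 0)
# (2 - 2**-7) * (2 - 2**-7): matches the diagram above.
print(mul_sig(0b11111111, 0b11111111))   # -> (254, 0, 0, 1, 1)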
div(x, y)
Floating point division can likewise be carried out with the binary version of pencil-and-paper long division. Fast implementations, including the famous Pentium divide instruction, are discussed outside this booklet.
The idea is to compute the quotient out to at least the Round bit. Then the Sticky bit is 1 if there is any nonzero remainder.
A zero divisor is one of the Special Cases that fall outside simple arithmetic. Visit that page for details.
Let's look at \( ( 1 + 2^{-7} ) / ( 1 - 2^{-8} ) \), the number just bigger than \( 1 \) divided by the number just less than \( 1 \).
The quotient is \[ ( 1 + 2^{-7} ) * ( 1 + 2^{-8} + 2^{-16} + 2^{-24} + \cdots ) \] which simplifies to \( 1 + 2^{-7} + 2^{-8} + 2^{-15} + 2^{-16} + \cdots \), with all the trailing terms absorbed into Sticky.
Most of the quotient bits are zero, so it's not too painful to carry out this division by hand. In any case, the ideal result appears as
+-------------------+-------+
| 1 • 0 0 0 0 0 0 1 | 1 0 1 |
+-------------------+-------+
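Our Python sketch of the long division below (the names are ours) reproduces this intermediate result. The divisor \( 1 - 2^{-8} \) is stored as the significand 1.1111111 with exponent \( -1 \); exponents are combined outside the routine. Note the one-bit pre-shift when the dividend's significand is smaller than the divisor's.

def div_sig(x, y):
    """Divide two normalized 8-bit significands (values x/128 and
    y/128 in [1, 2)) by schoolbook binary long division, producing
    the quotient out to the Round bit; Sticky reports any remainder."""
    rem, exp_adj = x, 0
    if rem < y:          # quotient below 1: pre-shift one bit and
        rem <<= 1        # absorb the factor of 1/2 into the exponent
        exp_adj = -1
    q = 0
    for _ in range(10):  # leading bit, 7 more, then Guard and Round
        q <<= 1
        if rem >= y:
            rem -= y
            q |= 1
        rem <<= 1
    sticky = 1 if rem else 0
    sig, g, r = q >> 2, (q >> 1) & 1, q & 1
    return sig, g, r, sticky, exp_adj

# (1 + 2**-7) / (1 - 2**-8): significand 1.0000001 with G R S = 1 0 1,
# matching the ideal result shown above.
print(div_sig(0b10000001, 0b11111111))   # -> (129, 1, 0, 1, -1)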
sqrt(x) and remainder(x, y)
The pencil-and-paper technique for extracting a square root applies readily to binary arithmetic, as shown in the Number Hall.
As with division, the idea is to compute the root out to at least the Round bit. Then the Sticky bit is 1 if the radicand has not been reduced to zero.
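Here is our sketch of the digit-by-digit method in Python (all names are illustrative). The helper computes an integer square root with remainder; the wrapper scales a normalized 8-bit significand so the root comes out to exactly the Round bit.

def isqrt_rem(n):
    """Bit-pair-at-a-time integer square root, the binary flavor of
    the pencil-and-paper method: returns (root, rem) with
    root == floor(sqrt(n)) and root**2 + rem == n."""
    root, rem = 0, 0
    shift = (n.bit_length() + 1) & ~1          # even number of bits
    while shift > 0:
        shift -= 2
        rem = (rem << 2) | ((n >> shift) & 3)  # bring down two bits
        trial = (root << 2) | 1                # try appending a 1
        root <<= 1
        if rem >= trial:
            rem -= trial
            root |= 1
    return root, rem

def sqrt_sig(x, e):
    """Square root of (x/128) * 2**e for a normalized significand x.
    Computes the root out to the Round bit; Sticky is 1 iff the
    radicand has not been reduced to zero."""
    if e & 1:           # make the exponent even so the result's
        x <<= 1         # exponent is simply e // 2
        e -= 1
    # A 10-bit root (leading bit, 7 more, Guard, Round) needs the
    # radicand scaled by 2**11, since root**2 ~ x * 2**11.
    root, rem = isqrt_rem(x << 11)
    sticky = 1 if rem else 0
    sig, g, r = root >> 2, (root >> 1) & 1, root & 1
    return sig, g, r, sticky, e // 2

print(sqrt_sig(0b10000000, 0))   # sqrt(1) -> (128, 0, 0, 0, 0): exact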
The remainder() operation is different. Its intermediate value is always exact, but the number of integer quotient bits computed in order to arrive at the remainder may be huge.
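Python's math.remainder implements the IEEE 754 operation, so the contrast is easy to see first-hand (our example, not the exhibit's):

import math

# remainder(x, y) = x - n*y, with n the integer nearest x/y.
# The result is exact, even though n itself may be enormous.
print(math.remainder(5.0, 2.0))    # -> 1.0   (n = 2)
print(math.remainder(7.0, 2.0))    # -> -1.0  (n = 4: ties pick even n)
print(math.remainder(1e300, 3.0))  # exact, though n has ~1000 bits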
The previous sections have shown how the computational model presented here, with its ideal intermediate result, is mathematically feasible for the most common operations.
The situation for the transcendental functions is more complicated, so the requirement for correctly-rounded results rarely applies. Computing extra-precise intermediate values may be too expensive. In any case, there is the Table-Maker's Dilemma: near half-way cases will arise, requiring many extra significant bits to resolve. (See two papers by W. Kahan in the Library, for example.)