Floating point operations follow a remarkably simple process. For any numeric arguments, compute the mathematically exact result, then round it to fit the destination format.
It's the fast, economical execution of this perfectly logical process that poses challenges.
Of course, we don't really need unlimited precision in the intermediate result. We need just enough information to provide a normalized (if possible), correctly-rounded result.
This diagram illustrates one form of intermediate result in the 8-bit sample system.
  +-------------------+-------+
± | b • b b b b b b b | G R S |  * 2**k
  +-------------------+-------+
The string of b bits represents the 8 significant bits – the precision – of the destination value. The binary point here is positioned after the leading significant bit, but that's just a convention. The exponent k of the scale factor aligns the binary point to its position in the mathematical value.
Guard and Round are the next two bits beyond the rightmost b. The Sticky bit is the logical OR of all bits to the right of Round in the infinitely precise result – it provides the sense of unbounded precision. The name “sticky bit” came into common usage during the development of IEEE 754.
The sign and the exponent k of the scale factor \( 2^{k} \) follow from the normal rules of arithmetic, with no restrictions on k. During projection of the result to the destination, we'll see the roles these bits play.

Zero has no algebraic sign in mathematics, but if the destination supports a sign of zero, those rules come into play during projection.
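If you'd like to experiment, here is a small Python sketch of the projection's rounding step: round-to-nearest-even driven by the Guard, Round, and Sticky bits. The code and its names (`sig`, `g`, `r`, `s`) are our own illustration, not part of any standard library.

def round_nearest_even(sig, g, r, s):
    """Round an 8-bit significand (binary point after the leading bit)
    to nearest, ties to even, using the Guard, Round, and Sticky bits.
    Returns the rounded significand and an exponent adjustment."""
    # G marks the half-ulp position; R or S set means the discarded
    # tail extends strictly beyond it.
    if g and (r or s or (sig & 1)):   # above halfway, or a halfway tie
        sig += 1                      # with an odd significand: round up
    if sig == 0x100:                  # the round-up carried out of the
        sig >>= 1                     # top bit: renormalize and bump
        return sig, 1                 # the exponent by one
    return sig, 0

# 1.0000000 with G R S = 0 1 1: below halfway, so it rounds down.
assert round_nearest_even(0b10000000, 0, 1, 1) == (0b10000000, 0)
# 1.1111111 with G = 1: a tie with an odd significand rounds up
# and carries out of the significand.
assert round_nearest_even(0b11111111, 1, 0, 0) == (0b10000000, 1)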
add(x, y) and sub(x, y)
You can read about arithmetic on bytes in the Number Hall. Floating point add() and sub() are made complicated by the shifting required to align the binary points of the input values, and then the realignment during projection.
As a first example, let's compute \( 1 + 1 \). Depending on the operation and the signs of the operands, we either add magnitudes or subtract magnitudes. Adding magnitudes may carry out, as it does here, so we're careful not to drop the Sticky bit during the right shift.
  1 • 0 0 0 0 0 0 0
+ 1 • 0 0 0 0 0 0 0
-------------------
  +-------------------+-------+
1 | 0 • 0 0 0 0 0 0 0 | 0 0 0 |  with carry-out
  +-------------------+-------+

  +-------------------+-------+
  | 1 • 0 0 0 0 0 0 0 | 0 0 0 |  * 2
  +-------------------+-------+
Now let's try \( 1 + 2^{-99} \).
That lone 1 bit shifted wide right plays a role later, when the value is rounded.
  1 • 0 0 0 0 0 0 0
+ 0 • 0 0 0 0 0 0 0 0 0 . . . 0 0 0 1
-------------------------------------
  +-------------------+-------+
  | 1 • 0 0 0 0 0 0 0 | 0 0 1 |
  +-------------------+-------+
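Here is an illustrative Python sketch of that alignment shift (our notation, not the exhibit's): whatever falls off the right end is ORed into Sticky rather than lost.

def align(sig, shift):
    """Shift a 10-bit extended significand (8 bits plus Guard and
    Round) right by `shift` places; bits that fall off are ORed
    into the returned Sticky flag."""
    if shift >= 10:
        # The whole operand slides past Round: only Sticky survives.
        return 0, (1 if sig else 0)
    sticky = 1 if sig & ((1 << shift) - 1) else 0
    return sig >> shift, sticky

# Aligning 1.0000000 for the sum 1 + 2**-99: the 99-bit shift leaves
# nothing behind but the Sticky bit, as in the diagram above.
print(align(0b1000000000, 99))   # -> (0, 1)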
Cancellation is the word used to describe the effect of subtracting two nearly equal magnitudes. Consider the difference \( 1 - ( 1 - 2^{-8} ) \) of two adjacent values in our 8-bit number system.
  1 • 0 0 0 0 0 0 0
- 0 • 1 1 1 1 1 1 1 1
---------------------
  +-------------------+-------+
  | 0 • 0 0 0 0 0 0 0 | 1 0 0 |
  +-------------------+-------+
The result requires an 8-bit left shift to be normalized. The left shift during cancellation sometimes magnifies tiny rounding errors, compromising a calculation. So the possible hazards of cancellation must be considered in any analysis of a numerical algorithm.
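The hazard is easy to reproduce in ordinary double precision. In this classic example (ours, not the exhibit's), the naive form loses every significant bit, while an algebraically equivalent form avoids the subtraction entirely.

import math

x = 1e-8
# Mathematically, 1 - cos(x) = x**2/2 - x**4/24 + ... ~ 5.0e-17.
naive = 1.0 - math.cos(x)            # cos(x) rounds to exactly 1.0,
print(naive)                         # so every bit cancels: 0.0
better = 2.0 * math.sin(x / 2) ** 2  # same value, no cancellation
print(better)                        # ~5.0e-17, correct to full precision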
Magnitude subtraction of normalized numbers with the same precision has interesting special cases. Can you see that if the input values are more than one bit out of alignment, then the left shift to renormalize is either zero or one bit?
Answer: Here is the worst-case example of the smallest normalized magnitude minus the largest magnitude shifted right two bits. The result requires just a one-bit left shift before rounding.
  1 • 0 0 0 0 0 0 0
- 0 • 0 1 1 1 1 1 1 1 1
-----------------------
  +-------------------+-------+
  | 0 • 1 0 0 0 0 0 0 | 0 1 0 |
  +-------------------+-------+
Can you also see that when a larger shift is required, the result will require no rounding? All of its rightmost bits are zero.
Answer: When the two magnitudes align, the Guard, Round, and Sticky bits are all zero — there is nothing to round. When the two magnitudes are aligned by a one-bit shift, normalization is required if there is a borrow from the leftmost bit.
  1 • a a a a a a a
- 0 • 1 b b b b b b b
---------------------
  +-------------------+-------+
  | 0 • c c c c c c c | c 0 0 |
  +-------------------+-------+
In this case, the normalization will leave the Guard bit zero, so again the result is exact.
mul(x, y)
Floating point multiplication is a matter of computing the product of the two 8-bit inputs, then gathering the lowest bits into Sticky. The multiplication can be seen as the binary flavor of everyday long multiplication.
An example like \( ( 1 + 2^{-7} ) * ( 1 + 2^{-7} ) \) spares us a blizzard of 1 bits. The exact product is \( 1 + 2^{-6} + 2^{-14} \), where the tiny third term is collected into Sticky.
    1 • 0 0 0 0 0 0 1
  * 1 • 0 0 0 0 0 0 1
  -------------------
  0 0 • 0 0 0 0 0 0 1 0 0 0 0 0 0 1
        ...lines of zeros...
+ 0 1 • 0 0 0 0 0 0 1
-----------------------------------
  0 1 • 0 0 0 0 0 1 0 0 0 0 0 0 0 1

  +-------------------+-------+
  | 0 • 1 0 0 0 0 0 1 | 0 0 1 |  * 2
  +-------------------+-------+
The double-wide product may need to be shifted right one bit to align the binary point after the most significant bit. This introduces an extra factor of \( 2 \), which can be added into the exponent k.
This case fills out the parallelogram with 1 bits, but it is tedious to carry out by hand.
  1 • 1 1 1 1 1 1 1
* 1 • 1 1 1 1 1 1 1
-----------------------------------
1 1 • 1 1 1 1 1 0 0 0 0 0 0 0 0 1

  +-------------------+-------+
  | 1 • 1 1 1 1 1 1 0 | 0 0 1 |  * 2
  +-------------------+-------+
It's easier to see in mathematical form \( ( 2 - 2^{-7} ) * ( 2 - 2^{-7} ) \), where it simplifies to \( 4 - 2^{-5} + 2^{-14} \), and then normalizes to \( (2 - 2^{-6} + 2^{-15}) \times 2 \).
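If you want to experiment, here is a sketch of the whole multiplication step in Python (our own illustration, with invented names). It normalizes the significand directly rather than carrying a leading-zero case as a separate factor of 2, but it reproduces the value of both products worked above.

def mul_sig(x, y):
    """Multiply two normalized 8-bit significands (values x/128 and
    y/128 in [1, 2)). Returns the normalized significand, Guard,
    Round, Sticky, and the exponent adjustment from the post-shift."""
    prod = x * y          # double-wide: 16 bits, point after 2 bits
    if prod & 0x8000:     # product in [2, 4): shift right one bit,
        exp_adj = 1       # folding the factor of 2 into the exponent
    else:                 # product in [1, 2): drop the leading zero
        prod <<= 1
        exp_adj = 0
    sig = prod >> 8                 # 8 significant bits
    g, r = (prod >> 7) & 1, (prod >> 6) & 1
    s = 1 if prod & 0x3F else 0     # all lower bits OR into Sticky
    return sig, g, r, s, exp_adj

# (1 + 2**-7) * (1 + 2**-7): significand 1.0000010, Sticky set.
print(mul_sig(0b10000001, 0b10000001))   # -> (130, 0, 0, 1, 0)
# (2 - 2**-7) * (2 - 2**-7): matches the diagram above.
print(mul_sig(0b11111111, 0b11111111))   # -> (254, 0, 0, 1, 1)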
div(x, y)
Floating point division can likewise be carried out with the binary version of pencil-and-paper long division. Fast implementations, including the famous Pentium divide instruction, are discussed outside this booklet.
The idea is to compute the quotient out to at least the Round bit. Then the Sticky bit is 1 if there is any nonzero remainder.
A zero divisor is one of the Special Cases that fall outside simple arithmetic. Visit that page for details.
Let's look at \( ( 1 + 2^{-7} ) / ( 1 - 2^{-8} ) \), the number just bigger than \( 1 \) divided by the number just less than \( 1 \).
The quotient is \[ ( 1 + 2^{-7} ) * ( 1 + 2^{-8} + 2^{-16} + 2^{-24} + \cdots ) \] which simplifies to \( 1 + 2^{-7} + 2^{-8} + 2^{-15} + 2^{-16} + \cdots \), with all the trailing terms absorbed into Sticky.
Most of the quotient bits are zero, so it's not too painful to carry out this division by hand. In any case, the ideal result appears as
+-------------------+-------+
| 1 • 0 0 0 0 0 0 1 | 1 0 1 |
+-------------------+-------+
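Our Python sketch of the long division below (the names are ours) reproduces this intermediate result. The divisor \( 1 - 2^{-8} \) is stored as the significand 1.1111111 with exponent \( -1 \); exponents are combined outside the routine. Note the one-bit pre-shift when the dividend's significand is smaller than the divisor's.

def div_sig(x, y):
    """Divide two normalized 8-bit significands (values x/128 and
    y/128 in [1, 2)) by schoolbook binary long division, producing
    the quotient out to the Round bit; Sticky reports any remainder."""
    rem, exp_adj = x, 0
    if rem < y:          # quotient below 1: pre-shift one bit and
        rem <<= 1        # absorb the factor of 1/2 into the exponent
        exp_adj = -1
    q = 0
    for _ in range(10):  # leading bit, 7 more, then Guard and Round
        q <<= 1
        if rem >= y:
            rem -= y
            q |= 1
        rem <<= 1
    sticky = 1 if rem else 0
    sig, g, r = q >> 2, (q >> 1) & 1, q & 1
    return sig, g, r, sticky, exp_adj

# (1 + 2**-7) / (1 - 2**-8): significand 1.0000001 with G R S = 1 0 1,
# matching the ideal result shown above.
print(div_sig(0b10000001, 0b11111111))   # -> (129, 1, 0, 1, -1)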
sqrt(x) and remainder(x, y)
The pencil-and-paper technique for extracting a square root applies readily to binary arithmetic, as shown in the Number Hall.
As with division, the idea is to compute the root out to at least the Round bit. Then the Sticky bit is 1 if the radicand has not been reduced to zero.
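Here is our sketch of the digit-by-digit method in Python (all names are illustrative). The helper computes an integer square root with remainder; the wrapper scales a normalized 8-bit significand so the root comes out to exactly the Round bit.

def isqrt_rem(n):
    """Bit-pair-at-a-time integer square root, the binary flavor of
    the pencil-and-paper method: returns (root, rem) with
    root == floor(sqrt(n)) and root**2 + rem == n."""
    root, rem = 0, 0
    shift = (n.bit_length() + 1) & ~1          # even number of bits
    while shift > 0:
        shift -= 2
        rem = (rem << 2) | ((n >> shift) & 3)  # bring down two bits
        trial = (root << 2) | 1                # try appending a 1
        root <<= 1
        if rem >= trial:
            rem -= trial
            root |= 1
    return root, rem

def sqrt_sig(x, e):
    """Square root of (x/128) * 2**e for a normalized significand x.
    Computes the root out to the Round bit; Sticky is 1 iff the
    radicand has not been reduced to zero."""
    if e & 1:           # make the exponent even so the result's
        x <<= 1         # exponent is simply e // 2
        e -= 1
    # A 10-bit root (leading bit, 7 more, Guard, Round) needs the
    # radicand scaled by 2**11, since root**2 ~ x * 2**11.
    root, rem = isqrt_rem(x << 11)
    sticky = 1 if rem else 0
    sig, g, r = root >> 2, (root >> 1) & 1, root & 1
    return sig, g, r, sticky, e // 2

print(sqrt_sig(0b10000000, 0))   # sqrt(1) -> (128, 0, 0, 0, 0): exact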
The remainder() operation is different. Its intermediate value is always exact, but the number of integer quotient bits computed in order to arrive at the remainder may be huge.
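Python's math.remainder implements the IEEE 754 operation, so the contrast is easy to see first-hand (our example, not the exhibit's):

import math

# remainder(x, y) = x - n*y, with n the integer nearest x/y.
# The result is exact, even though n itself may be enormous.
print(math.remainder(5.0, 2.0))    # -> 1.0   (n = 2)
print(math.remainder(7.0, 2.0))    # -> -1.0  (n = 4: ties pick even n)
print(math.remainder(1e300, 3.0))  # exact, though n has ~1000 bits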
The previous sections have shown how the computational model presented here, with its ideal intermediate result, is mathematically feasible for the most common operations.
The situation for the transcendental functions is more complicated, so the requirement for correctly-rounded results rarely applies. Computing extra-precise intermediate values may be too expensive. In any case, there is the Table-Maker's Dilemma: near half-way cases will arise, requiring many extra significant bits to resolve. (See two papers by W. Kahan in the Library, for example.)