The previous pages have brought us an intermediate result of the form
  +-------------------+-------+
± | b • b b b b b b b | G R S | * 2**k
  +-------------------+-------+
At this point, we treat the Sticky bit S as just a trailing significant bit, despite its exotic definition.
The goal is to fit the value into the normal form
  +-------------------+
± | 1 • b b b b b b b | * 2**e
  +-------------------+
if possible, where e is constrained by \( e_{min} \leq e \leq e_{max} \) given the limitations of the exponent range of the destination.
We must turn to the expanded rules of computer arithmetic when we depart from everyday mathematics.
If all of the significant bits are 0, including G, R, and S, then the result is zero.
Because zero falls outside the parallelogram of
normal values of the number system, it fits
into the special cases of the next page.
With the goal of fitting a nonzero result into
the parallelogram, a first logical step is to
normalize it. If the leading bit b is 1, the value is normalized. Otherwise, it has the form

  +-------------------+-------+
± | 0 • 0 0 0 1 c d e | G R S | * 2**k
  +-------------------+-------+

This example has four leading 0 bits.
To normalize, shift all the bits through Round, decrementing the exponent k for each shift. Leave Sticky in place. For the example shown, the result is:

  +-------------------+-------+
± | 1 • c d e G R 0 0 | 0 0 S | * 2**(k - 4)
  +-------------------+-------+

Care is taken never to promote nonzero Sticky by shifting it left into the significant bits.
However, it is only in the case of subtraction of nearly identical magnitudes in add() and sub() that a normalization shift of more than one bit can arise. In that case S is guaranteed to be zero. Left shifts of Sticky are not a problem in practice.
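As a sketch of this normalization loop, suppose the significand b•bbbbbbb is modeled as an 8-bit integer (leading bit in position 7) with Guard, Round, and Sticky held separately; the function name and representation below are illustrative, not from the text:

```python
def normalize(sig, g, r, s, k):
    """Shift the significant bits left through Round, decrementing the
    exponent k for each shift. Sticky (s) is left in place."""
    assert (sig, g, r, s) != (0, 0, 0, 0), "zero results are filtered out earlier"
    while not (sig & 0x80):            # leading bit of b.bbbbbbb is still 0
        sig = ((sig << 1) | g) & 0xFF  # Guard shifts into the lowest b
        g, r = r, 0                    # Round moves into Guard; 0 fills Round
        k -= 1                         # compensate with the exponent
    return sig, g, r, s, k

# The text's example with c=1, d=0, e=1 and G=R=S=1:
# ± 0.0001101 | 1 1 1 * 2**k  normalizes to  ± 1.1011100 | 0 0 1 * 2**(k-4)
```

Note that Sticky never moves: only Guard and Round are shifted left into the significant bits.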
After filtering zero results and normalizing, we have a value of the form

  +-------------------+-------+
± | 1 • b b b b b b b | G R S | * 2**k
  +-------------------+-------+
The traditional approach was to round the value down to eight significant bits and then check for k out of range. In the IEEE 754 era, we exploit the bottom right corner of the parallelogram of values by first checking for \( k < e_{min} \) and denormalizing if necessary.
If k is less than the minimum normal exponent, we have exponent underflow. Shift the significant bits right, adding \( 1 \) to k for each bit shifted, until it reaches the minimum normal exponent. Zero bits shift in from the left, and bits shifted off the right are logically OR-ed into S.
Underflow may result in a value like this

  +-------------------+-------+
± | 0 • 0 0 0 0 1 b b | G R S | * 2**e_min
  +-------------------+-------+

or, in an extreme case, all the significant bits may be shifted into S:

  +-------------------+-------+
± | 0 • 0 0 0 0 0 0 0 | 0 0 1 | * 2**e_min
  +-------------------+-------+

The rounding rules still apply.
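Continuing the same illustrative 8-bit model (a sketch, not the text's implementation), the denormalization loop might look like this, with e_min passed in as a parameter:

```python
def denormalize(sig, g, r, s, k, e_min):
    """While k is below the minimum normal exponent, shift the significant
    bits right, adding 1 to k per shift. Zeros enter from the left; bits
    shifted off the right are OR-ed into Sticky."""
    while k < e_min:
        s |= r           # the bit falling off the end joins Sticky
        r = g
        g = sig & 1      # the lowest-order b becomes the new Guard
        sig >>= 1        # a 0 enters from the left
        k += 1
    return sig, g, r, s, k
```

With k ten below e_min, a lone leading 1 ends up entirely in Sticky, matching the extreme case shown above; a k already in range passes through unchanged.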
After possible denormalization, the value looks like this:

  +-------------------+-------+
± | b • b b b b b b b | G R S | * 2**k
  +-------------------+-------+
In the mainframe era of the 1960s, most computers would simply truncate the result by treating all bits from G rightward as zero. Some minicomputers of the 1970s would add \( 1 \) to the G bit and truncate that result. Visit the Dinosaur Gallery for further information (when it appears).
IEEE 754 ushered in a whole new era by leveraging the power of the new microprocessors to support four different kinds of rounding.
Here is what we mean by rounding. Given a value

  +-------------------+-------+
± | b • b b b b b b b | G R S |
  +-------------------+-------+

regardless of sign, we round up (in magnitude) by adding \( 1 \) into the lowest-order b.
If every b is a 1, then there is a carry out of the leading 1, so the significant bits must be right-shifted one place and the exponent k incremented. We round down by taking no action on the significant bits. In either case, we discard G, R, and S after rounding.
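In the same illustrative 8-bit model, the two primitive actions reduce to a few lines; discarding G, R, and S is modeled by simply not passing them along:

```python
def round_up(sig, k):
    """Add 1 into the lowest-order b; a carry out of the leading 1 forces a
    one-place right shift of the significand and an exponent increment."""
    sig += 1
    if sig == 0x100:     # every b was 1: 1.1111111 + ulp carries out
        sig = 0x80       # renormalize to 1.0000000
        k += 1
    return sig, k

def round_down(sig, k):
    """Take no action on the significant bits."""
    return sig, k
```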
The four rounding modes decide between these two actions as follows:

Round toward Zero: always round down.

Round toward \( + \infty \): if the sign is + and any of G, R, and S is nonzero, then round up; otherwise, round down.

Round toward \( - \infty \): if the sign is − and any of G, R, and S is nonzero, then round up; otherwise, round down.

Round to Nearest: if G is 1 and either (a) R or S is 1 or (b) the lowest-order b is 1, then round up; otherwise, round down.
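The four round-up decisions can be gathered into one predicate. This is a sketch; the mode strings, the sign convention, and the function name are invented for illustration:

```python
def rounds_up(mode, sign, sig, g, r, s):
    """True if the value should be rounded up in magnitude. sign is '+' or
    '-'; sig holds b.bbbbbbb as an 8-bit integer, so sig & 1 is the
    lowest-order b."""
    if mode == "toward-zero":
        return False                          # always round down
    if mode == "toward-plus-infinity":
        return sign == '+' and (g | r | s) != 0
    if mode == "toward-minus-infinity":
        return sign == '-' and (g | r | s) != 0
    if mode == "to-nearest":
        return g == 1 and (r == 1 or s == 1 or (sig & 1) == 1)
    raise ValueError("unknown rounding mode: " + mode)
```

For Round to Nearest, this predicate reproduces the case table given later in the text.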
Despite the technical language, the concepts are simple.
Let's think of G, R, and S as the rounding bits. If they're all zero, the result needs no rounding.
If the rounding bits are nonzero, then the intermediate value lies between representable values in the parallelogram. Rounding toward Zero, toward \( + \infty \), or toward \( - \infty \) chooses the representable value in the relevant direction.
Round to Nearest chooses the nearer of the two adjacent
representable values. We say this rounding is unbiased
because it breaks ties on the basis of the bit just left
of G. It's reasonable to expect that bit to be 0 or 1 with equal probability.
It's helpful to list the cases:
  ± | b • b b b b b b b | 0 R S    round down
  ± | b • b b b b b b b | 1 1 S    round up
  ± | b • b b b b b b b | 1 R 1    round up
  ± | b • b b b b b b 0 | 1 0 0    round down
  ± | b • b b b b b b 1 | 1 0 0    round up
The last two cases illustrate an intermediate result halfway between two representable values. The rounding is unbiased in that it rounds up only if the least significant bit, just to the left, is one.
After rounding, we have a value

  +-------------------+
± | b • b b b b b b b | * 2**k
  +-------------------+

with k no less than the smallest normal exponent. If k is larger than the largest normal exponent, exponent overflow arises. The magnitude is too large to represent.
Overflow is an extraordinary circumstance, in which the delivered result provides no more than an upper or lower bound on the magnitude of the mathematical result. The IEEE 754 approach to overflow is to deliver either \( \infty \) or the maximum normal number, carrying the sign of the result, with the choice depending on the rounding mode and the sign.
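The mode-by-mode choice on overflow might be tabulated as follows; this is a sketch using placeholder mode names and symbolic magnitudes rather than any particular format's encodings:

```python
def overflow_result(mode, sign):
    """On exponent overflow, choose between infinity and the maximum normal
    number, carrying the sign of the result, per rounding mode."""
    if mode == "to-nearest":
        mag = "inf"                                   # overflow rounds away to infinity
    elif mode == "toward-zero":
        mag = "max-normal"                            # never exceeds the format
    elif mode == "toward-plus-infinity":
        mag = "inf" if sign == '+' else "max-normal"  # only + grows toward +inf
    elif mode == "toward-minus-infinity":
        mag = "inf" if sign == '-' else "max-normal"  # only - grows toward -inf
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return sign, mag
```

Either way, the delivered value is only a bound on the true magnitude, which is why overflow is also signaled as an exception.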