The previous pages have brought us an intermediate result of the form
  +-------------------+-------+
± | b • b b b b b b b | G R S | * 2**k
  +-------------------+-------+
At this point, we treat the Sticky bit S as just a trailing significant bit, despite its exotic definition.
The goal is to fit the value into the normal form
  +-------------------+
± | 1 • b b b b b b b | * 2**e
  +-------------------+
if possible, where e is constrained by \( e_{min} \leq e \leq e_{max} \) given the limitations of the exponent range of the destination.
We must turn to the expanded rules of computer arithmetic when we depart from everyday mathematics.
If all of the significant bits are 0, including G, R, and S, then the result is zero.
Because zero falls outside the parallelogram of
normal values of the number system, it fits
into the special cases of the next page.
With the goal of fitting a nonzero result into
the parallelogram, a first logical step is to
normalize it. If the leading bit b is 1, the value is normalized. Otherwise, it has the form

  +-------------------+-------+
± | 0 • 0 0 0 1 c d e | G R S | * 2**k
  +-------------------+-------+

This example has four leading 0 bits.
To normalize, shift all the bits through Round, decrementing the exponent k for each shift. Leave Sticky in place. For the example shown, the result is:

  +-------------------+-------+
± | 1 • c d e G R 0 0 | 0 0 S | * 2**(k - 4)
  +-------------------+-------+

Care is taken never to promote nonzero Sticky by shifting it left into the significant bits.
However, it is only in the case of subtraction of nearly identical magnitudes in add() and sub() that a normalization shift of more than one bit can arise. In that case S is guaranteed to be zero. Left shifts of Sticky are not a problem in practice.
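As a sketch of this normalization loop, suppose the significand b•bbbbbbb is modeled as an 8-bit integer (leading bit in position 7) with Guard, Round, and Sticky held separately; the function name and representation below are illustrative, not from the text:

```python
def normalize(sig, g, r, s, k):
    """Shift the significant bits left through Round, decrementing the
    exponent k for each shift. Sticky (s) is left in place."""
    assert (sig, g, r, s) != (0, 0, 0, 0), "zero results are filtered out earlier"
    while not (sig & 0x80):            # leading bit of b.bbbbbbb is still 0
        sig = ((sig << 1) | g) & 0xFF  # Guard shifts into the lowest b
        g, r = r, 0                    # Round moves into Guard; 0 fills Round
        k -= 1                         # compensate with the exponent
    return sig, g, r, s, k

# The text's example with c=1, d=0, e=1 and G=R=S=1:
# ± 0.0001101 | 1 1 1 * 2**k  normalizes to  ± 1.1011100 | 0 0 1 * 2**(k-4)
```

Note that Sticky never moves: only Guard and Round are shifted left into the significant bits.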
After filtering zero results and normalizing, we have a value of the form

  +-------------------+-------+
± | 1 • b b b b b b b | G R S | * 2**k
  +-------------------+-------+
The traditional approach was to round the value down to eight significant bits and then check for k out of range. In the IEEE 754 era, we exploit the bottom right corner of the parallelogram of values by first checking for \( k < e_{min} \) and denormalizing if necessary.
If k is less than the minimum normal exponent, we have exponent underflow. Shift the significant bits right, adding \( 1 \) to k for each bit shifted, until it reaches the minimum normal exponent. Zero bits shift in from the left, and bits shifted off the right are logically OR-ed into S.
Underflow may result in a value like this

  +-------------------+-------+
± | 0 • 0 0 0 0 1 b b | G R S | * 2**e_min
  +-------------------+-------+

or, in an extreme case, all the significant bits may be shifted into S:

  +-------------------+-------+
± | 0 • 0 0 0 0 0 0 0 | 0 0 1 | * 2**e_min
  +-------------------+-------+

The rounding rules still apply.
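Continuing the same illustrative 8-bit model (a sketch, not the text's implementation), the denormalization loop might look like this, with e_min passed in as a parameter:

```python
def denormalize(sig, g, r, s, k, e_min):
    """While k is below the minimum normal exponent, shift the significant
    bits right, adding 1 to k per shift. Zeros enter from the left; bits
    shifted off the right are OR-ed into Sticky."""
    while k < e_min:
        s |= r           # the bit falling off the end joins Sticky
        r = g
        g = sig & 1      # the lowest-order b becomes the new Guard
        sig >>= 1        # a 0 enters from the left
        k += 1
    return sig, g, r, s, k
```

With k ten below e_min, a lone leading 1 ends up entirely in Sticky, matching the extreme case shown above; a k already in range passes through unchanged.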
After possible denormalization, the value looks like this:

  +-------------------+-------+
± | b • b b b b b b b | G R S | * 2**k
  +-------------------+-------+
In the mainframe era of the 1960s, most computers would simply truncate the result by treating all bits from G rightward as zero. Some minicomputers of the 1970s would add \( 1 \) to the G bit and truncate that result. Visit the Dinosaur Gallery for further information (when it appears).
IEEE 754 ushered in a whole new era by leveraging the power of the new microprocessors to support four different kinds of rounding.
Here is what we mean by rounding. Given a value

  +-------------------+-------+
± | b • b b b b b b b | G R S |
  +-------------------+-------+

regardless of sign, we round up (in magnitude) by adding \( 1 \) into the lowest-order b.
If every b is a 1, then there is a carry out of the leading 1, so the significant bits must be right-shifted one place and the exponent k incremented. We round down by taking no action on the significant bits. In either case, we discard G, R, and S after rounding.
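In the same illustrative 8-bit model, the two primitive actions reduce to a few lines; discarding G, R, and S is modeled by simply not passing them along:

```python
def round_up(sig, k):
    """Add 1 into the lowest-order b; a carry out of the leading 1 forces a
    one-place right shift of the significand and an exponent increment."""
    sig += 1
    if sig == 0x100:     # every b was 1: 1.1111111 + ulp carries out
        sig = 0x80       # renormalize to 1.0000000
        k += 1
    return sig, k

def round_down(sig, k):
    """Take no action on the significant bits."""
    return sig, k
```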
The four rounding modes decide between these two actions as follows:

Round toward Zero: always round down.

Round toward \( + \infty \): if the sign is + and any of G, R, and S is nonzero, then round up; otherwise, round down.

Round toward \( - \infty \): if the sign is − and any of G, R, and S is nonzero, then round up; otherwise, round down.

Round to Nearest: if G is 1 and either (a) R or S is 1 or (b) the lowest-order b is 1, then round up; otherwise, round down.
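The four round-up decisions can be gathered into one predicate. This is a sketch; the mode strings, the sign convention, and the function name are invented for illustration:

```python
def rounds_up(mode, sign, sig, g, r, s):
    """True if the value should be rounded up in magnitude. sign is '+' or
    '-'; sig holds b.bbbbbbb as an 8-bit integer, so sig & 1 is the
    lowest-order b."""
    if mode == "toward-zero":
        return False                          # always round down
    if mode == "toward-plus-infinity":
        return sign == '+' and (g | r | s) != 0
    if mode == "toward-minus-infinity":
        return sign == '-' and (g | r | s) != 0
    if mode == "to-nearest":
        return g == 1 and (r == 1 or s == 1 or (sig & 1) == 1)
    raise ValueError("unknown rounding mode: " + mode)
```

For Round to Nearest, this predicate reproduces the case table given later in the text.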
Despite the technical language, the concepts are simple.
Let's think of G, R, and S as the rounding bits. If they're all zero, the result needs no rounding.
If the rounding bits are nonzero, then the intermediate value lies between representable values in the parallelogram. Rounding toward Zero, toward \( + \infty \), or toward \( - \infty \) chooses the representable value in the relevant direction.
Round to Nearest chooses the nearer of the two adjacent
representable values. We say this rounding is unbiased
because it breaks ties on the basis of the bit just left
of G. It's reasonable to expect that bit to be 0 or 1 with equal probability.
It's helpful to list the cases:
  ± | b • b b b b b b b | 0 R S    round down
  ± | b • b b b b b b b | 1 1 S    round up
  ± | b • b b b b b b b | 1 R 1    round up
  ± | b • b b b b b b 0 | 1 0 0    round down
  ± | b • b b b b b b 1 | 1 0 0    round up
The last two cases illustrate an intermediate result halfway between two representable values. The rounding is unbiased in that it rounds up only if the least significant bit, just to the left, is one.
After rounding, we have a value

  +-------------------+
± | b • b b b b b b b | * 2**k
  +-------------------+

with k no less than the smallest normal exponent. If k is larger than the largest normal exponent, exponent overflow arises. The magnitude is too large to represent.
Overflow is an extraordinary circumstance, in which the delivered result provides no more than an upper or lower bound on the magnitude of the mathematical result. The IEEE 754 approach to overflow is to deliver either \( \infty \) or the maximum normal number, carrying the sign of the result, with the choice depending on the rounding mode and the sign.
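The mode-by-mode choice on overflow might be tabulated as follows; this is a sketch using placeholder mode names and symbolic magnitudes rather than any particular format's encodings:

```python
def overflow_result(mode, sign):
    """On exponent overflow, choose between infinity and the maximum normal
    number, carrying the sign of the result, per rounding mode."""
    if mode == "to-nearest":
        mag = "inf"                                   # overflow rounds away to infinity
    elif mode == "toward-zero":
        mag = "max-normal"                            # never exceeds the format
    elif mode == "toward-plus-infinity":
        mag = "inf" if sign == '+' else "max-normal"  # only + grows toward +inf
    elif mode == "toward-minus-infinity":
        mag = "inf" if sign == '-' else "max-normal"  # only - grows toward -inf
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return sign, mag
```

Either way, the delivered value is only a bound on the true magnitude, which is why overflow is also signaled as an exception.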