
How floating point arithmetic works

This booklet presents binary floating point number systems and special values. You will see how arithmetic operations produce results within the number system. You will see what happens at the edges, where floating point departs from everyday mathematics. You will see how the arithmetic connects to the rest of the computer system, through the input and output of decimal values and the reporting of exceptions like overflow.

The beauty and simplicity of floating point arithmetic are easily lost in a fog of mathematical notation. This presentation emphasizes ideas over notation. You have already seen the basic concepts in school, in the form of scientific notation. The speed of light is about \(3.00 \times 10^{8}\) m/sec, and you have seen the condensed form 3.00E+8 on pocket calculators and in spreadsheets.
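As a small illustration of the condensed form, here is a sketch in Python, which is an assumption of this example and not a language the booklet itself uses. It prints the speed of light in exponent notation. Note that Python pads the exponent to at least two digits, so it shows 3.00E+08 where a calculator might show 3.00E+8.

```python
# Speed of light rounded to three significant digits, as in the text.
speed_of_light = 3.00e8  # m/sec

# ".2E" asks for two digits after the decimal point in exponent form.
condensed = f"{speed_of_light:.2E}"
print(condensed)  # 3.00E+08
```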

Three simplifications keep the story short. First, this presentation is in binary. All of the ideas apply directly to octal (base 8), decimal, and hexadecimal (base 16) arithmetic.

Second, the presentation covers just one hypothetical data type with 8 significant bits. The story is the same for standardized formats like single with 24 significant bits, double with 53, and quadruple with 113 significant bits. Working with 8 bits is easier on the eyes.
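To get a feel for what 8 significant bits means, the following Python sketch rounds an ordinary double to the nearest value with a given number of significant bits. The function name `round_to_sig_bits` is hypothetical, and the sketch assumes round-to-nearest with ties to even and places no limit on the exponent range, since this section does not specify one for the 8-bit type.

```python
import math
import sys

def round_to_sig_bits(x: float, bits: int = 8) -> float:
    """Round x to the nearest value with `bits` significant bits.

    Hypothetical helper for illustration; the exponent range is unlimited.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    # frexp splits x into m * 2**e with 0.5 <= |m| < 1.
    m, e = math.frexp(x)
    # Scaling by a power of two is exact, so only round() loses information.
    scaled = m * (1 << bits)
    # Python's round() on a float rounds ties to the even integer.
    return math.ldexp(round(scaled), e - bits)

# Python floats are doubles on common platforms: 53 significant bits.
print(sys.float_info.mant_dig)

# One tenth, kept to 8 significant bits, becomes 205/2048.
print(round_to_sig_bits(0.1))  # 0.10009765625
```

With only 8 bits the rounding error is visible in the fourth decimal digit, which is what makes the small format convenient for worked examples.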

Third, we omit the details of the storage format. How the bits are arranged in memory is a vital part of the design, but the encoding has nothing to do with arithmetic. You will find extensive discussion of number encodings current and past outside this pamphlet.
