Floating point mimics scientific notation
$\pm a \times r^{\pm b}$
- where a = significant digits
- b = order of magnitude
- r = radix or base
- Example $6.023 \times 1 0^{23}$
Floating point is very important
Used for high value computations
Floating point processors used to be an entirely separate component
Floating point representation has more variation
- You need to know precision and range,
- also two plus or minus signs, so store with sign magnitude or with 2’s complement?
- How many range bits and precision bits do we choose?
- If each computer uses a different floating point interpretation, then we can’t move software from one to another very easily!
Software is near non-portable due to casting issues
To solve these issues, a standardized floating point representation was made

IEEE Floating Point Representation

Floating point format specified at a bit string level
Has rounded results, and even has exceptions/error codes for better error handling
Floating point software and data can now run unmodified on a lot of computers
One major issue, addition of floating point numbers is not associative
Range of values for floating point numbers: $2^{- 126} < - > 2^{127}$ about $1 0^{- 38} < - > 1 0^{38}$

Format basics

Sign | Biased Exponent | Mantissa (with a hidden 1 as the most significant bit)
- Also can be written as s(ign)|E(xponent)|M(antissa) - s|E|M
Sign field pairs with the mantissa
- 0 is positive, 1 is negative
Exponent representation is a biased Unsigned Integer
- Range of an 8 bit integer exponent is -127 to 127
- We represent that exponent as itself + a bias value of 127.
- Now the range is $0 \leq E \leq 255$ , the sign of the exponent is implicit
- Biased exponents let us interpret the whole thing as one sign magnitude representation for easy comparison
Mantissa values are normalized
- In binary the value must be 1. … .. ..
- Because that first value is always a 1, then we don’t need to store that bit. We can just add it in with the circuit itself and strip it out when storing
- Extra precision at no memory cost!

If the sign is 0 or 1, and everything else is 0, then it encodes plus or minus 0.
When the Exponent is 0 and the mantissa bits are not 0 EXCEPT the MSB is 0, then that encodes a denormalized number
- This enables gradual underflow reducing total underflow noise
Sign of 0/1, Exponent of 255, and Mantissa bits of 0 encodes plus or minus infinity
- Plus or minus overflow is set to plus or minus infinity, allowing you to respond to overflow errors
Exponent of 255 and mantissa bits of not all 0’s encodes Not a Number (NaN)