#CS250 #Information #Computer-Science

  • Floating point mimics scientific notation, where a number is written as a × r^b
    • where a = significant digits
    • b = order of magnitude
    • r = radix or base
    • Example: 6.02 × 10^23 has a = 6.02, r = 10, b = 23
  • Floating point is very important
  • Used for high-value computations such as scientific and engineering workloads
  • Floating point processors (FPUs, or math coprocessors) used to be an entirely separate component
  • Floating point representation has more variation
    • You need to know both the precision and the range
    • There are also two plus-or-minus signs (one for the mantissa, one for the exponent) - do we store them with sign magnitude or with 2’s complement?
    • How many range (exponent) bits and precision (mantissa) bits do we choose?
    • If each computer uses a different floating point interpretation, then we can’t move software from one to another very easily!
  • Software was nearly non-portable due to these representation and casting issues
  • To solve these issues, a standardized floating point representation was created: IEEE 754

IEEE Floating Point Representation

  • Floating point format is specified at the bit string level
  • Has rounded results, and even has exceptions/error codes for better error handling
  • Floating point software and data can now run unmodified on a lot of computers
  • One major issue: addition of floating point numbers is not associative - (a + b) + c can differ from a + (b + c), as the sketch after this list shows
  • Range of values for single precision floating point numbers: about 10^-38 to 10^38 in magnitude
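
A minimal C sketch of the non-associativity issue above; the values 1e20 and 1 are illustrative, chosen so the small term is lost when it is added to the large one first:

```c
#include <stdio.h>

int main(void) {
    float big   = 1.0e20f;
    float small = 1.0f;

    /* Grouping changes the answer: cancelling the big terms first
       preserves the small one; adding the small term to big first loses it. */
    float left  = (big + -big) + small;  /* (1e20 - 1e20) + 1 -> 1 */
    float right = big + (-big + small);  /* 1e20 + (-1e20)    -> 0,
                                            since 1 vanishes next to 1e20 */

    printf("(big + -big) + small = %f\n", left);   /* prints 1.000000 */
    printf("big + (-big + small) = %f\n", right);  /* prints 0.000000 */
    return 0;
}
```

This is why compilers cannot freely reorder floating point sums without changing the results.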

Format basics

  • Sign | Biased Exponent | Mantissa (with a hidden 1 as the most significant bit)
    • Also can be written as s(ign)|E(xponent)|M(antissa) - s|E|M
  • Sign field gives the sign of the mantissa (sign magnitude style)
    • 0 is positive, 1 is negative
  • Exponent is represented as a biased unsigned integer
    • The range an 8 bit integer exponent needs to cover is about -127 to 127
    • We represent that exponent as itself + a bias value of 127.
    • Now the range is 0 to 254, and the sign of the exponent is implicit (255 is reserved for the special encodings below)
    • Biased exponents let us interpret the whole bit string as one sign magnitude number for easy comparison
  • Mantissa values are normalized
    • In binary the value must have the form 1.xxxx…
    • Because that first bit is always a 1, we don’t need to store it. The circuit can add it back in during computation and strip it out when storing
    • Extra precision at no memory cost! (The sketch after this list decodes these fields by hand.)
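
As a sketch of how the s|E|M fields sit in memory, this C program (assuming 32-bit IEEE single precision, which is what modern hardware uses) pulls a float apart and rebuilds it as (-1)^s × 1.M × 2^(E-127):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    float x = -6.25f;  /* arbitrary test value: -1.1001 x 2^2 in binary */

    /* Copy the raw 32 bits into an integer; memcpy avoids the
       undefined behavior of pointer-cast type punning. */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    uint32_t sign     = bits >> 31;           /* 1 bit                        */
    uint32_t exponent = (bits >> 23) & 0xFFu; /* 8 bits, biased by 127        */
    uint32_t mantissa = bits & 0x7FFFFFu;     /* 23 bits, hidden 1 not stored */

    printf("sign=%u  biased exponent=%u (actual %d)  mantissa=0x%06X\n",
           (unsigned)sign, (unsigned)exponent,
           (int)exponent - 127, (unsigned)mantissa);

    /* Rebuild a normalized value: (-1)^s * (1 + M/2^23) * 2^(E-127). */
    double rebuilt = (sign ? -1.0 : 1.0)
                   * (1.0 + mantissa / 8388608.0)  /* 2^23 = 8388608 */
                   * exp2((int)exponent - 127);
    printf("rebuilt value = %f\n", rebuilt);       /* prints -6.250000 */
    return 0;
}
```

On some platforms exp2 lives in the math library, so compile with -lm.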

Floating Point Error Handling

  • If everything except the sign bit is 0, it encodes plus or minus 0 (sign 0 is +0, sign 1 is -0)
  • When the exponent is 0 and the mantissa bits are not all 0, that encodes a denormalized number - the hidden MSB is treated as 0 instead of 1
    • This enables gradual underflow, reducing total underflow noise
  • Sign of 0/1, exponent of 255, and mantissa bits of all 0 encodes plus or minus infinity
    • Results that overflow are set to plus or minus infinity, allowing you to detect and respond to overflow errors
  • Exponent of 255 and mantissa bits that are not all 0 encodes Not a Number (NaN)
    • All of these special encodings are demonstrated in the sketch below
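
A short C sketch that produces each of these special encodings; the volatile zero keeps the compiler from rejecting a division by a constant zero, and isinf/isnan/isnormal are the standard C99 <math.h> classifiers:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    volatile float zero = 0.0f;     /* volatile: force the division at runtime */

    float pos_inf  = 1.0f / zero;   /* overflow-style result: +infinity */
    float neg_inf  = -1.0f / zero;  /* -infinity */
    float nan_val  = zero / zero;   /* invalid operation: NaN */
    float neg_zero = -0.0f;         /* sign 1, all other bits 0 */
    float denorm   = 1e-45f;        /* below the smallest normalized float,
                                       so it is stored denormalized */

    printf("pos_inf:  %f  isinf=%d\n", pos_inf, isinf(pos_inf));
    printf("neg_inf:  %f  isinf=%d\n", neg_inf, isinf(neg_inf));
    printf("nan_val:  %f  isnan=%d\n", nan_val, isnan(nan_val));
    printf("neg_zero: %f  (== +0.0f: %d)\n", neg_zero, neg_zero == 0.0f);
    printf("denorm:   %g  isnormal=%d\n", denorm, isnormal(denorm));
    return 0;
}
```

Note that -0 compares equal to +0, so the sign difference only shows up in operations like 1/x.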

Rounding

  • The standard builds in rounding in a way that retains as much precision as possible
  • Inaccurate rounding injects noise (error) into a computation
  • Rounding modes in the IEEE Floating Point Standard (all four appear in the sketch after this list)
    • Round to nearest
      • Result is as if the computation were done with infinite precision, then rounded to the nearest representable value
      • Default rounding mode for the standard
    • Round towards positive/negative infinity
    • Round towards 0
      • Truncation
      • A very noisy form of rounding
      • Kept purely for compatibility, really
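
A C99 sketch that exercises all four modes through the standard <fenv.h> interface; note that some compilers need a flag such as GCC’s -frounding-math for runtime rounding-mode changes to be fully respected:

```c
#include <stdio.h>
#include <fenv.h>

/* Standard way to tell the compiler the floating point environment
   changes at runtime (not every compiler honors this pragma). */
#pragma STDC FENV_ACCESS ON

static void show(const char *name, int mode) {
    fesetround(mode);
    volatile float one = 1.0f, three = 3.0f;  /* volatile blocks constant folding */
    float third = one / three;  /* 1/3 is inexact, so the mode picks a neighbor */
    printf("%-22s 1/3 = %.10f\n", name, third);
}

int main(void) {
    show("to nearest (default)", FE_TONEAREST);
    show("toward +infinity",     FE_UPWARD);
    show("toward -infinity",     FE_DOWNWARD);
    show("toward 0 (truncate)",  FE_TOWARDZERO);
    return 0;
}
```

Truncation’s error is always biased toward zero, while round to nearest averages the error out - one reason nearest is the default.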

Interpreting Floating Point