Today let's discuss 32-bit (FP32) and 16-bit (FP16) floating point!
Floating-point numbers are used to represent real numbers (like decimals) and they consist of three parts:
Sign bit:
Indicates whether the number is positive (0) or negative (1).
Exponent:
Determines the scale of the number (i.e., how large or small it is, by shifting the binary point).
Mantissa (or fraction):
Represents the significant digits of the number.
32-bit Floating Point (FP32)
Total bits: 32
Sign bit: 1 bit
Exponent: 8 bits
Mantissa: 23 bits
For example, a number like -15.375 would be represented as:
Sign bit: 1 (negative number)
Exponent: Stored after adding a bias (127 in FP32). In binary, 15.375 is 1111.011, or 1.111011 × 2^3, so the stored exponent is 3 + 127 = 130 (10000010).
Mantissa: The binary digits after the leading 1, which is implied and not stored: 111011, padded with zeros to 23 bits.
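As a minimal sketch of the FP32 layout, the Python snippet below packs -15.375 with the standard struct module and splits the 32 bits into the three fields. The helper name fp32_fields is made up for this illustration.

```python
import struct

def fp32_fields(x):
    """Return (sign, biased exponent, mantissa) of x stored as FP32.

    fp32_fields is a hypothetical helper name used only for this example.
    """
    # Pack x as a big-endian single-precision float, then reread the same
    # 4 bytes as an unsigned 32-bit integer so the raw bits can be masked.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still including the bias
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(-15.375)
print(sign)                 # 1
print(f"{exponent:08b}")    # 10000010 (130, i.e. 3 + 127)
print(f"{mantissa:023b}")   # 111011 followed by 17 zeros
```

-15.375 was picked because it is exactly representable, so the fields match the hand calculation bit for bit.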
16-bit Floating Point (FP16)
Total bits: 16
Sign bit: 1 bit
Exponent: 5 bits
Mantissa: 10 bits
For example, a number like -15.375 would be stored similarly:
Sign bit: 1 (negative number)
Exponent: Uses only 5 bits with a bias of 15, which limits the range compared to FP32. Here the stored exponent is 3 + 15 = 18 (10010).
Mantissa: Only 10 bits for precision. Here it is 111011, padded with zeros to 10 bits (1110110000).
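The same idea can be sketched for FP16 using the "e" (half precision) format code of Python's struct module. The helpers fp16_fields and fp16_value are hypothetical names for this example, and fp16_value only handles normal numbers (not zero, subnormals, infinity, or NaN).

```python
import struct

def fp16_fields(x):
    """Return (sign, biased exponent, mantissa) of x stored as FP16."""
    # "e" is struct's half-precision format; "H" rereads the 2 bytes as an integer.
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    sign = bits >> 15                # top bit
    exponent = (bits >> 10) & 0x1F   # next 5 bits, still including the bias of 15
    mantissa = bits & 0x3FF          # low 10 bits
    return sign, exponent, mantissa

def fp16_value(sign, exponent, mantissa):
    """Rebuild the value from its fields (assumes a normal number)."""
    # The leading 1 is implied, so the significand is 1 + mantissa / 2**10.
    return (-1) ** sign * (1 + mantissa / 1024) * 2.0 ** (exponent - 15)

sign, exponent, mantissa = fp16_fields(-15.375)
print(sign, f"{exponent:05b}", f"{mantissa:010b}")  # 1 10010 1110110000
print(fp16_value(sign, exponent, mantissa))         # -15.375
```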
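Precision and Range
FP32: Higher precision and larger range, with about 7 significant decimal digits and a largest finite value of about 3.4 × 10^38.
FP16: Lower precision (roughly 3 significant decimal digits) and a much smaller range (largest finite value 65504), but it uses half the memory and can be faster on hardware with native FP16 support.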
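To make the precision and range difference concrete, here is a small sketch that stores a value in each format and reads it back, again using only Python's struct module. The helper name round_trip is an assumption for this example. The overflow behaviour shown is what CPython's struct does when packing; other libraries may round to infinity instead.

```python
import math
import struct

def round_trip(x, fmt):
    """Store x in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

# Precision: the same value keeps fewer correct digits in FP16.
print(round_trip(math.pi, "<f"))   # FP32: agrees with pi to about 7 digits
print(round_trip(math.pi, "<e"))   # FP16: 3.140625

# Range: 65504 is the largest finite FP16 value.
print(round_trip(65504.0, "<e"))   # 65504.0
try:
    struct.pack("<e", 70000.0)     # too large for FP16
except (OverflowError, struct.error) as err:
    print("FP16 cannot hold 70000:", err)
print(round_trip(70000.0, "<f"))   # 70000.0, well within FP32's range
```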