Precision: FP32 vs FP16 (and BF16)

This project saves dequantized checkpoints in FP16 by default (the original BF16 weights are cast to FP16).

  • FP32 (single precision, 32-bit, 4 bytes/param): the reference/default precision in many frameworks. Highest numerical range and precision, largest memory footprint.
  • FP16 (half precision, 16-bit, 2 bytes/param): half the memory of FP32. Great for inference on modern GPUs, but it can underflow/overflow more easily than BF16.
  • BF16 (bfloat16, 16-bit, 2 bytes/param): same memory as FP16, with a wide exponent range like FP32; often more numerically robust than FP16, at the cost of a slightly less precise mantissa.
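
A quick way to see the range/precision trade-off is to compare the format constants directly. A minimal sketch using PyTorch's torch.finfo:

```python
import torch

# Dynamic range (max, smallest normal) and precision (eps) of each format.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")
```

FP16 tops out around 6.5e4 with a relatively fine eps, while BF16 reaches roughly 3.4e38 (the FP32 range) with a coarser eps, which is why it tends to be more robust to overflow.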

In this repo, output precision is FP16 (default) or BF16 via --dtype. FP32 output is not offered because it doubles disk/RAM vs FP16/BF16 with minimal inference benefit on modern hardware.
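
Conceptually, the export is just a per-tensor cast. The snippet below is a minimal sketch using safetensors and PyTorch, not the repo's actual conversion code; the file names are placeholders:

```python
import torch
from safetensors.torch import load_file, save_file

target_dtype = torch.float16  # use torch.bfloat16 for the --dtype bf16 path

# Load one BF16 shard, cast every tensor, and write it back out
# ("shard-00001.safetensors" / "model-fp16.safetensors" are placeholder names).
tensors = load_file("shard-00001.safetensors")
tensors = {name: t.to(target_dtype) for name, t in tensors.items()}
save_file(tensors, "model-fp16.safetensors")
```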

Memory math (example: 120B parameters)

Each parameter stores one number:

Format  Bits  Bytes/param  Approx. size for 120B params
FP32    32    4            ~447 GiB
FP16    16    2            ~224 GiB
BF16    16    2            ~224 GiB

Calculation (GiB): params * bytes_per_param / 1024^3
For 120,000,000,000 params:
  FP32: 480e9 bytes ≈ 447.03 GiB
  FP16/BF16: 240e9 bytes ≈ 223.52 GiB
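
The same arithmetic as a small helper, shown here as a sketch in Python (the function name is illustrative, not part of this repo):

```python
def checkpoint_size_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of a dense checkpoint in GiB (params * bytes / 1024^3)."""
    return num_params * bytes_per_param / 1024**3

print(f"{checkpoint_size_gib(120_000_000_000, 4):.2f} GiB")  # FP32      -> ~447.03 GiB
print(f"{checkpoint_size_gib(120_000_000_000, 2):.2f} GiB")  # FP16/BF16 -> ~223.52 GiB
```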

When to use which

  • Inference on modern NVIDIA GPUs (Turing+/Ampere+/Ada/Hopper): Use FP16 (default here) or BF16. You’ll get large memory savings and typically equal or faster throughput than FP32 thanks to tensor cores.

  • Training / Finetuning: Use mixed precision (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states). If your GPU supports BF16 well (e.g., A100/H100), BF16 is preferred for numeric stability. (This tool focuses on exporting dequantized checkpoints, not training loops.)

  • If you hit numeric issues in FP16: Try BF16 (--dtype bf16). Same size as FP16 but usually more stable due to FP32-like exponent range.
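
To see the kind of failure BF16 avoids, here is a minimal PyTorch demonstration (the values are illustrative): FP16's largest finite value is 65504, so anything above it overflows to inf, while BF16's FP32-like exponent range keeps it finite.

```python
import torch

x = torch.tensor([60000.0, 70000.0])  # 70000 exceeds FP16's max finite value (65504)
print(x.to(torch.float16))            # second element overflows to inf
print(x.to(torch.bfloat16))           # both stay finite (coarser mantissa, FP32-like exponent)
```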

Notes

  • FP32 remains the gold standard for numeric headroom and deterministic baselines, but for inference it’s typically unnecessary and costly (2× memory vs FP16/BF16).
  • Tensor cores accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
  • If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.
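
If you need to confirm what a downstream runtime will actually receive, a quick dtype check on the exported file is enough. A minimal sketch using safetensors (the file name is a placeholder):

```python
from safetensors import safe_open

# Print the dtype of every tensor in an exported shard (placeholder file name).
with safe_open("model-fp16.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_tensor(name).dtype)  # expect torch.float16 or torch.bfloat16
```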

WIP

  • Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
  • Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT‑Bench; plus latency/throughput and memory footprint on 3090/A100.
  • Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.