## Precision: FP32 vs FP16 (and BF16)
This project saves dequantized checkpoints in FP16 by default (BF16 -> FP16).
- **FP32** (single precision, 32-bit, 4 bytes/param): reference/default precision in many frameworks; highest numerical range/precision, largest memory.
- **FP16** (half precision, 16-bit, 2 bytes/param): half the memory of FP32; great for inference on modern GPUs, but may underflow/overflow more easily than BF16.
- **BF16** (bfloat16, 16-bit, 2 bytes/param): same memory as FP16, with an FP32-like wide exponent; often more numerically robust than FP16, at the cost of slightly less mantissa precision.
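As a quick illustration of these trade-offs, the numeric limits of the three formats can be inspected directly (a minimal PyTorch sketch, not part of this repo's tooling):

```python
import torch

# Compare storage width, largest representable value, and precision near 1.0
# for the three formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}  bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps:.3e}")
```

Note how BF16's `max` matches FP32's range while its `eps` (precision) is coarser than FP16's.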
In this repo, output precision is FP16 (default) or BF16 via `--dtype`. FP32 output is not offered because it doubles disk/RAM versus FP16/BF16 with minimal inference benefit on modern hardware.
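For illustration, here is a hypothetical sketch of how a `--dtype` flag can map onto torch dtypes; the repo's actual CLI wiring may differ:

```python
import argparse
import torch

# Hypothetical mapping from the CLI flag values to torch dtypes.
DTYPE_MAP = {"fp16": torch.float16, "bf16": torch.bfloat16}

parser = argparse.ArgumentParser()
parser.add_argument("--dtype", choices=list(DTYPE_MAP), default="fp16",
                    help="Output precision for the dequantized checkpoint")
args = parser.parse_args()

target_dtype = DTYPE_MAP[args.dtype]  # torch.float16 for the default
```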
### Memory math (example: 120B parameters)
Each parameter stores one number:
| Format | Bits | Bytes/param | Approx. size for 120B params |
|---|---|---|---|
| FP32 | 32 | 4 | ~447 GiB |
| FP16 | 16 | 2 | ~224 GiB |
| BF16 | 16 | 2 | ~224 GiB |
Calculation (GiB): `params * bytes_per_param / 1024^3`

For 120,000,000,000 params:
- FP32: 480e9 B ≈ 447.03 GiB
- FP16/BF16: 240e9 B ≈ 223.52 GiB
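The same arithmetic as a tiny Python helper (illustrative only; the 120B parameter count is the example figure above):

```python
def checkpoint_gib(num_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in GiB: params * bytes_per_param / 1024^3."""
    return num_params * bytes_per_param / 1024**3

print(f"FP32:      {checkpoint_gib(120_000_000_000, 4):.2f} GiB")  # ~447.03 GiB
print(f"FP16/BF16: {checkpoint_gib(120_000_000_000, 2):.2f} GiB")  # ~223.52 GiB
```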
### When to use which
- **Inference on modern NVIDIA GPUs** (Turing+/Ampere+/Ada/Hopper): use FP16 (default here) or BF16. You'll get large memory savings and typically equal or faster throughput than FP32 thanks to tensor cores.
- **Training / finetuning**: use mixed precision (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states). If your GPU supports BF16 well (e.g., A100/H100), BF16 is preferred for numeric stability. (This tool focuses on exporting dequantized checkpoints, not training loops.)
- **If you hit numeric issues in FP16**: try BF16 (`--dtype bf16`). It is the same size as FP16 but usually more stable thanks to its FP32-like exponent range.
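A minimal loading sketch, assuming the exported checkpoint is consumed with Hugging Face `transformers` (adjust the model id or local path as needed):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the exported weights in BF16 instead of the default FP16.
model = AutoModelForCausalLM.from_pretrained(
    "twhitworth/gpt-oss-120b-fp16",
    torch_dtype=torch.bfloat16,   # or torch.float16; both are 2 bytes/param
    device_map="auto",            # requires accelerate; spreads layers across available GPUs
)
```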
### Notes
- FP32 remains the gold standard for numeric headroom and deterministic baselines, but for inference it’s typically unnecessary and costly (2× memory vs FP16/BF16).
- Tensor cores accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
- If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.
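A hedged sketch of such a dtype conversion using `safetensors` (file names are placeholders, not this repo's actual shard names):

```python
import torch
from safetensors.torch import load_file, save_file

# Load one exported shard, cast floating-point tensors to the runtime's expected
# dtype, and save a new shard. Non-float tensors (e.g. int buffers) are kept as-is.
state_dict = load_file("model-00001-of-000NN.safetensors")  # placeholder shard name
target = torch.bfloat16
cast = {name: t.to(target) if t.is_floating_point() else t
        for name, t in state_dict.items()}
save_file(cast, "model-bf16-00001-of-000NN.safetensors")
```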
### WIP
- Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
- Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT‑Bench; plus latency/throughput and memory footprint on 3090/A100.
- Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.