## Precision: FP32 vs FP16 (and BF16)
This project saves dequantized checkpoints in FP16 by default (BF16 -> FP16).
- **FP32** (single precision, 32-bit, 4 bytes/param): reference/default precision in many frameworks; highest numerical range/precision, largest memory.
- **FP16** (half precision, 16-bit, 2 bytes/param): half the memory of FP32; great for inference on modern GPUs, but may underflow/overflow more easily than BF16.
- **BF16** (bfloat16, 16-bit, 2 bytes/param): same memory as FP16, with an FP32-like wide exponent; often more numerically robust than FP16, at the cost of slightly less mantissa precision.
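As a quick illustration of these trade-offs, the numeric limits of the three formats can be inspected directly (a minimal PyTorch sketch, not part of this repo's tooling):

```python
import torch

# Compare storage width, largest representable value, and precision near 1.0
# for the three formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}  bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps:.3e}")
```

Note how BF16's `max` matches FP32's range while its `eps` (precision) is coarser than FP16's.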
In this repo, output precision is FP16 (default) or BF16 via `--dtype`. FP32 output is not offered because it doubles disk/RAM versus FP16/BF16 with minimal inference benefit on modern hardware.
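For illustration, here is a hypothetical sketch of how a `--dtype` flag can map onto torch dtypes; the repo's actual CLI wiring may differ:

```python
import argparse
import torch

# Hypothetical mapping from the CLI flag values to torch dtypes.
DTYPE_MAP = {"fp16": torch.float16, "bf16": torch.bfloat16}

parser = argparse.ArgumentParser()
parser.add_argument("--dtype", choices=list(DTYPE_MAP), default="fp16",
                    help="Output precision for the dequantized checkpoint")
args = parser.parse_args()

target_dtype = DTYPE_MAP[args.dtype]  # torch.float16 for the default
```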
### Memory math (example: 120B parameters)
Each parameter stores one number:
| Format | Bits | Bytes/param | Approx. size for 120B params |
|---|---|---|---|
| FP32 | 32 | 4 | ~447 GiB |
| FP16 | 16 | 2 | ~224 GiB |
| BF16 | 16 | 2 | ~224 GiB |
Calculation (GiB): `params * bytes_per_param / 1024^3`

For 120,000,000,000 params:
- FP32: 480e9 B ≈ 447.03 GiB
- FP16/BF16: 240e9 B ≈ 223.52 GiB
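The same arithmetic as a tiny Python helper (illustrative only; the 120B parameter count is the example figure above):

```python
def checkpoint_gib(num_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in GiB: params * bytes_per_param / 1024^3."""
    return num_params * bytes_per_param / 1024**3

print(f"FP32:      {checkpoint_gib(120_000_000_000, 4):.2f} GiB")  # ~447.03 GiB
print(f"FP16/BF16: {checkpoint_gib(120_000_000_000, 2):.2f} GiB")  # ~223.52 GiB
```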
### When to use which
- **Inference on modern NVIDIA GPUs** (Turing+/Ampere+/Ada/Hopper): use FP16 (default here) or BF16. You'll get large memory savings and typically equal or faster throughput than FP32 thanks to tensor cores.
- **Training / finetuning**: use mixed precision (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states). If your GPU supports BF16 well (e.g., A100/H100), BF16 is preferred for numeric stability. (This tool focuses on exporting dequantized checkpoints, not training loops.)
- **If you hit numeric issues in FP16**: try BF16 (`--dtype bf16`). It is the same size as FP16 but usually more stable thanks to its FP32-like exponent range.
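A minimal loading sketch, assuming the exported checkpoint is consumed with Hugging Face `transformers` (adjust the model id or local path as needed):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the exported weights in BF16 instead of the default FP16.
model = AutoModelForCausalLM.from_pretrained(
    "twhitworth/gpt-oss-120b-fp16",
    torch_dtype=torch.bfloat16,   # or torch.float16; both are 2 bytes/param
    device_map="auto",            # requires accelerate; spreads layers across available GPUs
)
```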
### Notes
- FP32 remains the gold standard for numeric headroom and deterministic baselines, but for inference it’s typically unnecessary and costly (2× memory vs FP16/BF16).
- Tensor cores accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
- If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.
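A hedged sketch of such a dtype conversion using `safetensors` (file names are placeholders, not this repo's actual shard names):

```python
import torch
from safetensors.torch import load_file, save_file

# Load one exported shard, cast floating-point tensors to the runtime's expected
# dtype, and save a new shard. Non-float tensors (e.g. int buffers) are kept as-is.
state_dict = load_file("model-00001-of-000NN.safetensors")  # placeholder shard name
target = torch.bfloat16
cast = {name: t.to(target) if t.is_floating_point() else t
        for name, t in state_dict.items()}
save_file(cast, "model-bf16-00001-of-000NN.safetensors")
```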
### WIP
- Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
- Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT‑Bench; plus latency/throughput and memory footprint on 3090/A100.
- Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.