Question about model size

#4
by mhenning - opened

Hi,

thanks for the model upload! I wanted to ask how the model is only 33 GB with AWQ quants. In my (beginner) understanding, a 120B model is ~240 GB at fp16, ~120 GB at fp8 and ~60 GB at 4-bit. So how is this quantization so much smaller? And did the quality suffer much more than at 60 GB quants?
Thanks in advance.

I've been MIA for the last two weeks working on the gpt-oss-20b/120b AWQ W4A16 v0.2.0 model release.

I will eval this version against v0.2.0 and fp16. Will drop the evals tonight.

The reason there aren't any other AWQ or GPTQ quants out for this model is that it requires reverse engineering the model layers.
Quantization libraries (llmcompressor, AutoAWQ, GPTQ-for-LLaMA, etc.) look for known module classes like nn.Linear, LlamaDecoderLayer, etc., and apply regex rules such as re:.*q_proj$ → apply quantizer.
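
As a toy illustration of how those regex rules resolve against module paths (plain torch + re, not any particular library's internals):

```python
import re
import torch.nn as nn

# Toy stand-in for a "known" architecture layout.
class ToyAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(64, 64)
        self.k_proj = nn.Linear(64, 64)

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = ToyAttention()

model = ToyModel()

# A recipe rule like `re:.*q_proj$` only fires when a module's dotted path matches.
rule = re.compile(r".*q_proj$")
targets = [name for name, mod in model.named_modules()
           if isinstance(mod, nn.Linear) and rule.match(name)]
print(targets)  # ['self_attn.q_proj'] -> the quantizer would be applied here
```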

In gpt-oss, many projections and feed-forward weights live in differently named modules (often wrapped in ColumnLinear / RowLinear / GatedMLP from the OSS repo).

So before you can use AWQ/GPTQ recipes, you have to map these custom layers to equivalent targets (self_attn.q_proj, mlp.gate_proj, etc.).
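
A minimal sketch of that mapping step, assuming a hypothetical ColumnLinear wrapper (the real gpt-oss module names differ):

```python
import torch.nn as nn

# Hypothetical custom wrapper standing in for the OSS repo's ColumnLinear / RowLinear.
class ColumnLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.inner = nn.Linear(in_f, out_f)
    def forward(self, x):
        return self.inner(x)

# Map custom module names to the canonical targets that AWQ/GPTQ recipes expect.
# These names are illustrative, not the actual gpt-oss layer names.
NAME_MAP = {
    "attn.wq": "self_attn.q_proj",
    "ffn.w_gate": "mlp.gate_proj",
}

def remap_targets(model: nn.Module):
    """Return {canonical_name: module} so a standard recipe can find these layers."""
    remapped = {}
    for name, mod in model.named_modules():
        if name in NAME_MAP and isinstance(mod, (ColumnLinear, nn.Linear)):
            remapped[NAME_MAP[name]] = mod
    return remapped

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Module()
        self.attn.wq = ColumnLinear(64, 64)
        self.ffn = nn.Module()
        self.ffn.w_gate = ColumnLinear(64, 256)

print(remap_targets(Block()).keys())  # dict_keys(['self_attn.q_proj', 'mlp.gate_proj'])
```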

The current model files (v0.1.0) are the first lossless quants of gpt-oss to hit the internet.

I put out v0.1.0 too quickly, without thorough testing and evals. Sorry if you were not able to get the model running. The quantization routine for v0.1.0 was written in pure torch. For stability, I have cut the pure torch route.

v0.2.0 will drop tonight with evals and functionality for vLLM, SGLang, and TensorRT-LLM.
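
For a rough picture of what running a W4A16 compressed-tensors checkpoint in vLLM usually looks like (a hedged sketch: the repo id below is a placeholder and the exact options will depend on how v0.2.0 ships):

```python
from vllm import LLM, SamplingParams

# Placeholder repo id -- substitute the actual v0.2.0 checkpoint.
llm = LLM(
    model="your-org/gpt-oss-120b-awq-w4a16",
    quantization="compressed-tensors",  # usually auto-detected from the config; can be omitted
    max_model_len=8192,
)

out = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```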

v0.3.0 will come out mid to late September.
It will have undergone sparse fine-tuning plus SFT and RLHF post-training, and will ship with a trained speculator model.

The model will be this size or smaller and have better evals than the original FP16 model.

gpt-oss-120b is 224GB at full FP16 precision.

First pass: 2:4 sparsity - a 50% reduction in model size / memory footprint, taking us down to 112GB.
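
A toy sketch of what 2:4 structured sparsity does to a weight tensor (illustration only, not the actual pruning recipe):

```python
import torch

# 2:4 structured sparsity: in every contiguous group of 4 weights,
# keep the 2 with the largest magnitude and zero the other 2.
def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    groups = w.reshape(-1, 4)                    # view the weights as groups of 4
    keep = groups.abs().topk(2, dim=-1).indices  # indices of the 2 largest per group
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(2, 8)
print(prune_2_4(w))  # exactly two non-zeros in every group of 4
```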

Second pass: int4-awq-w4a16 quantization - going from fp16 to int4 drops 75% of the bits per weight. 112GB -> 28GB.

fp16 to int4 = 75% reduction in bytes.
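
As plain arithmetic, using the 224GB FP16 figure above:

```python
fp16_gb = 224                            # gpt-oss-120b at full FP16, per the figure above
after_sparsity = fp16_gb * 0.5           # 2:4 sparsity: store half the weights -> 112 GB
after_int4 = after_sparsity * (4 / 16)   # 16-bit -> 4-bit weights: keep 25% of the bytes -> 28 GB
print(after_sparsity, after_int4)        # 112.0 28.0
```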

For a signed INT4, 1 bit is for the sign and the remaining 3 bits are for the value, allowing for 16 distinct integer values, typically ranging from -8 to 7.
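
A toy example of pushing weights onto that grid (per-tensor scale for simplicity; real AWQ uses per-group scales):

```python
import torch

w = torch.randn(4, 8)                          # full-precision weights
scale = w.abs().max() / 7                      # map the largest magnitude onto the INT4 grid
q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # 16 levels: -8 .. 7
w_hat = q.float() * scale                      # dequantized weights used at matmul time (W4A16)
print(q.unique(), (w - w_hat).abs().max())     # integers in [-8, 7], small quantization error
```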

fp16 uses 1 bit for the sign, 5 bits for the exponent, and 10 bits for the mantissa. It gives faster computation on compatible hardware (like NVIDIA Tensor Cores) and halves memory usage compared to FP32.
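
You can read those bit layouts straight out of numpy:

```python
import numpy as np

for dtype in (np.float16, np.float32):
    info = np.finfo(dtype)
    print(dtype.__name__, "| total bits:", info.bits,
          "| exponent bits:", info.nexp, "| mantissa bits:", info.nmant)
# float16 -> 16 bits: 1 sign + 5 exponent + 10 mantissa
# float32 -> 32 bits: 1 sign + 8 exponent + 23 mantissa
```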

The model is about 33GB; the ~5GB gap between 28GB and 33GB comes from upscaling the router, lm_head, and a few other fragile layers to FP16 and FP32.

Model v0.2.0 will have TF32 Optimization enabled for FP32 data types.

Understanding FP32 (Single-Precision Floating Point)
FP32, or single-precision floating point, is a standard 32-bit format widely used in scientific computing and deep learning. It consists of:

1 sign bit (for positive/negative values)
8 exponent bits (for scaling)
23 mantissa bits (for precision)

FP32 provides high numerical accuracy, making it suitable for applications requiring precise calculations. However, its computational demands can slow down training times for large neural networks.

vs.

TF32 (TensorFloat-32)
TF32 is a specialized format introduced by NVIDIA that accelerates matmul operations in AI/ML workloads. It shrinks FP32's 32-bit representation to 19 bits: like BF16, it keeps FP32's 8 exponent bits, so the numeric range and the translation to and from FP32 stay the same, while using FP16's 10-bit mantissa/fraction. Because the exponent matches FP32, NVIDIA GPUs can run FP32 calculations through TF32 by simply rounding the fraction to the shorter width.
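
In PyTorch, that TF32 path for FP32 matmuls is a simple toggle on Ampere-class GPUs such as SM86:

```python
import torch

# Let FP32 matmuls and cuDNN convolutions run through the TF32 tensor-core path.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent newer knob (PyTorch >= 1.12):
torch.set_float32_matmul_precision("high")
```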

Model composition for the upcoming SM86-optimized gpt-oss 20b and 120b models:

2 - 4 layers will be FP32 with TF32 optimization
2 - 4 layers will be FP16
Linear layers + experts will be different variants of int4 and/or int8 (select models will even mix precisions and use 2:4 sparsity, SM86-optimized for 1.5x to 2x speedups).
INT4 and INT8 KV-cache quantization, tied to a specific inference engine, is in the works too.
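
Purely as an illustration of that composition (the layer names and the exact split are placeholders, not the shipped config), a per-layer precision map could look like:

```python
# Hypothetical precision plan for an SM86-optimized build; names are illustrative only.
precision_plan = {
    "model.embed_tokens":               "fp16",
    "model.layers.0":                   "fp32 + tf32",   # fragile early layers kept high precision
    "model.layers.1":                   "fp32 + tf32",
    "re:.*self_attn\\.(q|k|v|o)_proj$": "int4 w4a16, 2:4 sparse",
    "re:.*mlp\\.experts\\..*$":         "int4 w4a16",
    "model.router":                     "fp16",
    "lm_head":                          "fp16",
    "kv_cache":                         "int8",          # engine-specific KV-cache quantization
}
```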

A how-to for running it would be useful, since I can't get this model to run on any backend :/
