Seed-OSS-36B-Instruct Quantized with NVFP4
This repo contains Seed-OSS-36B-Instruct quantized with NVFP4, for maximum performance on NVIDIA Blackwell hardware (RTX 5070, 5080, 5090, RTX Pro 6000, B200, B300, ...).
It can only be run on architectures with hardware FP4 support (Blackwell or later).
Original model: ByteDance-Seed/Seed-OSS-36B-Instruct
This model requires ~21.1GB of VRAM for the weights; however, the KV cache at the maximum context size of 512k tokens requires a further 128GB of VRAM.
Make sure to set an appropriate context size with `--max-model-len`
in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
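To see where the 128GB figure comes from, here is a back-of-the-envelope KV-cache calculation. The architecture numbers (64 layers, 8 KV heads, head dimension 128) are assumptions about Seed-OSS-36B, with the cache held in bf16 (2 bytes per value):

```python
# Back-of-the-envelope KV-cache sizing for Seed-OSS-36B at 512k context.
# Assumed architecture: 64 layers, 8 KV heads, head_dim 128, bf16 cache.

num_layers = 64
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2          # bf16
context_len = 512 * 1024    # 512k tokens

# Per token we cache one K and one V vector per layer per KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
total_bytes = bytes_per_token * context_len

print(f"{bytes_per_token // 1024} KiB per token")
print(f"{total_bytes // 1024**3} GiB for the full 512k context")
```

Under these assumptions the cache costs 256 KiB per token, i.e. 128 GiB at 512k tokens; quantizing the KV cache (e.g. vLLM's `--kv-cache-dtype fp8`) would halve that.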
📥 Usage & Running Instructions
The model was tested with vLLM on 2x RTX Pro 6000; here is a script suitable for such a configuration.
```shell
export MODEL="mratsim/Seed-OSS-36B-Instruct-NVFP4"

vllm serve "${MODEL}" \
  --served-model-name seed-oss-36b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
```
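Once running, the server exposes an OpenAI-compatible API (on port 8000 by default). A minimal sketch of a chat-completion request body, assuming the `--served-model-name` from the script above; the prompt is purely illustrative:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# Assumes the server above runs locally on the default port 8000 and was
# started with --served-model-name seed-oss-36b.
payload = {
    "model": "seed-oss-36b",
    "messages": [
        {"role": "user", "content": "Give a one-line summary of NVFP4."}
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
print(body)

# Send it with e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d "$BODY"
```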
🔬 Quantization method
The llmcompressor library was used with the following recipe:
```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
and calibrated on 512 samples of sequence length 4096 from mit-han-lab/pile-val-backup.
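The recipe and calibration settings above correspond roughly to the following llmcompressor invocation. This is a sketch modeled on llmcompressor's quantization examples, not the exact script used for this repo, and it needs a Blackwell-class GPU setup plus the original weights to actually run:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same recipe as the YAML above: quantize every Linear layer to NVFP4,
# leaving lm_head in higher precision.
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4", ignore=["lm_head"]
)

# Calibrate on 512 samples of sequence length 4096, as stated above.
oneshot(
    model=model,
    dataset=load_dataset("mit-han-lab/pile-val-backup", split="validation"),
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=512,
)

model.save_pretrained("Seed-OSS-36B-Instruct-NVFP4")
tokenizer.save_pretrained("Seed-OSS-36B-Instruct-NVFP4")
```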
Model tree for mratsim/Seed-OSS-36B-Instruct-NVFP4
Base model: ByteDance-Seed/Seed-OSS-36B-Instruct