# Seed-OSS-36B-Instruct Quantized with NVFP4

This repo contains Seed-OSS-36B-Instruct quantized to NVFP4 for maximum performance on NVIDIA Blackwell hardware (5070, 5080, 5090, RTX Pro 6000, B200, B300, ...).

It can only be run on architectures with hardware FP4 support (Blackwell or later).
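
To check whether a GPU qualifies, you can query its compute capability: Blackwell parts report 10.x (data center) or 12.x (consumer and workstation). A quick check, assuming a driver recent enough to support the `compute_cap` query field:

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
# example output (hypothetical):
# name, compute_cap
# NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 12.0
```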

Original model: [ByteDance-Seed/Seed-OSS-36B-Instruct](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct)

This model requires ~21.1 GB of VRAM for the weights; however, the maximum context size of 512k tokens requires 128 GB of VRAM. Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or shard the model across multiple GPUs, for example with tensor parallelism (see the memory-saving example after the launch script below).

## 📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000; the following script is suitable for that configuration.

```bash
export MODEL="mratsim/Seed-OSS-36B-Instruct-NVFP4"
vllm serve "${MODEL}" \
  --served-model-name seed-oss-36b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
```
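
If you have a single GPU or less VRAM, a reduced context window combined with an FP8-quantized KV cache brings memory usage down considerably. A minimal sketch using vLLM's `--max-model-len` and `--kv-cache-dtype` flags; the 32768-token limit is an illustrative value, not a tested recommendation:

```bash
# Single-GPU launch with a smaller context window and FP8 KV cache
# (example values, adjust to your hardware)
export MODEL="mratsim/Seed-OSS-36B-Instruct-NVFP4"
vllm serve "${MODEL}" \
  --served-model-name seed-oss-36b \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85
```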

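Once running, the server exposes vLLM's OpenAI-compatible API (port 8000 by default), so it can be queried with any OpenAI client or plain `curl`; note that the `model` field must match the `--served-model-name` above:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "seed-oss-36b",
    "messages": [{"role": "user", "content": "Explain NVFP4 quantization in one paragraph."}],
    "max_tokens": 256
  }'
```
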
## 🔬 Quantization method

The [llmcompressor](https://github.com/vllm-project/llm-compressor) library was used with the following recipe:

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```

It was calibrated on 512 samples of 4096 tokens each from [mit-han-lab/pile-val-backup](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).
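
For reference, a quantization run along these lines can be reproduced with llmcompressor's Python API. This is a minimal sketch assuming a recent llmcompressor version; the dataset preprocessing shown here is an assumption and may differ from the exact script used:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"
NUM_SAMPLES = 512   # calibration samples
MAX_SEQ_LEN = 4096  # calibration sequence length

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration samples from the Pile validation set, tokenized and truncated
ds = load_dataset("mit-han-lab/pile-val-backup", split=f"validation[:{NUM_SAMPLES}]")
ds = ds.map(
    lambda sample: tokenizer(
        sample["text"], max_length=MAX_SEQ_LEN, truncation=True, add_special_tokens=False
    ),
    remove_columns=ds.column_names,
)

# NVFP4 on all Linear layers, keeping lm_head in full precision (matches the recipe above)
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Seed-OSS-36B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Seed-OSS-36B-Instruct-NVFP4")
```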
