Seed-OSS-36B-Instruct Quantized with NVFP4
This repo contains Seed-OSS-36B-Instruct quantized with NVFP4, for maximum performance on NVIDIA Blackwell hardware (RTX 5070, 5080, 5090, RTX Pro 6000, B200, B300, ...).
It can only be run on architectures with hardware FP4 support (Blackwell or later).
Original model: ByteDance-Seed/Seed-OSS-36B-Instruct
This model requires ~21.1GB of VRAM for the weights; however, the KV cache at the maximum context size of 512k tokens requires a further 128GB of VRAM.
Make sure to set an appropriate context size with `--max-model-len`
in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
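To see where the 128GB figure comes from, here is a back-of-the-envelope KV-cache calculation. The architecture numbers (64 layers, 8 KV heads, head dimension 128) are assumptions about Seed-OSS-36B, with the cache held in bf16 (2 bytes per value):

```python
# Back-of-the-envelope KV-cache sizing for Seed-OSS-36B at 512k context.
# Assumed architecture: 64 layers, 8 KV heads, head_dim 128, bf16 cache.

num_layers = 64
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2          # bf16
context_len = 512 * 1024    # 512k tokens

# Per token we cache one K and one V vector per layer per KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
total_bytes = bytes_per_token * context_len

print(f"{bytes_per_token // 1024} KiB per token")
print(f"{total_bytes // 1024**3} GiB for the full 512k context")
```

Under these assumptions the cache costs 256 KiB per token, i.e. 128 GiB at 512k tokens; quantizing the KV cache (e.g. vLLM's `--kv-cache-dtype fp8`) would halve that.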
📥 Usage & Running Instructions
The model was tested with vLLM on 2x RTX Pro 6000; here is a script suitable for such a configuration.
```shell
export MODEL="mratsim/Seed-OSS-36B-Instruct-NVFP4"

vllm serve "${MODEL}" \
  --served-model-name seed-oss-36b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
```
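Once running, the server exposes an OpenAI-compatible API (on port 8000 by default). A minimal sketch of a chat-completion request body, assuming the `--served-model-name` from the script above; the prompt is purely illustrative:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# Assumes the server above runs locally on the default port 8000 and was
# started with --served-model-name seed-oss-36b.
payload = {
    "model": "seed-oss-36b",
    "messages": [
        {"role": "user", "content": "Give a one-line summary of NVFP4."}
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
print(body)

# Send it with e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d "$BODY"
```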
🔬 Quantization method
The llmcompressor library was used with the following recipe:
```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
and calibrated on 512 samples of sequence length 4096 from mit-han-lab/pile-val-backup.
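The recipe and calibration settings above correspond roughly to the following llmcompressor invocation. This is a sketch modeled on llmcompressor's quantization examples, not the exact script used for this repo, and it needs a Blackwell-class GPU setup plus the original weights to actually run:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same recipe as the YAML above: quantize every Linear layer to NVFP4,
# leaving lm_head in higher precision.
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4", ignore=["lm_head"]
)

# Calibrate on 512 samples of sequence length 4096, as stated above.
oneshot(
    model=model,
    dataset=load_dataset("mit-han-lab/pile-val-backup", split="validation"),
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=512,
)

model.save_pretrained("Seed-OSS-36B-Instruct-NVFP4")
tokenizer.save_pretrained("Seed-OSS-36B-Instruct-NVFP4")
```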
Model tree for mratsim/Seed-OSS-36B-Instruct-NVFP4
Base model: ByteDance-Seed/Seed-OSS-36B-Instruct