DeepSeek-R1-Distill-Qwen-14B-FP8

An FP8-quantized version of DeepSeek-R1-Distill-Qwen-14B, optimized for inference with vLLM. Quantizing weights and activations to FP8 reduces the model's memory footprint by approximately 50% relative to the 16-bit original.

Model Overview

  • Base Model: DeepSeek-R1-Distill-Qwen-14B
  • Quantization: FP8 (weights and activations)
  • Model Size: 14.8B parameters (safetensors; BF16 and F8_E4M3 tensors)
  • Memory Reduction: ~50% (from 16-bit to 8-bit)
  • License: MIT (following the original model's license)

Compression Details

The model was compressed using LLM Compressor with:

  • 512 calibration samples from UltraChat
  • Symmetric per-tensor quantization
  • Applied to linear operators within transformer blocks

The compression script is available in compress.py.
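
The exact contents of compress.py are not reproduced in this card. As a rough sketch of an equivalent llm-compressor one-shot recipe — assuming the HuggingFaceH4/ultrachat_200k train_sft split, a 2048-token calibration sequence length, and the library's default FP8 scheme with lm_head left unquantized (none of which are stated above) — the flow might look like this:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048  # assumed; not stated in the card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: 512 UltraChat samples rendered with the model's chat template.
# The specific dataset split is an assumption here.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Static, symmetric per-tensor FP8 quantization of weights and activations
# for the Linear layers in the transformer blocks; lm_head stays in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("DeepSeek-R1-Distill-Qwen-14B-FP8", save_compressed=True)
tokenizer.save_pretrained("DeepSeek-R1-Distill-Qwen-14B-FP8")
```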

Requirements

  • vLLM
  • transformers
  • torch
  • accelerate

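With the packages above installed, the checkpoint should load directly in vLLM, which reads the quantization config from the compressed safetensors. A minimal offline-inference sketch (the sampling parameters and context length are illustrative, not tested recommendations):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; vLLM detects the quantization scheme from the model config.
llm = LLM(model="enferAI/DeepSeek-R1-Distill-Qwen-14B-FP8", max_model_len=4096)

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Explain FP8 quantization in two sentences."], sampling)

for out in outputs:
    print(out.outputs[0].text)
```
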
Note

This is an experimental compression of the model. Performance metrics and optimal usage parameters have not been thoroughly tested yet.
