BNB4 (bitsandbytes 4-bit) quantization of GoToCompany/Llama-Sahabat-AI-v2-70B-IT

VRAM: ~40 GB

Tested on A100

https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT
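
For reference, a minimal loading sketch. It assumes this repo's id (komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4) and that the checkpoint ships its bitsandbytes 4-bit config, so transformers quantizes nothing at load time; bitsandbytes and accelerate must be installed.

```python
# Minimal loading sketch -- assumes the pre-quantized BNB4 checkpoint
# carries its own quantization config; bitsandbytes + accelerate installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    device_map="auto",           # place layers on the available GPU(s)
    torch_dtype=torch.bfloat16,  # compute dtype for non-quantized modules
)

inputs = tokenizer("Halo, apa kabar?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```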

NOTE

The full model weights require approximately 36 GB of VRAM to load, but that figure does not include the KV cache, which is essential during inference. KV-cache usage depends heavily on your serving backend (vLLM, SGLang, Aphrodite) and on your GPU architecture — particularly whether it supports FP8 for the cache in addition to FP16/BF16.
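
As a rough back-of-the-envelope, KV-cache size follows directly from the architecture. The dimensions below (80 layers, 8 grouped-query KV heads, head dim 128) are assumptions typical of a Llama-3-70B-class model and should be checked against this model's config.json:

```python
# Back-of-the-envelope KV-cache size. Dims are assumptions for a
# Llama-3-70B-class model (80 layers, 8 KV heads via GQA, head_dim 128);
# verify against config.json before relying on the numbers.
def kv_cache_gib(seq_len, batch_size=1, n_layers=80, n_kv_heads=8,
                 head_dim=128, dtype_bytes=2):  # 2 = FP16/BF16, 1 = FP8
    # Two tensors (K and V) per layer, each [batch, seq, kv_heads, head_dim].
    total = 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes
    return total / 2**30

print(f"{kv_cache_gib(8192):.1f} GiB")    # ~2.5 GiB at 8k context in BF16
print(f"{kv_cache_gib(131072):.1f} GiB")  # ~40 GiB at 128k context in BF16
```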

For lower-VRAM environments, consider enabling features such as:

  • Attention head swapping
  • Paged KV-cache memory
  • VRAM↔RAM offloading (CPU↔GPU memory swap)

These optimizations are backend-specific and require proper configuration. Please consult your inference engine’s documentation for details.
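
As one illustration of the last point, transformers/accelerate can spill layers that exceed a per-GPU cap into system RAM via a max_memory map. The limits below are illustrative, and bitsandbytes checkpoints may additionally need the llm_int8_enable_fp32_cpu_offload flag in BitsAndBytesConfig; offloaded layers run on CPU, so expect a large throughput hit.

```python
# Sketch of VRAM->RAM offload via accelerate's max_memory map
# (memory caps are illustrative; tune to your hardware).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4",
    device_map="auto",
    max_memory={0: "38GiB", "cpu": "64GiB"},  # cap GPU 0, spill rest to RAM
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # allow offloaded modules
    ),
)
```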
