BNB4 (bitsandbytes 4-bit) quantization of GoToCompany/Llama-Sahabat-AI-v2-70B-IT

VRAM: ~40 GB

Tested on A100

https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT
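
For reference, a minimal loading sketch. It assumes this repo's id (komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4) and that the checkpoint ships its bitsandbytes 4-bit config, so transformers quantizes nothing at load time; bitsandbytes and accelerate must be installed.

```python
# Minimal loading sketch -- assumes the pre-quantized BNB4 checkpoint
# carries its own quantization config; bitsandbytes + accelerate installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    device_map="auto",           # place layers on the available GPU(s)
    torch_dtype=torch.bfloat16,  # compute dtype for non-quantized modules
)

inputs = tokenizer("Halo, apa kabar?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```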

NOTE

The full model weights require approximately 36 GB of VRAM to load, but that figure does not include the KV cache, which is essential during inference. KV-cache usage depends heavily on your serving backend (vLLM, SGLang, Aphrodite) and on your GPU architecture — particularly whether it supports FP8 for the cache in addition to FP16/BF16.
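
As a rough back-of-the-envelope, KV-cache size follows directly from the architecture. The dimensions below (80 layers, 8 grouped-query KV heads, head dim 128) are assumptions typical of a Llama-3-70B-class model and should be checked against this model's config.json:

```python
# Back-of-the-envelope KV-cache size. Dims are assumptions for a
# Llama-3-70B-class model (80 layers, 8 KV heads via GQA, head_dim 128);
# verify against config.json before relying on the numbers.
def kv_cache_gib(seq_len, batch_size=1, n_layers=80, n_kv_heads=8,
                 head_dim=128, dtype_bytes=2):  # 2 = FP16/BF16, 1 = FP8
    # Two tensors (K and V) per layer, each [batch, seq, kv_heads, head_dim].
    total = 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes
    return total / 2**30

print(f"{kv_cache_gib(8192):.1f} GiB")    # ~2.5 GiB at 8k context in BF16
print(f"{kv_cache_gib(131072):.1f} GiB")  # ~40 GiB at 128k context in BF16
```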

For lower-VRAM environments, consider enabling features such as:

  • Attention head swapping
  • Paged KV-cache memory
  • VRAM↔RAM offloading (CPU↔GPU memory swap)

These optimizations are backend-specific and require proper configuration. Please consult your inference engine’s documentation for details.
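
As one illustration of the last point, transformers/accelerate can spill layers that exceed a per-GPU cap into system RAM via a max_memory map. The limits below are illustrative, and bitsandbytes checkpoints may additionally need the llm_int8_enable_fp32_cpu_offload flag in BitsAndBytesConfig; offloaded layers run on CPU, so expect a large throughput hit.

```python
# Sketch of VRAM->RAM offload via accelerate's max_memory map
# (memory caps are illustrative; tune to your hardware).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4",
    device_map="auto",
    max_memory={0: "38GiB", "cpu": "64GiB"},  # cap GPU 0, spill rest to RAM
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # allow offloaded modules
    ),
)
```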
