BNB 4-bit (bitsandbytes) quantization of GoToCompany/Llama-Sahabat-AI-v2-70B-IT
VRAM: ~40 GB
Tested on an NVIDIA A100
https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT
NOTE
While the full model weights require approximately 36 GB of VRAM to load, this figure does not include the memory needed for the KV cache, which is essential during inference. KV cache usage depends heavily on your serving backend (vLLM, SGLang, Aphrodite) and on your GPU architecture, in particular whether it supports FP8, FP16, or BF16 cache storage.
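As a rough sanity check on these numbers, the weight and KV cache footprints can be estimated with back-of-the-envelope arithmetic. The architecture constants below (80 layers, 8 KV heads, head dim 128) are assumptions based on the standard Llama-3.1-70B configuration; confirm them against this model's config.json.

```python
NUM_PARAMS = 70e9    # ~70B parameters
NUM_LAYERS = 80      # assumed (Llama-3.1-70B)
NUM_KV_HEADS = 8     # assumed (grouped-query attention)
HEAD_DIM = 128       # assumed

# 4-bit weights: 0.5 bytes per parameter. Quantization constants and any
# layers kept in higher precision push ~35 GB up toward the ~36 GB figure.
weight_gb = NUM_PARAMS * 0.5 / 1e9
print(f"4-bit weights: ~{weight_gb:.0f} GB")

def kv_cache_bytes(context_len: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * dtype_bytes
    return per_token * context_len

fp16 = kv_cache_bytes(8192, dtype_bytes=2)
fp8 = kv_cache_bytes(8192, dtype_bytes=1)
print(f"KV cache @ 8k context, FP16: {fp16 / 2**30:.2f} GiB")
print(f"KV cache @ 8k context, FP8:  {fp8 / 2**30:.2f} GiB")
```

Under these assumptions an 8k context costs about 2.5 GiB of cache in FP16 and half that in FP8, which is why the practical budget (~40 GB) sits above the raw weight size.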
For lower VRAM environments, you may consider enabling features such as:
- Attention head swapping
- Paged KV memory
- VRAM↔RAM offloading (CPU/GPU memory swap)
These optimizations are backend-specific and require proper configuration. Please consult your inference engine’s documentation for details.
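As one backend-specific illustration, vLLM exposes flags for several of these optimizations. The invocation below is a sketch only: the flag names come from vLLM's engine arguments and should be verified against your installed version, and the exact values are placeholders to tune for your hardware.

```shell
# Sketch of a memory-constrained vLLM launch -- verify flags for your version.
vllm serve komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4 \
  --quantization bitsandbytes \
  --kv-cache-dtype fp8 \
  --cpu-offload-gb 8 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192
```

`--kv-cache-dtype fp8` halves cache storage on GPUs that support it, and `--cpu-offload-gb` swaps part of the weights into system RAM at some throughput cost.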
Model tree for komixenon/GoToCompany-Llama-Sahabat-AI-v2-70B-IT-BNB4
Base model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT