πŸ¦™ GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT

πŸ”§ VRAM Recommendation

  • 40 GB VRAM recommended
  • Q2 Tested on: RTX 3090

Original model:
πŸ‘‰ GoToCompany/Llama-Sahabat-AI-v2-70B-IT

πŸ“‰ Perplexity Notes

As expected, lower precision quantization results in higher perplexity.
This GGUF version is a side project intended to support llama.cpp-based backends, allowing inference on much lower-spec hardware; a minimal usage sketch follows the list below.

Use cases include:

  • πŸ–₯️ CPU-only inference (AVX-512 capable CPU recommended)
  • 🌐 Distributed inference systems using GGUF quantized models
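A minimal sketch using the llama-cpp-python bindings (one of several llama.cpp-based backends). The filename and parameter values below are illustrative, not the exact files shipped in this repo:

```python
# CPU-only inference sketch with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",  # hypothetical filename
    n_ctx=4096,       # context window; raise only if you have RAM for the KV cache
    n_threads=16,     # roughly match your physical core count for CPU-only runs
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Jelaskan apa itu kuantisasi GGUF."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```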

⚠️ Model Size & Inference

  • The full model weights require ~25 GB of VRAM to load.
  • This does not include the additional memory required for the KV cache, which is essential for inference (a rough estimate is sketched below).
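As a back-of-the-envelope estimate (the layer and head counts below are assumptions based on typical Llama 70B configurations; check the GGUF metadata of the actual file):

```python
# Rough KV-cache size estimate for a Llama-style 70B model.
# All architecture numbers here are assumptions -- verify against the GGUF metadata.
n_layers   = 80      # transformer blocks
n_kv_heads = 8       # grouped-query attention KV heads
head_dim   = 128     # dimension per attention head
bytes_per  = 2       # fp16 K/V entries
n_ctx      = 8192    # planned context length

per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per  # K and V
total_gib = per_token_bytes * n_ctx / 1024**3
print(f"~{per_token_bytes / 1024:.0f} KiB per token, ~{total_gib:.1f} GiB at {n_ctx} tokens")
```

Under those assumptions, an 8K context adds roughly 2.5 GiB on top of the weights, and the cost scales linearly with context length.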

πŸ“„ Modelfile Included

A prebuilt Modelfile for the Q2 quant is included for use with Ollama; to run the Q4 quant instead, edit the model filename in the Modelfile (see the example below).

➑️ See: Ollama: Modelfile docs
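The relevant change is the `FROM` line. The filename below is illustrative; use the actual Q4 GGUF filename shipped in this repository:

```
# Hypothetical filename -- replace with the actual Q4 GGUF file from this repo
FROM ./Llama-Sahabat-AI-v2-70B-IT-Q4_K_M.gguf
```

Then rebuild the local model with `ollama create <name> -f Modelfile`.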

🧠 Optional Optimizations

For lower-VRAM environments, you may consider enabling features like:

  • βœ… Attention head swapping

These features are backend-specific. Please refer to your inference engine’s documentation for configuration.
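As one concrete example, llama.cpp-based backends can offload only part of the model to the GPU via `n_gpu_layers`. Note this is partial layer offload rather than the head-swapping feature named above, and the values shown are only a starting point:

```python
# Partial GPU offload sketch with llama-cpp-python for low-VRAM setups.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=40,   # offload roughly half the layers; tune to fit your VRAM
    n_ctx=4096,
)
```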

πŸ“¦ Model Details

  • Format: GGUF
  • Model size: 70.6B params
  • Architecture: llama
  • Available quantizations: 2-bit (Q2), 4-bit (Q4)