---
license: llama3.1
language:
- id
tags:
- llm
- instruction-tuned
- indonesian
- gguf
- quantized
model_type: llama
library_name: transformers
pipeline_tag: text-generation
base_model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT
---
# 🦙 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT
### 🔧 VRAM Recommendation
- **40 GB VRAM recommended**
- **Q2 Tested on: RTX 3090**
Original model:
👉 [GoToCompany/Llama-Sahabat-AI-v2-70B-IT](https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT)
## 📉 Perplexity Notes
As expected, **lower precision quantization results in higher perplexity**.
This GGUF conversion is a side project intended to support **llama.cpp**-based backends, allowing inference on much lower-spec hardware.
Use cases include:
- 🖥️ **CPU-only inference** (AVX-512 capable CPU recommended); see the sketch after this list
- 🌐 **Distributed inference systems** using GGUF quantized models
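As a minimal sketch of the CPU-only use case, the snippet below loads one of the quantized files with the `llama-cpp-python` bindings. The GGUF filename, thread count, and context size are placeholders; adjust them to the file you downloaded and your hardware.

```python
# Minimal CPU-only inference sketch using the llama-cpp-python bindings
# (pip install llama-cpp-python). The GGUF filename below is a placeholder;
# point it at whichever quant (Q2/Q4) you downloaded from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",  # placeholder filename
    n_ctx=4096,        # context window; larger values grow the KV cache
    n_threads=16,      # match your physical core count
    n_gpu_layers=0,    # 0 = pure CPU; raise to offload layers if a GPU is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Jelaskan apa itu kuantisasi model secara singkat."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```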
## ⚠️ Model Size & Inference
- The **full model weights require ~25 GB of VRAM** to load.
- This **does not include** additional memory required for **KV cache**, which is essential for inference.
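As a rough, illustrative estimate (assuming the standard Llama 3.1 70B layout of 80 layers, 8 KV heads under grouped-query attention, and a head dimension of 128), an fp16 KV cache costs about 2 × 80 × 8 × 128 × 2 bytes ≈ 0.3 MB per token, i.e. roughly 2.5 GB at an 8K context. Backends that quantize the KV cache will need less.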
## 📄 Modelfile Included
A prebuilt `Modelfile` is included for use with **Ollama**; it points at the Q2 quant by default. Edit the model filename in the `Modelfile` to switch to Q4.
➡️ See: [Ollama: Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#build-from-a-gguf-file)
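If you switch quants, the only change needed is the `FROM` line in the `Modelfile`, e.g. replacing `FROM ./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf` with the Q4 filename (the filenames here are illustrative; use the ones shipped in this repo). After editing, rebuild with `ollama create <name> -f Modelfile` and start it with `ollama run <name>`.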
## 🧠 Optional Optimizations
For lower-VRAM environments, you may consider enabling features like:
- ✅ **Attention head swapping**
> These features are **backend-specific**. Please refer to your inference engine’s documentation for configuration.