|
--- |
|
license: llama3.1 |
|
language: |
|
- id |
|
tags: |
|
- llm |
|
- instruction-tuned |
|
- indonesian |
|
- gguf |
|
- quantized |
|
model_type: llama |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
base_model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT |
|
--- |
|
# 🦙 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT
|
|
|
### 🧠 VRAM Recommendation
|
- **40 GB of VRAM recommended**

- **Q2 quant tested on an RTX 3090** (24 GB)
|
|
|
Original model: |
|
🔗 [GoToCompany/Llama-Sahabat-AI-v2-70B-IT](https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT)
|
|
|
|
|
## 📉 Perplexity Notes
|
|
|
As expected, **lower-precision quantization results in higher perplexity**.

This GGUF conversion is a side project intended to support **llama.cpp-based backends**, allowing inference on much lower-spec hardware.
|
|
|
Use cases include: |
|
|
|
- 🖥️ **CPU-only inference** (an AVX-512-capable CPU is recommended; see the sketch after this list)
|
- 🌐 **Distributed inference systems** using GGUF quantized models
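
For the CPU-only case, here is a minimal sketch using the `llama-cpp-python` bindings. The GGUF filename is hypothetical; substitute the actual quant file you downloaded:

```python
from llama_cpp import Llama

# CPU-only load: n_gpu_layers=0 keeps every layer in system RAM.
# Expect slow generation for a 70B model, even on an AVX-512 CPU.
llm = Llama(
    model_path="./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=0,   # no GPU offload
    n_threads=16,     # match your physical core count
)

result = llm("Tuliskan satu pantun singkat.", max_tokens=64)
print(result["choices"][0]["text"])
```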
|
|
|
|
|
## ⚠️ Model Size & Inference
|
|
|
- The **full model weights require ~25 GB of VRAM** to load.

- This **does not include** the additional memory required for the **KV cache**, which grows with context length and is essential for inference.
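
For a sense of scale, here is a back-of-the-envelope KV cache estimate, assuming the published Llama 3.1 70B geometry (80 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 cache; actual usage depends on your backend and any KV quantization:

```python
# Back-of-the-envelope KV cache size for a Llama 3.1 70B model.
n_layers = 80       # transformer layers
n_kv_heads = 8      # grouped-query attention KV heads
head_dim = 128      # dimension per head
bytes_per_elem = 2  # fp16

# Keys + values, per token, across all layers:
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")          # ~320 KiB

# At an 8192-token context:
print(f"{kv_bytes_per_token * 8192 / 2**30:.1f} GiB at 8k ctx")  # ~2.5 GiB
```

Budget this on top of the weight footprint when sizing VRAM.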
|
|
|
|
|
## 📄 Modelfile Included
|
|
|
A prebuilt `Modelfile` targeting the Q2 quant is included for use with **Ollama**; to use the Q4 quant instead, edit the model filename in the `Modelfile`.
|
|
|
➡️ See: [Ollama: Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#build-from-a-gguf-file)
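
For reference, a `Modelfile` built from a GGUF can be as simple as the sketch below (the filename is illustrative; point it at the actual Q2 or Q4 file from this repo):

```
# Point FROM at the quant you downloaded (filename illustrative).
FROM ./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf
```

Then build and run with `ollama create <name> -f Modelfile` followed by `ollama run <name>`.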
|
|
|
|
|
## 🔧 Optional Optimizations
|
|
|
For lower-VRAM environments, you may consider enabling features like: |
|
|
|
- ✅ **Attention head swapping**
|
|
|
> These features are **backend-specific**. Please refer to your inference engine's documentation for configuration; a generic offloading sketch follows below.
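
As one concrete illustration (a common llama.cpp-family technique, not necessarily the specific feature named above), partial GPU offloading loads only some layers onto the card while the rest run on the CPU. A minimal sketch with the `llama-cpp-python` bindings, again with a hypothetical filename:

```python
from llama_cpp import Llama

# Load the GGUF with only part of the network on the GPU; the rest
# stays in system RAM. Tune n_gpu_layers to fit your card (e.g. a
# fraction of the 80 layers of a 70B model on a 24 GB GPU).
llm = Llama(
    model_path="./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=45,   # partial offload; -1 would offload everything
    n_ctx=4096,        # a smaller context also shrinks the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Halo! Apa kabar?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```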
|
|