---
license: llama3.1
language:
- id
tags:
- llm
- instruction-tuned
- indonesian
- gguf
- quantized
model_type: llama
library_name: transformers
pipeline_tag: text-generation
base_model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT
---
# 🦙 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT
### 🔧 VRAM Recommendation
- **40 GB VRAM recommended**
- **Q2 quantization tested on an RTX 3090**
Original model:
👉 [GoToCompany/Llama-Sahabat-AI-v2-70B-IT](https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT)
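To pull a single quantization from this repo without cloning everything, `huggingface_hub` can fetch one GGUF file at a time. A minimal sketch; the `repo_id` and `filename` below are placeholders, so substitute the actual names shown in the *Files and versions* tab.

```python
from huggingface_hub import hf_hub_download

# Download one quantized GGUF file from this repo.
# NOTE: repo_id and filename are placeholders; use the actual names
# listed in the "Files and versions" tab of this model page.
gguf_path = hf_hub_download(
    repo_id="<this-repo-id>",
    filename="Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",
)
print(gguf_path)  # local path to the downloaded file
```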
## 📉 Perplexity Notes
As expected, **lower precision quantization results in higher perplexity**.
This GGUF version is intended as a side project to support **llama.cpp-based backends**, allowing inference on much lower-spec hardware.
Use cases include:
- 🖥️ **CPU-only inference** (AVX-512 capable CPU recommended; see the sketch below)
- 🌐 **Distributed inference systems** using GGUF quantized models
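For the CPU-only case, a minimal sketch using the `llama-cpp-python` bindings (one of several llama.cpp-based backends); the GGUF filename, thread count, and context size are assumptions to adapt to your machine.

```python
from llama_cpp import Llama

# CPU-only inference with llama-cpp-python.
# model_path, n_threads and n_ctx are assumptions; adjust to your setup.
llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",
    n_ctx=4096,     # context window; the KV cache grows with this value
    n_threads=16,   # number of physical cores; AVX-512 CPUs benefit most
)

out = llm(
    "Jelaskan secara singkat apa itu komputasi awan.",
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```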
## ⚠️ Model Size & Inference
- The **full model weights require ~25 GB of VRAM** to load.
- This **does not include** additional memory required for **KV cache**, which is essential for inference.
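To get a feel for the KV cache overhead, here is a rough back-of-the-envelope estimate. It assumes the usual Llama 3.1 70B shape (80 layers, 8 KV heads via GQA, head dimension 128) and an fp16 cache; the actual footprint depends on your backend and any KV quantization.

```python
# Rough KV cache size estimate (fp16 cache, no KV quantization).
# Architecture values assume Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head_dim 128.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2   # fp16
n_ctx = 8192          # tokens kept in context

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
total_bytes = per_token * n_ctx
print(f"~{per_token / 1024:.0f} KiB per token, ~{total_bytes / 2**30:.1f} GiB at {n_ctx} tokens")
```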
## 📄 Modelfile Included
A prebuilt `Modelfile` for the Q2 quantization is included for use with **Ollama**; to run the Q4 quantization instead, edit the GGUF filename referenced in the `Modelfile`.
➡️ See: [Ollama: Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#build-from-a-gguf-file)
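Once the model has been created in Ollama from the included `Modelfile` (see the linked docs), it can be called like any other local model. A sketch using the official `ollama` Python client; the model name below is hypothetical and should match whatever name you passed to `ollama create`.

```python
import ollama

# "sahabat-ai-70b-q2" is a hypothetical name; use the name you gave
# to `ollama create` when building from the included Modelfile.
response = ollama.chat(
    model="sahabat-ai-70b-q2",
    messages=[{"role": "user", "content": "Tolong jelaskan apa itu model bahasa besar."}],
)
print(response["message"]["content"])
```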
## 🧠 Optional Optimizations
For lower-VRAM environments, you may consider enabling features like:
- ✅ **Attention head swapping**
> These features are **backend-specific**. Please refer to your inference engine's documentation for configuration.
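Attention head swapping itself depends on the backend, but a related and widely available lever in llama.cpp-based backends is **partial layer offloading**: keep only as many transformer layers in VRAM as will fit and run the rest on the CPU. A sketch with `llama-cpp-python` (requires a GPU-enabled build); the filename and layer count are assumptions.

```python
from llama_cpp import Llama

# Partial offload: only some layers live in VRAM, the rest stay on the CPU.
# n_gpu_layers is an assumption; raise it until VRAM is nearly full.
llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",
    n_gpu_layers=40,   # out of ~80 total layers in a 70B Llama
    n_ctx=2048,        # a smaller context window also shrinks the KV cache
)
```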