---
license: llama3.1
language:
- id
tags:
- llm
- instruction-tuned
- indonesian
- gguf
- quantized
model_type: llama
library_name: transformers
pipeline_tag: text-generation
base_model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT
---
# 📦 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT
## 🧠 VRAM Recommendation

- 40 GB of VRAM recommended
- Q2 tested on: RTX 3090
Original model:
🔗 [GoToCompany/Llama-Sahabat-AI-v2-70B-IT](https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT)
## 📉 Perplexity Notes

As expected, lower-precision quantization results in higher perplexity.

This GGUF version is intended as a side project to support llama.cpp-based backends, allowing inference on much lower-spec hardware.

Use cases include:

- 🖥️ CPU-only inference (an AVX-512-capable CPU is recommended)
- 🌐 Distributed inference systems using GGUF-quantized models
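For CPU-only use, a typical llama.cpp invocation looks like the sketch below. The GGUF filename is an assumption; substitute the quant file you actually downloaded, and tune `-t` and `-c` to your machine.

```shell
# Sketch: CPU-only inference with llama.cpp's llama-cli.
# -m: path to the GGUF file (filename here is an assumption)
# -t: CPU threads to use; -c: context length; -p: prompt
./llama-cli \
  -m Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf \
  -t "$(nproc)" \
  -c 4096 \
  -p "Halo, apa kabar?"
```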
## ⚠️ Model Size & Inference

- The full model weights require ~25 GB of VRAM to load.
- This does not include the additional memory required for the KV cache, which is essential for inference.
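To budget for the KV cache on top of the weights, you can estimate its size from the architecture. The numbers below are assumptions for a typical Llama-3.1-70B-style model (80 layers, 8 KV heads via GQA, head dim 128, fp16 cache), not values read from this repository's GGUF metadata:

```python
# Rough KV-cache size estimate; architecture numbers are assumptions
# (typical Llama-3.1-70B values), not read from this repo's GGUF metadata.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Keys + values (factor of 2), one entry per layer per token, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, ctx_len=8192)
print(f"{size / 2**30:.2f} GiB")  # 2.50 GiB at 8k context in fp16
```

Longer contexts scale this linearly, so plan headroom accordingly.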
## 📄 Modelfile Included

A prebuilt Modelfile for the Q2 quant is already included for use with Ollama. To switch to Q4, edit the model filename in the Modelfile.

➡️ See: [Ollama Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md)
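For reference, the swap amounts to changing one line. This is a minimal sketch; the GGUF filenames below are assumptions and should match the actual files in this repo:

```
# Minimal Ollama Modelfile sketch (filenames are assumptions)
FROM ./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf
# To use the Q4 quant instead, point FROM at the Q4 file, e.g.:
# FROM ./Llama-Sahabat-AI-v2-70B-IT.Q4_K_M.gguf
```

Then register it with `ollama create sahabat-ai -f Modelfile` and run it with `ollama run sahabat-ai`.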
## 🔧 Optional Optimizations

For lower-VRAM environments, you may consider enabling features such as:

- ✅ Attention head swapping

These features are backend-specific; refer to your inference engine's documentation for configuration.
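As one concrete example of a backend-specific knob: llama.cpp can split the model between GPU and CPU with `-ngl` (number of layers offloaded to the GPU). This is a sketch, not a tuned setting; the filename and layer count are assumptions to adjust for your hardware:

```shell
# Sketch: partial GPU offload with llama.cpp (filename and -ngl value are assumptions).
# Raise -ngl until you run out of VRAM; remaining layers stay on the CPU.
./llama-cli \
  -m Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf \
  -ngl 40 \
  -c 4096 \
  -p "Halo, apa kabar?"
```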