---
license: llama3.1
language:
  - id
tags:
  - llm
  - instruction-tuned
  - indonesian
  - gguf
  - quantized
model_type: llama
library_name: transformers
pipeline_tag: text-generation
base_model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT
---
# 🦙 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT

### 🔧 VRAM Recommendation
- **40 GB of VRAM recommended**
- **Q2 quantization tested on an RTX 3090**

Original model:  
👉 [GoToCompany/Llama-Sahabat-AI-v2-70B-IT](https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT)


## 📉 Perplexity Notes

As expected, **lower-precision quantization results in higher perplexity**.  
This GGUF version is intended as a side project to support **llama.cpp-based backends**, allowing inference on much lower-spec hardware.

Use cases include:

- 🖥️ **CPU-only inference** (AVX-512-capable CPU recommended; see the sketch after this list)  
- 🌐 **Distributed inference systems** using GGUF quantized models
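As a minimal sketch of the CPU-only case, the snippet below uses the `llama-cpp-python` bindings. The GGUF filename, thread count, and generation settings are assumptions; adjust them to the quantization you downloaded and to your hardware.

```python
# Minimal CPU-only inference sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF filename is an assumption -- point it at the quant you actually downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",  # assumed filename
    n_ctx=4096,       # context window; larger values grow the KV cache
    n_threads=16,     # set to your physical core count
    n_gpu_layers=0,   # 0 = pure CPU inference
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Jelaskan apa itu kuantisasi model secara singkat."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```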


## ⚠️ Model Size & Inference

- The **full model weights require ~25 GB of VRAM** to load.
- This **does not include** the additional memory required for the **KV cache**, which is essential for inference (a rough estimate is sketched below).
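
For capacity planning, the KV-cache footprint can be estimated from the model architecture. The sketch below assumes the usual Llama-3.1-70B configuration (80 layers, 8 KV heads, head dimension 128) and an fp16 cache; treat the result as a rough approximation only.

```python
# Rough KV-cache size estimate; the architecture values below are assumptions
# (typical Llama-3.1-70B config) and the cache is assumed to be stored in fp16.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2        # fp16; a quantized KV cache would be smaller
ctx_len = 8192            # desired context length in tokens

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len  # factor 2 = K and V
print(f"~{kv_bytes / 1024**3:.1f} GiB of KV cache at {ctx_len} tokens")     # ~2.5 GiB
```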


## 📄 Modelfile Included

A prebuilt `Modelfile` for the Q2 quantization is included for use with **Ollama**. To use Q4 instead, edit the model filename in the `Modelfile`.

➡️ See: [Ollama: Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#build-from-a-gguf-file)
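
For reference, a Modelfile of this kind only needs a `FROM` line pointing at the GGUF file. The filenames below are assumptions; match them to the files actually shipped in this repository.

```
# Minimal Modelfile sketch -- filenames are assumptions
FROM ./Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf

# To use Q4 instead, point FROM at the Q4 file, e.g.:
# FROM ./Llama-Sahabat-AI-v2-70B-IT-Q4_K_M.gguf
```

The model can then be built with `ollama create <model-name> -f Modelfile`.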


## 🧠 Optional Optimizations

For lower-VRAM environments, you may consider enabling features like:

- **Attention head swapping**

> These features are **backend-specific**. Please refer to your inference engine’s documentation for configuration.
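
As one illustration (a sketch only, and not necessarily the exact feature named above), `llama-cpp-python` exposes memory-related options such as partial layer offloading and keeping the KV cache in system RAM. The values below are assumptions for a ~24 GB GPU.

```python
# Backend-specific memory options in llama-cpp-python; all values are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",  # assumed filename
    n_gpu_layers=40,     # offload only part of the layers to the GPU
    offload_kqv=False,   # keep the KV cache in system RAM to save VRAM
    n_ctx=4096,
)
```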