|
--- |
|
license: llama3.1 |
|
language: |
|
- id |
|
tags: |
|
- llm |
|
- instruction-tuned |
|
- indonesian |
|
- gguf |
|
- quantized |
|
model_type: llama |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
base_model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT |
|
--- |
|
# 🦙 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT
|
|
|
### 🧠 VRAM Recommendation
|
- **40 GB of VRAM recommended**

- **Q2 quant tested on an RTX 3090** (24 GB)
|
|
|
Original model: |
|
🔗 [GoToCompany/Llama-Sahabat-AI-v2-70B-IT](https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT)
|
|
|
|
|
## 📉 Perplexity Notes
|
|
|
As expected, **lower-precision quantization results in higher perplexity**.

This GGUF conversion is a side project intended to support **llama.cpp-based backends**, allowing inference on much lower-spec hardware.
|
|
|
Use cases include: |
|
|
|
- 🖥️ **CPU-only inference** (an AVX-512-capable CPU is recommended; see the sketch after this list)
|
- 🌐 **Distributed inference systems** using GGUF quantized models
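
For the CPU-only case, here is a minimal sketch using the `llama-cpp-python` bindings. The GGUF filename is hypothetical; substitute the actual quant file you downloaded:

```python
from llama_cpp import Llama

# CPU-only load: n_gpu_layers=0 keeps every layer in system RAM.
# Expect slow generation for a 70B model, even on an AVX-512 CPU.
llm = Llama(
    model_path="./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=0,   # no GPU offload
    n_threads=16,     # match your physical core count
)

result = llm("Tuliskan satu pantun singkat.", max_tokens=64)
print(result["choices"][0]["text"])
```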
|
|
|
|
|
## ⚠️ Model Size & Inference
|
|
|
- The **full model weights require ~25 GB of VRAM** to load.

- This **does not include** the additional memory required for the **KV cache**, which grows with context length and is essential for inference.
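
For a sense of scale, here is a back-of-the-envelope KV cache estimate, assuming the published Llama 3.1 70B geometry (80 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 cache; actual usage depends on your backend and any KV quantization:

```python
# Back-of-the-envelope KV cache size for a Llama 3.1 70B model.
n_layers = 80       # transformer layers
n_kv_heads = 8      # grouped-query attention KV heads
head_dim = 128      # dimension per head
bytes_per_elem = 2  # fp16

# Keys + values, per token, across all layers:
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")          # ~320 KiB

# At an 8192-token context:
print(f"{kv_bytes_per_token * 8192 / 2**30:.1f} GiB at 8k ctx")  # ~2.5 GiB
```

Budget this on top of the weight footprint when sizing VRAM.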
|
|
|
|
|
## 📄 Modelfile Included
|
|
|
A prebuilt `Modelfile` targeting the Q2 quant is included for use with **Ollama**; to use the Q4 quant instead, edit the model filename in the `Modelfile`.
|
|
|
➡️ See: [Ollama: Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#build-from-a-gguf-file)
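
For reference, a `Modelfile` built from a GGUF can be as simple as the sketch below (the filename is illustrative; point it at the actual Q2 or Q4 file from this repo):

```
# Point FROM at the quant you downloaded (filename illustrative).
FROM ./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf
```

Then build and run with `ollama create <name> -f Modelfile` followed by `ollama run <name>`.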
|
|
|
|
|
## 🔧 Optional Optimizations
|
|
|
For lower-VRAM environments, you may consider enabling features like: |
|
|
|
- ✅ **Attention head swapping**
|
|
|
> These features are **backend-specific**. Please refer to your inference engine's documentation for configuration; a generic offloading sketch follows below.
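
As one concrete illustration (a common llama.cpp-family technique, not necessarily the specific feature named above), partial GPU offloading loads only some layers onto the card while the rest run on the CPU. A minimal sketch with the `llama-cpp-python` bindings, again with a hypothetical filename:

```python
from llama_cpp import Llama

# Load the GGUF with only part of the network on the GPU; the rest
# stays in system RAM. Tune n_gpu_layers to fit your card (e.g. a
# fraction of the 80 layers of a 70B model on a 24 GB GPU).
llm = Llama(
    model_path="./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=45,   # partial offload; -1 would offload everything
    n_ctx=4096,        # a smaller context also shrinks the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Halo! Apa kabar?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```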
|
|