---
license: llama3.1
language:
  - id
tags:
  - llm
  - instruction-tuned
  - indonesian
  - gguf
  - quantized
model_type: llama
library_name: transformers
pipeline_tag: text-generation
base_model: GoToCompany/Llama-Sahabat-AI-v2-70B-IT
---

# 🦙 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT

## 🔧 VRAM Recommendation

- 40 GB VRAM recommended
- Q2 tested on: RTX 3090

Original model:
👉 [GoToCompany/Llama-Sahabat-AI-v2-70B-IT](https://huggingface.co/GoToCompany/Llama-Sahabat-AI-v2-70B-IT)

## 📉 Perplexity Notes

As expected, lower-precision quantization results in higher perplexity.
This GGUF version is intended as a side project to support llama.cpp-based backends, allowing inference on much lower-spec hardware.

Use cases include:

- 🖥️ CPU-only inference (AVX-512-capable CPU recommended); see the sketch after this list
- 🌐 Distributed inference systems using GGUF quantized models
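
As a concrete illustration of the CPU-only path, here is a minimal sketch using llama-cpp-python (one llama.cpp-based backend); the GGUF filename below is a placeholder, so substitute the actual file from this repo:

```python
from llama_cpp import Llama

# Load the quantized GGUF entirely on the CPU (n_gpu_layers=0 keeps all layers in RAM).
llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",  # placeholder filename
    n_ctx=4096,       # context window; larger values grow the KV cache
    n_threads=16,     # tune to your physical core count
    n_gpu_layers=0,   # CPU-only inference
)

# Simple chat-style request (an Indonesian prompt, matching the model's target language).
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Jelaskan secara singkat apa itu kuantisasi model."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```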

## ⚠️ Model Size & Inference

- The full model weights require ~25 GB of VRAM to load.
- This does not include the additional memory required for the KV cache, which is essential for inference; a rough estimate follows below.
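
For a rough sense of how much extra memory the KV cache adds, the sketch below estimates it from the context length. The geometry values are assumptions based on the standard Llama 3.1 70B architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128) with an fp16 cache; actual usage varies by backend and by any KV-cache quantization.

```python
# Rough KV-cache size estimate. The model geometry below is an assumption
# (standard Llama 3.1 70B layout), not values read from the GGUF itself.
n_layers = 80          # transformer layers
n_kv_heads = 8         # KV heads (grouped-query attention)
head_dim = 128         # per-head dimension
bytes_per_value = 2    # fp16 cache entries
n_ctx = 8192           # example context length in tokens

# Keys and values (factor of 2), per layer, per KV head, per dimension, per token.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value
print(f"~{kv_bytes / 2**30:.1f} GiB of KV cache at {n_ctx} tokens")
```

Under these assumptions the cache adds roughly 2.5 GiB at an 8K context, on top of the weights themselves.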

## 📄 Modelfile Included

A prebuilt Modelfile for the Q2 quant is included for use with Ollama; to run Q4 instead, edit the model filename in the Modelfile.
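
As a minimal sketch of what that edit looks like (the GGUF filename is a placeholder; use the actual files in this repo), switching from Q2 to Q4 only means changing the FROM line:

```
# Modelfile (sketch): point FROM at the quant you want to run.
# For Q4, replace the filename below with the Q4 GGUF from this repo.
FROM ./Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf
PARAMETER num_ctx 4096
```

After editing, register the model with `ollama create <name> -f Modelfile` and run it with `ollama run <name>`.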

➡️ See: Ollama Modelfile docs

## 🧠 Optional Optimizations

For lower-VRAM environments, you may consider enabling features like:

- ✅ Attention head swapping

These features are backend-specific. Please refer to your inference engine's documentation for configuration.
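
The exact knob and its name differ per backend. As one widely available example of the same idea (keeping only part of the model in VRAM), llama-cpp-python can offload just a subset of layers to the GPU and leave the rest in system RAM; the filename and layer count below are placeholders to tune for your hardware:

```python
from llama_cpp import Llama

# Partial offload: only some layers go to the GPU, the rest stay in system RAM.
llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT.Q2_K.gguf",  # placeholder filename
    n_gpu_layers=40,  # example value; raise or lower to fit your available VRAM
    n_ctx=4096,
)
```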