πŸ¦™ GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT

πŸ”§ VRAM Recommendation

  • 40 GB VRAM recommended
  • Q2 Tested on: RTX 3090

Original model:
πŸ‘‰ GoToCompany/Llama-Sahabat-AI-v2-70B-IT

πŸ“‰ Perplexity Notes

As expected, lower precision quantization results in higher perplexity.
This GGUF version is a side project intended to support llama.cpp-based backends, allowing inference on much lower-spec hardware; a minimal usage sketch follows the list below.

Use cases include:

  • πŸ–₯️ CPU-only inference (AVX-512 capable CPU recommended)
  • 🌐 Distributed inference systems using GGUF quantized models
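A minimal sketch using the llama-cpp-python bindings (one of several llama.cpp-based backends). The filename and parameter values below are illustrative, not the exact files shipped in this repo:

```python
# CPU-only inference sketch with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",  # hypothetical filename
    n_ctx=4096,       # context window; raise only if you have RAM for the KV cache
    n_threads=16,     # roughly match your physical core count for CPU-only runs
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Jelaskan apa itu kuantisasi GGUF."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```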

⚠️ Model Size & Inference

  • The full model weights require ~25 GB of VRAM to load.
  • This does not include the additional memory required for the KV cache, which is essential for inference (a rough estimate is sketched below).
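As a back-of-the-envelope estimate (the layer and head counts below are assumptions based on typical Llama 70B configurations; check the GGUF metadata of the actual file):

```python
# Rough KV-cache size estimate for a Llama-style 70B model.
# All architecture numbers here are assumptions -- verify against the GGUF metadata.
n_layers   = 80      # transformer blocks
n_kv_heads = 8       # grouped-query attention KV heads
head_dim   = 128     # dimension per attention head
bytes_per  = 2       # fp16 K/V entries
n_ctx      = 8192    # planned context length

per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per  # K and V
total_gib = per_token_bytes * n_ctx / 1024**3
print(f"~{per_token_bytes / 1024:.0f} KiB per token, ~{total_gib:.1f} GiB at {n_ctx} tokens")
```

Under those assumptions, an 8K context adds roughly 2.5 GiB on top of the weights, and the cost scales linearly with context length.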

πŸ“„ Modelfile Included

A prebuilt Modelfile for the Q2 quant is included for use with Ollama; to run the Q4 quant instead, edit the model filename in the Modelfile (see the example below).

➑️ See: Ollama: Modelfile docs
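The relevant change is the `FROM` line. The filename below is illustrative; use the actual Q4 GGUF filename shipped in this repository:

```
# Hypothetical filename -- replace with the actual Q4 GGUF file from this repo
FROM ./Llama-Sahabat-AI-v2-70B-IT-Q4_K_M.gguf
```

Then rebuild the local model with `ollama create <name> -f Modelfile`.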

🧠 Optional Optimizations

For lower-VRAM environments, you may consider enabling features like:

  • βœ… Attention head swapping

These features are backend-specific. Please refer to your inference engine’s documentation for configuration.
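As one concrete example, llama.cpp-based backends can offload only part of the model to the GPU via `n_gpu_layers`. Note this is partial layer offload rather than the head-swapping feature named above, and the values shown are only a starting point:

```python
# Partial GPU offload sketch with llama-cpp-python for low-VRAM setups.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=40,   # offload roughly half the layers; tune to fit your VRAM
    n_ctx=4096,
)
```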

πŸ“¦ Model Details

  • Format: GGUF
  • Model size: 70.6B params
  • Architecture: llama
  • Available quantizations: 2-bit (Q2), 4-bit (Q4)