# 📦 GGUF of GoToCompany/Llama-Sahabat-AI-v2-70B-IT
## 🧠 VRAM Recommendation
- 40 GB VRAM recommended
- Q2 quant tested on: RTX 3090
Original model: 🔗 GoToCompany/Llama-Sahabat-AI-v2-70B-IT
## 📉 Perplexity Notes
As expected, lower-precision quantization results in higher perplexity.
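For reference (not stated in the original card): perplexity is the exponentiated average negative log-likelihood over held-out text, which is the quantity llama.cpp's perplexity tool reports:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\big(x_i \mid x_{<i}\big)\right)$$

Quantization error perturbs the predicted probabilities $p(x_i \mid x_{<i})$, so more aggressive quants (Q2 vs. Q4) score measurably higher PPL.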
This GGUF version is intended as a side project to support llama.cpp-based backends, allowing inference on much lower-spec hardware.
Use cases include:
- 🖥️ CPU-only inference (an AVX-512-capable CPU is recommended; see the sketch after this list)
- 🌐 Distributed inference systems using GGUF-quantized models
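A minimal CPU-only sketch using the llama-cpp-python bindings; the GGUF filename below is illustrative, so substitute the quant file you actually downloaded:

```python
# Pure-CPU inference sketch with llama-cpp-python.
# The filename is hypothetical -- use the Q2/Q4 file from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",
    n_ctx=4096,      # context window; larger values grow the KV cache
    n_threads=16,    # set to your physical core count
    n_gpu_layers=0,  # 0 = no GPU offload, CPU only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Halo, apa kabar?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```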
## ⚠️ Model Size & Inference
- The full model weights require ~25 GB of VRAM to load.
- This does not include the additional memory required for the KV cache, which is essential for inference (a rough estimate is sketched below).
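As a back-of-the-envelope sketch, assuming the standard Llama-3-70B geometry of 80 layers, 8 KV heads under grouped-query attention, and head dimension 128 (none of which this card states), with an fp16 cache:

```python
# KV-cache size estimate for a Llama-3-70B-shaped model with an fp16 cache.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2  # assumed geometry
ctx_len = 4096  # tokens of context you plan to run

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * ctx_len / 2**30
print(f"{per_token / 1024:.0f} KiB per token, {total_gib:.2f} GiB at {ctx_len} ctx")
# -> 320 KiB per token, 1.25 GiB at 4096 ctx
```

Budget accordingly: doubling the context doubles the KV-cache footprint.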
## 📄 Modelfile Included
A prebuilt Modelfile for the Q2 quant is included for use with Ollama; to switch to Q4, edit the model filename in the Modelfile (see the sketch below).
➡️ See: Ollama Modelfile docs
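A minimal Modelfile along these lines (the GGUF filename is illustrative; point `FROM` at the file you downloaded):

```
# Minimal Ollama Modelfile; change the filename to the Q4 GGUF to switch quants.
FROM ./Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf

# Optional generation defaults
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
```

Build and run it with `ollama create sahabat-ai -f Modelfile` followed by `ollama run sahabat-ai` (the model name is your choice).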
## 🔧 Optional Optimizations
For lower-VRAM environments, you may consider enabling features like:
- ✅ Attention head swapping
These features are backend-specific. Please refer to your inference engine's documentation for configuration.
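Head/KV swapping is not exposed the same way everywhere. As one related, widely available knob (a different feature, shown only as a sketch), llama.cpp-based backends can offload just part of the layer stack to the GPU and keep the rest in system RAM; with the llama-cpp-python bindings and an illustrative filename:

```python
from llama_cpp import Llama

# Partial offload: only some layers live on the GPU, the rest run from RAM.
llm = Llama(
    model_path="Llama-Sahabat-AI-v2-70B-IT-Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=40,  # roughly half of an 80-layer model on the GPU
    n_ctx=2048,       # a smaller context also shrinks the KV cache
)
```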