Behemoth-X-123B-v2 — Quantized (compressed-tensors for vLLM)
This repository provides quantized runtime packages of TheDrummer/Behemoth-X-123B-v2 (a finetune of mistralai/Mistral-Large-Instruct-2411), packaged for vLLM using the compressed-tensors format.
TL;DR
- This repo is quantized (e.g., AWQ W4A16_ASYM and INT8 W8A16) for vLLM.
- Load with vLLM using --quantization compressed-tensors.
- Typical AWQ recipe: group_size=128, keep lm_head in higher precision; uses the upstream Mistral-Instruct chat template.
Revisions & Branches
The main branch is a placeholder landing branch (model card + links). All runnable artifacts live under per-revision branches.
- main — placeholder / landing page
- W4A16-ASYM — AWQ 4-bit weights / 16-bit activations builds and related assets
- INT8-W8A16 — 8-bit weights / 16-bit activations builds
Quick links:
- 🔗 main
- 🔗 W4A16-ASYM
- 🔗 INT8-W8A16
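Because the runnable artifacts live on per-revision branches, a serve command has to name one of them. The following is a minimal sketch, assuming vLLM's `--revision` flag is used to select the branch; `build_serve_command` and `RUNNABLE_REVISIONS` are hypothetical names introduced for illustration, and the remaining flags are the ones this card already uses.

```python
import shlex

REPO_ID = "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors"
# Branch names as listed above; main is a landing page with no runnable weights.
RUNNABLE_REVISIONS = ("W4A16-ASYM", "INT8-W8A16")


def build_serve_command(revision: str, tensor_parallel_size: int = 8) -> list[str]:
    """Return an argv list for `vllm serve` pinned to one revision branch."""
    if revision not in RUNNABLE_REVISIONS:
        raise ValueError(
            f"{revision!r} is not a runnable revision; pick one of {RUNNABLE_REVISIONS}"
        )
    return [
        "vllm", "serve", REPO_ID,
        "--revision", revision,  # assumption: branch selected via --revision
        "--quantization", "compressed-tensors",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for the 4-bit branch.
    print(shlex.join(build_serve_command("W4A16-ASYM")))
```

Rejecting `main` up front avoids a slow failure at load time, since that branch carries only the model card.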
What’s in this repo (per revision)
- Sharded quantized weights in .safetensors with an index (model.safetensors.index.json)
- config.json including compressed-tensors metadata (e.g., weight_format, quantization, quantization_config)
- Tokenizer artifacts (tokenizer.json, tokenizer.model, etc.)
- Optional: chat_template.jinja (inherits Mistral-Instruct format)
Exact files can differ by branch; see the Files and versions tab for each revision.
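Since file sets vary by branch, it can help to check a downloaded revision before serving it. The sketch below assumes a local checkout directory and reads the two JSON files named above. The `weight_map` key is the standard layout of a safetensors index; the `quant_method` and `format` keys inside `quantization_config` are assumptions about how compressed-tensors records its metadata, and all function names are hypothetical.

```python
import json
from pathlib import Path


def describe_quantization(config: dict) -> dict:
    """Summarize the quantization block of a parsed config.json."""
    qcfg = config.get("quantization_config") or {}
    return {
        "quantized": bool(qcfg),
        # Assumed key names; adjust if the branch's config.json differs.
        "quant_method": qcfg.get("quant_method"),
        "format": qcfg.get("format"),
    }


def list_shards(index: dict) -> list[str]:
    """Return the sorted, de-duplicated shard filenames from a safetensors index."""
    return sorted(set(index.get("weight_map", {}).values()))


def inspect_checkout(model_dir: str) -> dict:
    """Report quantization metadata and any shard files missing from model_dir."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    index = json.loads((root / "model.safetensors.index.json").read_text())
    shards = list_shards(index)
    return {
        **describe_quantization(config),
        "shards": shards,
        "missing_shards": [s for s in shards if not (root / s).exists()],
    }
```

An empty `missing_shards` list and `quantized` set to true suggest the download is complete and is a quantized export.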
Quickstart — vLLM
Install vLLM (recent version recommended):
pip install vllm
Serve (adjust to your hardware):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
Query via Chat Completions:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
"messages": [
{"role":"system","content":"You are Behemoth-X, helpful, precise, and safe."},
{"role":"user","content":"Outline a retrieval pipeline for scientific PDFs."}
],
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.95
}'
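The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above, assuming the server listens on localhost:8000 and returns the usual OpenAI-style response shape; `build_chat_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

MODEL_ID = "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors"


def build_chat_payload(user_prompt, system_prompt=None,
                       max_tokens=512, temperature=0.7, top_p=0.95):
    """Build the JSON body for /v1/chat/completions."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    payload = build_chat_payload(
        "Outline a retrieval pipeline for scientific PDFs.",
        system_prompt="You are Behemoth-X, helpful, precise, and safe.",
    )
    print(chat(payload))
```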
Note: compressed-tensors is a vLLM runtime format. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ compatible with Transformers) or full-precision weights.
Prompting / Chat Template
This package follows the Mistral-Instruct chat conventions from its parent finetune. If a chat_template.jinja is present in the branch, apply_chat_template will use it automatically.
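To see the exact prompt string the template produces, for example when debugging formatting, the tokenizer can render it offline. This is a sketch assuming the 🤗 Transformers tokenizer API; `render_prompt` and `load_tokenizer` are hypothetical helpers, and only the tokenizer is loaded here, not the quantized weights.

```python
def render_prompt(tokenizer, messages):
    """Render chat messages into the prompt string the template defines."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text rather than token IDs
        add_generation_prompt=True,  # end the prompt where the assistant reply begins
    )


def load_tokenizer(repo_id, revision):
    """Load only the tokenizer files from one revision branch."""
    # Imported lazily so render_prompt can be used without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(repo_id, revision=revision)


if __name__ == "__main__":
    tok = load_tokenizer(
        "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
        revision="W4A16-ASYM",
    )
    print(render_prompt(tok, [{"role": "user", "content": "Hello"}]))
```

When requests go through the vLLM chat completions endpoint, the server applies the template itself, so this step is only needed for inspection or for raw-completion workflows.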
Lineage
- Base model: mistralai/Mistral-Large-Instruct-2411
- Finetuned parent: TheDrummer/Behemoth-X-123B-v2
- This repo: Quantized child of the finetune (compressed-tensors for vLLM)
Hardware & Tips (rule-of-thumb)
- 123B models strongly prefer multi-GPU deployments (e.g., 8× high-VRAM GPUs).
- Long contexts are KV-cache heavy; tune --max-model-len and batch size.
- Prefer BF16 on GPUs with native support; otherwise FP16.
- Consider CUDA Graphs if stable in your stack.
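The tips above can be turned into rough arithmetic. The sketch below estimates weight memory from parameter count and bit width, and KV-cache memory from sequence length. The layer count, KV-head count, and head dimension in the example are assumptions about the Mistral-Large architecture and should be read from the branch's config.json; quantization scales, zero points, activations, and runtime overhead are ignored.

```python
GIB = 1024 ** 3


def weight_bytes(num_params, bits_per_weight):
    """Approximate bytes for the weights alone (no scales or zero points)."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for num_tokens cached tokens across all layers."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


if __name__ == "__main__":
    # Assumed shape values; verify against config.json before relying on them.
    layers, kv_heads, head_dim = 88, 8, 128
    for name, bits in (("W4A16", 4), ("W8A16", 8)):
        print(f"{name} weights: {weight_bytes(123e9, bits) / GIB:.1f} GiB")
    cache = kv_cache_bytes(65536, layers, kv_heads, head_dim)
    print(f"KV cache for one 65536-token sequence: {cache / GIB:.1f} GiB")
```

Under those assumed shape values, a single 65,536-token sequence needs 22 GiB of 16-bit KV cache in total across the tensor-parallel group, which is why lowering --max-model-len or the batch size frees memory quickly.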
License & Usage
This distribution inherits the licenses/policies of both the base and finetuned models:
- Base: mistralai/Mistral-Large-Instruct-2411
- Finetune: TheDrummer/Behemoth-X-123B-v2
Use of the model constitutes acceptance of the upstream terms.
Changelog
- v2 (current) — Quantized compressed-tensors exports for Behemoth-X-123B-v2; added W4A16-ASYM and INT8-W8A16 revision branches; updated model card for Quantized classification.