Behemoth-X-123B-v2 — Quantized (compressed-tensors for vLLM)

This repository provides quantized builds of TheDrummer/Behemoth-X-123B-v2 (a finetune of mistralai/Mistral-Large-Instruct-2411), packaged for vLLM in the compressed-tensors format.

TL;DR

This repo is quantized (e.g., AWQ W4A16_ASYM and INT8 W8A16) for vLLM.

Load with vLLM using --quantization compressed-tensors, and select a branch with --revision because main holds no weights.

Typical AWQ recipe: group_size=128, keep lm_head in higher precision; uses the upstream Mistral-Instruct chat template.

Revisions & Branches

The main branch is a placeholder landing branch (model card + links). All runnable artifacts live under per-revision branches.

main — placeholder / landing page
W4A16-ASYM — AWQ 4-bit weights / 16-bit activations builds and related assets
INT8-W8A16 — 8-bit weights / 16-bit activations builds
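
Because the weights live on branches rather than on main, a download has to name the revision explicitly. The following is a minimal sketch, assuming the huggingface_hub package is installed; the helper names download_kwargs and download are hypothetical, and the branch names are copied from the list above.

```python
from typing import Any, Dict

REPO_ID = "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors"
# Branch names as listed in this card; main is a placeholder without weights.
RUNNABLE_REVISIONS = ("W4A16-ASYM", "INT8-W8A16")


def download_kwargs(revision: str) -> Dict[str, Any]:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    if revision not in RUNNABLE_REVISIONS:
        raise ValueError(
            f"{revision!r} has no runnable weights; choose one of {RUNNABLE_REVISIONS}"
        )
    return {"repo_id": REPO_ID, "revision": revision}


def download(revision: str) -> str:
    """Download one revision and return the local snapshot directory."""
    # Imported lazily so download_kwargs works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(revision))
```

Calling download("W4A16-ASYM") would fetch that branch into the local Hugging Face cache; expect a large transfer for a 123B model.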

What’s in this repo (per revision)

Sharded quantized weights in .safetensors with an index (model.safetensors.index.json)
config.json including compressed-tensors metadata (e.g., weight_format, quantization, quantization_config)
Tokenizer artifacts (tokenizer.json, tokenizer.model, etc.)
Optional: chat_template.jinja (inherits Mistral-Instruct format)

Exact files can differ by branch; see the Files and versions tab for each revision.
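
After downloading a revision, it can be worth checking that every shard named in the index is present and seeing which weight scheme the config declares. This is a rough sketch using only the standard library. The weight_map key is the usual layout of model.safetensors.index.json; the config_groups and weights keys are an assumption about how compressed-tensors metadata is laid out in config.json and may differ in a given branch. Both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def missing_shards(model_dir: str) -> List[str]:
    """Return shard files named in the index but absent from model_dir."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # "weight_map" maps each tensor name to the shard file that stores it.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (root / name).exists()]


def weight_schemes(model_dir: str) -> Dict[str, dict]:
    """Return the per-group weight settings from config.json, if present."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization_config", {})
    # Assumed layout: config_groups -> <group name> -> weights -> {num_bits, ...}
    groups = quant.get("config_groups", {})
    return {name: (group.get("weights") or {}) for name, group in groups.items()}
```

An empty list from missing_shards suggests the snapshot is complete; an empty dict from weight_schemes means the assumed keys were not found, not necessarily that the model is unquantized.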

Quickstart — vLLM

Install vLLM (recent version recommended):

pip install vllm

Serve (adjust to your hardware; pass a revision branch, because main holds no weights):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
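
The same settings can also be passed to vLLM's offline Python API instead of the server. The sketch below mirrors the serve flags as keyword arguments to vllm.LLM; the choice of the W4A16-ASYM revision and the helper names engine_kwargs and generate_once are assumptions for illustration, and LLM.chat requires a reasonably recent vLLM.

```python
from typing import Any, Dict

REPO_ID = "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors"


def engine_kwargs(revision: str = "W4A16-ASYM", num_gpus: int = 8) -> Dict[str, Any]:
    """Mirror the serve command's flags as vllm.LLM keyword arguments."""
    return {
        "model": REPO_ID,
        "revision": revision,  # branch holding the weights (assumed choice)
        "quantization": "compressed-tensors",
        "tensor_parallel_size": num_gpus,
        "max_model_len": 65536,
        "gpu_memory_utilization": 0.70,
        "dtype": "bfloat16",
    }


def generate_once(prompt: str) -> str:
    """Load the model and return one chat completion for a user prompt."""
    # Imported lazily: vLLM is heavy and needs GPUs to initialize.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    params = SamplingParams(max_tokens=512, temperature=0.7, top_p=0.95)
    outputs = llm.chat([{"role": "user", "content": prompt}], sampling_params=params)
    return outputs[0].outputs[0].text
```

Lower num_gpus or max_model_len in engine_kwargs if your hardware differs from the 8-GPU setup shown in the serve command.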

Query via Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Behemoth-X, helpful, precise, and safe."},
      {"role":"user","content":"Outline a retrieval pipeline for scientific PDFs."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
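
For scripted use, the same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server from the previous step is listening on localhost:8000; build_payload and chat are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict

MODEL = "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors"


def build_payload(user_text: str,
                  system_text: str = "You are Behemoth-X, helpful, precise, and safe."
                  ) -> Dict[str, Any]:
    """Build the same JSON body as the curl example."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95,
    }


def chat(payload: Dict[str, Any], base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the Chat Completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, chat(build_payload("Outline a retrieval pipeline for scientific PDFs.")) would return the assistant message as a string once the server is up.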

Note: these compressed-tensors builds are packaged for vLLM. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ compatible with Transformers) or full-precision weights.

Prompting / Chat Template

This package follows the Mistral-Instruct chat conventions from its parent finetune. If a chat_template.jinja is present in the branch, apply_chat_template will use it automatically.
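
To see the exact prompt string the template produces, the tokenizer alone can be loaded and asked to render a conversation without tokenizing it. The sketch below assumes the transformers package is installed and that the chosen branch ships tokenizer files; only the tokenizer is loaded, not the quantized weights. The helper names and the W4A16-ASYM default are illustrative.

```python
from typing import Any, Dict, List

REPO_ID = "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors"


def render_prompt(tokenizer: Any, messages: List[Dict[str, str]]) -> str:
    """Render chat messages to the prompt text the model would receive."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text rather than token ids
        add_generation_prompt=True,  # end the prompt where the assistant reply starts
    )


def load_tokenizer(revision: str = "W4A16-ASYM") -> Any:
    """Load only the tokenizer (and its chat template) from one branch."""
    # Imported lazily so render_prompt can be used with any compatible tokenizer.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
```

Printing render_prompt(load_tokenizer(), messages) is a quick way to confirm that system and user turns are wrapped the way you expect before sending traffic to the server.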

Lineage

Base model: mistralai/Mistral-Large-Instruct-2411
Finetuned parent: TheDrummer/Behemoth-X-123B-v2
This repo: Quantized child of the finetune (compressed-tensors for vLLM)

Hardware & Tips (rule-of-thumb)

123B models strongly prefer multi-GPU deployments (e.g., 8× high-VRAM GPUs).
Long contexts are KV-cache heavy; tune --max-model-len and batch size.
Prefer BF16 on GPUs with native support; otherwise FP16.
vLLM uses CUDA Graphs by default; pass --enforce-eager to disable them if they are unstable in your stack.
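
A back-of-the-envelope estimate helps when sizing --max-model-len. The sketch below computes KV-cache and weight memory from first principles. The architecture defaults (88 layers, 8 KV heads, head dimension 128) are assumptions based on the Mistral Large family and should be checked against num_hidden_layers, num_key_value_heads, and head_dim in the branch's config.json; the weight figure ignores quantization scales, zero-points, and any layers kept in higher precision.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 88,      # assumed; check config.json
                   num_kv_heads: int = 8,     # assumed; check config.json
                   head_dim: int = 128,       # assumed; check config.json
                   bytes_per_value: int = 2   # BF16/FP16 cache
                   ) -> int:
    """Total KV-cache size for num_tokens cached tokens, summed over all GPUs."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage, ignoring scales and zero-points."""
    return num_params * bits_per_weight / 8


if __name__ == "__main__":
    print(f"KV cache, one 65536-token sequence: {kv_cache_bytes(65536) / GIB:.1f} GiB")
    print(f"4-bit weights: {weight_bytes(123e9, 4) / 1e9:.1f} GB")
    print(f"8-bit weights: {weight_bytes(123e9, 8) / 1e9:.1f} GB")
```

Under these assumptions a single 65,536-token sequence needs about 22 GiB of KV cache in total, and the 4-bit weights take roughly 61.5 GB, so context length and concurrency can dominate memory planning as much as the weights do.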

License & Usage

This distribution inherits the licenses/policies of both the base and finetuned models:

Base: mistralai/Mistral-Large-Instruct-2411
Finetune: TheDrummer/Behemoth-X-123B-v2

Use of the model constitutes acceptance of the upstream terms.

Changelog

v2 (current) — Quantized compressed-tensors exports for Behemoth-X-123B-v2; added W4A16-ASYM and INT8-W8A16 revision branches; updated model card for Quantized classification.
