---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- int8
- quantized
base_model: TheDrummer/Behemoth-X-123B-v2
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: cc-by-nc-4.0
---
# Behemoth-X-123B-v2 – **Quantized** (compressed-tensors for vLLM)
This repository provides **quantized runtime packages** of
**[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**
(a finetune of **[mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)**), packaged for **vLLM** using the **compressed-tensors** format.
> **TL;DR**
> - **This repo is quantized** (e.g., **AWQ W4A16_ASYM** and **INT8 W8A16**) for **vLLM**.
> - Load with **vLLM** using `--quantization compressed-tensors`.
> - Typical AWQ recipe: **group_size=128**, keep `lm_head` in higher precision; uses the upstream **Mistral-Instruct** chat template.
---
## Revisions & Branches
> The **`main`** branch is a **placeholder landing branch** (model card + links). All runnable artifacts live under per-revision branches.
- **main** – placeholder / landing page
- **W4A16-ASYM** – AWQ 4-bit weights / 16-bit activations builds and related assets
- **INT8-W8A16** – 8-bit weights / 16-bit activations builds
**Quick links:**
- 🔗 **[`main`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/main)**
- 🔗 **[`W4A16-ASYM`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/W4A16-ASYM)**
- 🔗 **[`INT8-W8A16`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/INT8-W8A16)**
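To pull one revision locally before serving from a local path, a minimal sketch using `huggingface_hub` (the target directory name is just an example):

```python
# Sketch: download a single quantized revision with huggingface_hub.
# The local_dir value is an arbitrary example path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    revision="W4A16-ASYM",  # or "INT8-W8A16"
    local_dir="./behemoth-x-123b-v2-w4a16",
)
print(local_path)
```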
---
## What's in this repo (per revision)
- **Sharded quantized weights** in `.safetensors` with an index (`model.safetensors.index.json`)
- `config.json` including **compressed-tensors** metadata (e.g., `weight_format`, `quantization`, `quantization_config`)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, etc.)
- Optional: `chat_template.jinja` (inherits **Mistral-Instruct** format)
> Exact files can differ by branch; see the **Files and versions** tab for each revision.
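To check the quantization metadata of a branch without downloading the full weights, a small sketch that fetches only `config.json` (the exact key names inside `quantization_config` may vary by release):

```python
# Sketch: fetch config.json from one revision branch and print its
# quantization_config block (field names can differ per release).
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download(
    repo_id="TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    filename="config.json",
    revision="W4A16-ASYM",
)
with open(cfg_path) as f:
    cfg = json.load(f)
print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```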
---
## Quickstart β€” vLLM
Install vLLM (recent version recommended):
```bash
pip install vllm
```
Serve a quantized revision (adjust to your hardware; remember that `main` holds no weights, so pick a revision branch):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```
Query via **Chat Completions**:
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
"messages": [
{"role":"system","content":"You are Behemoth-X, helpful, precise, and safe."},
{"role":"user","content":"Outline a retrieval pipeline for scientific PDFs."}
],
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.95
}'
```
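The same server can be queried from Python with the OpenAI-compatible client (this assumes the server started above; the API key value is ignored by a default vLLM setup):

```python
# Sketch: query the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Behemoth-X, helpful, precise, and safe."},
        {"role": "user", "content": "Outline a retrieval pipeline for scientific PDFs."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```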
> **Note:** `compressed-tensors` is a **vLLM runtime format**. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ compatible with Transformers) or full-precision weights.
---
## Prompting / Chat Template
This package follows the **Mistral-Instruct** chat conventions of its parent finetune. If a `chat_template.jinja` is present in the branch, vLLM's chat endpoints and the tokenizer's `apply_chat_template` will use it automatically.
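The tokenizer (and its chat template) loads fine in 🤗 Transformers even though the quantized weights themselves require vLLM, so you can preview the rendered prompt locally; a sketch assuming the template ships with the chosen revision:

```python
# Sketch: render the chat template locally to inspect the prompt format.
# Only the tokenizer is loaded here, not the quantized weights.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    revision="W4A16-ASYM",
)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```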
---
## Lineage
- **Base model:** [mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)
- **Finetuned parent:** [TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)
- **This repo:** **Quantized child** of the finetune (compressed-tensors for vLLM)
---
## Hardware & Tips (rule-of-thumb)
- 123B-parameter models strongly prefer **multi-GPU** deployments (e.g., 8× high-VRAM GPUs).
- Long contexts are **KV-cache** heavy; tune `--max-model-len` and batch size (see the sketch after this list).
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Consider CUDA Graphs if they are stable in your stack.
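As a rough illustration of those knobs in the offline API, a sketch where every numeric value is a placeholder to tune for your hardware:

```python
# Sketch: offline vLLM inference with the main memory/context knobs exposed.
# All numeric values below are placeholders to adjust for your GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    revision="W4A16-ASYM",
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    dtype="bfloat16",
    max_model_len=32768,          # smaller context -> smaller KV cache
    gpu_memory_utilization=0.70,
    enforce_eager=False,          # False keeps CUDA Graphs enabled
)
out = llm.generate(
    ["Summarize the benefits of weight-only quantization."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)
```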
---
## License & Usage
This distribution inherits the licenses/policies of both the **base** and **finetuned** models:
- Base: **[mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)**
- Finetune: **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**
Use of the model constitutes acceptance of the upstream terms.
---
## Changelog
- **v2 (current)** – Quantized compressed-tensors exports for Behemoth-X-123B-v2; added **W4A16-ASYM** and **INT8-W8A16** revision branches; updated model card for **Quantized** classification.