---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- int8
- quantized
base_model: TheDrummer/Behemoth-X-123B-v2
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: cc-by-nc-4.0
---

# Behemoth-X-123B-v2 – **Quantized** (compressed-tensors for vLLM)

This repository provides **quantized runtime packages** of
**[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**
(a finetune of **[mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)**), packaged for **vLLM** using the **compressed-tensors** format.

> **TL;DR**
> - **This repo is quantized** (e.g., **AWQ W4A16_ASYM** and **INT8 W8A16**) for **vLLM**.
> - Load with **vLLM** using `--quantization compressed-tensors`.
> - Typical AWQ recipe: **group_size=128**, keep `lm_head` in higher precision; uses the upstream **Mistral-Instruct** chat template.

---

## Revisions & Branches

> The **`main`** branch is a **placeholder landing branch** (model card + links). All runnable artifacts live under per-revision branches.

- **main** – placeholder / landing page
- **W4A16-ASYM** – AWQ 4-bit weights / 16-bit activations builds and related assets
- **INT8-W8A16** – 8-bit weights / 16-bit activations builds

**Quick links:**
- **[`main`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/main)**
- **[`W4A16-ASYM`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/W4A16-ASYM)**
- **[`INT8-W8A16`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/INT8-W8A16)**
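
Because the runnable artifacts live on the per-revision branches rather than `main`, it is often easiest to pull the branch you want locally before serving. A minimal sketch using the `huggingface-cli` tool (the local directory name is just an example; swap the `--revision` value for `INT8-W8A16` if you want the 8-bit build):

```bash
# Download only the W4A16-ASYM (AWQ 4-bit) revision into a local folder.
pip install -U "huggingface_hub[cli]"
huggingface-cli download TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --local-dir ./Behemoth-X-123B-v2_W4A16-ASYM
```

The resulting directory path can be passed to `vllm serve` in place of the repo ID.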

---

## What's in this repo (per revision)

- **Sharded quantized weights** in `.safetensors` with an index (`model.safetensors.index.json`)
- `config.json` including **compressed-tensors** metadata (e.g., `weight_format`, `quantization`, `quantization_config`)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, etc.)
- Optional: `chat_template.jinja` (inherits the **Mistral-Instruct** format)

> Exact files can differ by branch; see the **Files and versions** tab for each revision.
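
If you want to confirm which scheme a downloaded branch actually uses, the quantization metadata in `config.json` is the quickest check. A small sketch, assuming `jq` is available and the branch stores its settings under the `quantization_config` key mentioned above:

```bash
# Show the compressed-tensors quantization settings of a locally downloaded branch.
jq '.quantization_config' ./Behemoth-X-123B-v2_W4A16-ASYM/config.json
```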

---

## Quickstart – vLLM

Install vLLM (recent version recommended):

```bash
pip install vllm
```

Serve (adjust to your hardware):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```
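
Because `main` is a placeholder branch, point vLLM at one of the quantized revisions. A sketch assuming your vLLM version exposes the `--revision` engine argument (alternatively, pass the path of a locally downloaded branch as the model):

```bash
# Serve the AWQ 4-bit build from the W4A16-ASYM branch instead of main.
vllm serve TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8
```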

Query via **Chat Completions**:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are Behemoth-X, helpful, precise, and safe."},
      {"role": "user", "content": "Outline a retrieval pipeline for scientific PDFs."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
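
If the request returns a model-not-found error, the `model` field must match the ID the server registered (which differs if you served a local path or set `--served-model-name`). You can list the exposed IDs with:

```bash
# List the model IDs the running server exposes.
curl http://localhost:8000/v1/models
```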

> **Note:** `compressed-tensors` is a **vLLM runtime format**. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ compatible with Transformers) or full-precision weights.

---

## Prompting / Chat Template

This package follows the **Mistral-Instruct** chat conventions from its parent finetune. If a `chat_template.jinja` is present in the branch, `apply_chat_template` will use it automatically.

---

## Lineage

- **Base model:** [mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)
- **Finetuned parent:** [TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)
- **This repo:** **Quantized child** of the finetune (compressed-tensors for vLLM)

---

## Hardware & Tips (rule of thumb)

- 123B models strongly prefer **multi-GPU** deployments (e.g., 8× high-VRAM GPUs).
- Long contexts are **KV-cache** heavy; tune `--max-model-len` and batch size (see the sketch below).
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Consider CUDA graphs if they are stable in your stack.
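
As a concrete starting point for tighter memory budgets, the sketch below shortens the context and compresses the KV cache; the flag values are illustrative, and FP8 KV-cache support depends on your vLLM version and GPUs:

```bash
# A more memory-conservative configuration: shorter context, FP8 KV cache,
# and eager mode (skips CUDA graph capture) for easier debugging.
vllm serve TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --enforce-eager
```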

---

## License & Usage

This distribution inherits the licenses and policies of both the **base** and **finetuned** models:

- Base: **[mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)**
- Finetune: **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**

Use of the model constitutes acceptance of the upstream terms.

---

## Changelog

- **v2 (current)** – Quantized compressed-tensors exports for Behemoth-X-123B-v2; added **W4A16-ASYM** and **INT8-W8A16** revision branches; updated the model card for the **Quantized** classification.