---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- int8
- quantized
base_model: TheDrummer/Behemoth-X-123B-v2
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: cc-by-nc-4.0
---

# Behemoth-X-123B-v2 – **Quantized** (compressed-tensors for vLLM)

This repository provides **quantized runtime packages** of
**[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**
(a finetune of **[mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)**), packaged for **vLLM** using the **compressed-tensors** format.

> **TL;DR**
> - **This repo is quantized** (e.g., **AWQ W4A16_ASYM** and **INT8 W8A16**) for **vLLM**.
> - Load with **vLLM** using `--quantization compressed-tensors`.
> - Typical AWQ recipe: **group_size=128**, keep `lm_head` in higher precision; uses the upstream **Mistral-Instruct** chat template.

---

## Revisions & Branches

> The **`main`** branch is a **placeholder landing branch** (model card + links). All runnable artifacts live under per-revision branches.

- **main** – placeholder / landing page
- **W4A16-ASYM** – AWQ 4-bit weights / 16-bit activations builds and related assets
- **INT8-W8A16** – 8-bit weights / 16-bit activations builds

**Quick links:**
- **[`main`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/main)**
- **[`W4A16-ASYM`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/W4A16-ASYM)**
- **[`INT8-W8A16`](https://huggingface.co/TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors/tree/INT8-W8A16)**
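
Because the runnable artifacts live on the per-revision branches rather than `main`, it is often easiest to pull the branch you want locally before serving. A minimal sketch using the `huggingface-cli` tool (the local directory name is just an example; swap the `--revision` value for `INT8-W8A16` if you want the 8-bit build):

```bash
# Download only the W4A16-ASYM (AWQ 4-bit) revision into a local folder.
pip install -U "huggingface_hub[cli]"
huggingface-cli download TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --local-dir ./Behemoth-X-123B-v2_W4A16-ASYM
```

The resulting directory path can be passed to `vllm serve` in place of the repo ID.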

---

## What's in this repo (per revision)

- **Sharded quantized weights** in `.safetensors` with an index (`model.safetensors.index.json`)
- `config.json` including **compressed-tensors** metadata (e.g., `weight_format`, `quantization`, `quantization_config`)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, etc.)
- Optional: `chat_template.jinja` (inherits the **Mistral-Instruct** format)

> Exact files can differ by branch; see the **Files and versions** tab for each revision.
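
If you want to confirm which scheme a downloaded branch actually uses, the quantization metadata in `config.json` is the quickest check. A small sketch, assuming `jq` is available and the branch stores its settings under the `quantization_config` key mentioned above:

```bash
# Show the compressed-tensors quantization settings of a locally downloaded branch.
jq '.quantization_config' ./Behemoth-X-123B-v2_W4A16-ASYM/config.json
```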

---

## Quickstart – vLLM

Install vLLM (recent version recommended):

```bash
pip install vllm
```

Serve (adjust to your hardware):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```
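
Because `main` is a placeholder branch, point vLLM at one of the quantized revisions. A sketch assuming your vLLM version exposes the `--revision` engine argument (alternatively, pass the path of a locally downloaded branch as the model):

```bash
# Serve the AWQ 4-bit build from the W4A16-ASYM branch instead of main.
vllm serve TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8
```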

Query via **Chat Completions**:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are Behemoth-X, helpful, precise, and safe."},
      {"role": "user", "content": "Outline a retrieval pipeline for scientific PDFs."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
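
If the request returns a model-not-found error, the `model` field must match the ID the server registered (which differs if you served a local path or set `--served-model-name`). You can list the exposed IDs with:

```bash
# List the model IDs the running server exposes.
curl http://localhost:8000/v1/models
```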

> **Note:** `compressed-tensors` is a **vLLM runtime format**. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ compatible with Transformers) or full-precision weights.

---

## Prompting / Chat Template

This package follows the **Mistral-Instruct** chat conventions from its parent finetune. If a `chat_template.jinja` is present in the branch, `apply_chat_template` will use it automatically.

---

## Lineage

- **Base model:** [mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)
- **Finetuned parent:** [TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)
- **This repo:** **Quantized child** of the finetune (compressed-tensors for vLLM)

---

## Hardware & Tips (rule of thumb)

- 123B models strongly prefer **multi-GPU** deployments (e.g., 8× high-VRAM GPUs).
- Long contexts are **KV-cache** heavy; tune `--max-model-len` and batch size (see the sketch below).
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Consider CUDA graphs if they are stable in your stack.
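
As a concrete starting point for tighter memory budgets, the sketch below shortens the context and compresses the KV cache; the flag values are illustrative, and FP8 KV-cache support depends on your vLLM version and GPUs:

```bash
# A more memory-conservative configuration: shorter context, FP8 KV cache,
# and eager mode (skips CUDA graph capture) for easier debugging.
vllm serve TheHouseOfTheDude/Behemoth-X-123B-v2_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --enforce-eager
```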

---

## License & Usage

This distribution inherits the licenses and policies of both the **base** and **finetuned** models:

- Base: **[mistralai/Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)**
- Finetune: **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**

Use of the model constitutes acceptance of the upstream terms.

---

## Changelog

- **v2 (current)** – Quantized compressed-tensors exports for Behemoth-X-123B-v2; added **W4A16-ASYM** and **INT8-W8A16** revision branches; updated the model card for the **Quantized** classification.