# L3.3-Animus-V10.0 — Compressed-Tensors (vLLM runtime)
A quantized release under TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors, packaged for high‑throughput inference with vLLM using the compressed‑tensors runtime format.
- Fine‑tuned model: Darkhn/L3.3-Animus-V10.0
## TL;DR
- Load with `--quantization compressed-tensors` in vLLM.
- Use the branch selector to pick a quant format; `main` is a placeholder.
- Each branch ships sharded `.safetensors`, tokenizer files, and a `config.json` declaring `quantization_config`.
## Revisions & Branches
The `main` branch is a placeholder landing branch (README + pointers). All runnable artifacts live on per-revision branches.
| Branch | Scheme | Weights | Activations | Symmetry | Notes |
|---|---|---|---|---|---|
| `W4A16-ASYM` | Compressed-Tensors | 4-bit | 16-bit | Asymmetric | Standard CT recipe (often group size 128; see branch `config.json`) |
| `W8A16` | Custom CT recipe | 8-bit | 16-bit | Symmetric (unless specified) | Heavier weights; higher fidelity vs. W4 |
| `W8A16-ASYM` | Custom CT recipe | 8-bit | 16-bit | Asymmetric | Custom asymmetric variant |
Hot links:
- 🔗 Browse `main`
- 🔗 Browse `W4A16-ASYM`
- 🔗 Browse `W8A16`
- 🔗 Browse `W8A16-ASYM`
Future revisions may add additional formats (e.g., GPTQ/AWQ/EXL2/EXL3 exports). Check the revisions dropdown or the links above.
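The available branches can also be enumerated programmatically with `huggingface_hub`; a minimal sketch (it simply prints whatever branches the repo exposes when you run it):

```python
from huggingface_hub import list_repo_refs

# List every branch (revision) published under this repo.
refs = list_repo_refs("TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors")
for branch in refs.branches:
    print(branch.name)  # e.g. main, W4A16-ASYM, W8A16, W8A16-ASYM
```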
## What's in each branch
- Sharded weights (`model-00001-of-XXXX.safetensors`, …) plus `model.safetensors.index.json`
- `config.json` with `quantization_config` (e.g., `format: "pack-quantized"`, `quant_method: "compressed-tensors"`, number of bits, symmetry, group size)
- Tokenizer artifacts (`tokenizer.json`, merges/vocab as applicable)
- Optional: `chat_template.jinja` if a custom template is required (otherwise the upstream template is used)

Exact files may differ by branch; see the Files and versions tab for the branch you select.
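To confirm exactly which files a branch ships without downloading the weights, `huggingface_hub` can list a revision's contents; a minimal sketch (the branch name is just an example):

```python
from huggingface_hub import list_repo_files

# Enumerate the artifacts on one quant branch (shards, index, tokenizer, config, ...).
files = list_repo_files(
    "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",
)
for name in files:
    print(name)
```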
## Quantization Notes
- Compressed-Tensors (CT) is a runtime format for vLLM. These branches are exported/packed for that runtime.
- Loaders that expect raw AWQ/GPTQ tensors are not compatible with these branches.
- Group size is typically 128 for `W4A16-ASYM`; other branches may differ. Always defer to the branch's `config.json`.
- Layers such as `lm_head` may be kept in higher precision to preserve output quality (see the branch config).

If you need domain-matched calibration (e.g., code, legal, or chatty dialog), open an issue; additional calibration variants can be added as separate branches.
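To check the quantization parameters a branch actually declares (bits, symmetry, group size, any layers kept in higher precision), you can pull just its `config.json`; a minimal sketch, assuming the `W4A16-ASYM` branch:

```python
import json

from huggingface_hub import hf_hub_download

# Download only config.json for the chosen revision and inspect quantization_config.
path = hf_hub_download(
    "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    "config.json",
    revision="W4A16-ASYM",
)
with open(path) as f:
    cfg = json.load(f)

print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```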
## Quickstart — vLLM
Install vLLM (a recent version is recommended):

```bash
pip install vllm
```
Serve (choose one branch; the example shows `W4A16-ASYM`):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```
Query via Chat Completions:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are Animus, helpful, precise, and safe."},
      {"role": "user", "content": "List three robust strategies for KV-cache optimization."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
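The same request from Python, via the OpenAI-compatible client pointed at the local vLLM server; a sketch (the `api_key` value is a placeholder, since vLLM ignores it unless `--api-key` is set):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; point the client at it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Animus, helpful, precise, and safe."},
        {"role": "user", "content": "List three robust strategies for KV-cache optimization."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```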
## Tips
- Prefer BF16 on GPUs with native support; use FP16 where BF16 is slower or unsupported.
- Long contexts are KV-cache heavy; tune `--max-model-len` and batch size.
- Match `--tensor-parallel-size` to your GPU count; enable P2P/NVLink where available.
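The same knobs carry over to vLLM's offline `LLM` API if you run inference in-process instead of serving; a minimal sketch mirroring the serve example above (`LLM.chat` requires a reasonably recent vLLM release, and the values are illustrative):

```python
from vllm import LLM, SamplingParams

# In-process engine with the same quantization, parallelism, and memory settings
# as the `vllm serve` example above; size tensor_parallel_size to your GPU count.
llm = LLM(
    model="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    max_model_len=65536,
    gpu_memory_utilization=0.70,
    dtype="bfloat16",
)

params = SamplingParams(max_tokens=512, temperature=0.7, top_p=0.95)
outputs = llm.chat(
    [{"role": "user", "content": "Give three strategies for KV-cache optimization."}],
    params,
)
print(outputs[0].outputs[0].text)
```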
## Prompting / Chat Template
This repo uses the tokenizer and chat format from the upstream L3.3-Animus-V10.0 release. If a `chat_template.jinja` is present in the branch, `apply_chat_template` will use it automatically.
```python
from transformers import AutoTokenizer

mid = "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors"
tok = AutoTokenizer.from_pretrained(mid, use_fast=True, trust_remote_code=True, revision="W4A16-ASYM")

messages = [
    {"role": "system", "content": "You are Animus, helpful, precise, and safe."},
    {"role": "user", "content": "Give three strategies for KV-cache optimization."},
]

input_ids = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
```
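To inspect the rendered prompt itself (useful when debugging template issues), the same call can return text instead of token IDs; a small usage sketch continuing the snippet above:

```python
# Render the chat template to a plain string rather than tokenizing it.
prompt_text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt_text)
```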
## Intended Use & Compatibility
- General instruction following, long-form drafting, multi-step reasoning, and RAG/agent pipelines.
- ✅ Supported: vLLM with `--quantization compressed-tensors` (the purpose of this repo).
- ❌ Not intended: llama.cpp/GGUF, raw AWQ/GPTQ loaders, or vanilla 🤗 `.from_pretrained()` without a CT-compatible runtime.
## License & Usage
This distribution inherits the upstream license(s) and usage policies of the fine-tuned model, Darkhn/L3.3-Animus-V10.0; please review its model card before use.
Use of the model constitutes acceptance of upstream terms.
## Changelog
- v10.0 (current) — Initial set of Compressed-Tensors branches: `W4A16-ASYM`, `W8A16`, `W8A16-ASYM`; added README and branch links.
## Quick Links
- Fine-tuned model: Darkhn/L3.3-Animus-V10.0
- This repo (browse branches): `main` · `W4A16-ASYM` · `W8A16` · `W8A16-ASYM`
## Model tree
- Base model: Darkhn-Graveyard/L3.3-70B-Animus-Base