L3.3‑Animus‑V10.0 — Compressed‑Tensors (vLLM runtime)

A quantized release under TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors, packaged for high‑throughput inference with vLLM using the compressed‑tensors runtime format.

TL;DR

  • Load with --quantization compressed-tensors in vLLM.
  • Use the branch selector to pick a quant format. main is a placeholder.
  • Each branch ships sharded .safetensors, tokenizer files, and a config.json declaring quantization_config.

Revisions & Branches

The main branch is a placeholder landing branch (README + pointers). All runnable artifacts live on per‑revision branches.

  • W4A16-ASYM: Compressed‑Tensors scheme, 4‑bit weights / 16‑bit activations, asymmetric. Standard CT recipe (often group size 128; see the branch's config.json).
  • W8A16: Custom CT recipe, 8‑bit weights / 16‑bit activations, symmetric (unless specified otherwise). Heavier weights; higher fidelity than W4.
  • W8A16-ASYM: Custom CT recipe, 8‑bit weights / 16‑bit activations, asymmetric. Custom asymmetric variant.

Hot links:

Future revisions may add additional formats (e.g., GPTQ/AWQ/EXL2/EXL3 exports); check the branch dropdown or the links above in this README.
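
To pull one branch locally (for offline serving or inspection), huggingface_hub's snapshot_download can fetch a single revision; a minimal sketch, assuming the W4A16-ASYM branch and an illustrative target directory:

from huggingface_hub import snapshot_download

# Download a single quant branch (revision) of the repo.
local_path = snapshot_download(
    repo_id="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",           # pick the branch you want to run
    local_dir="./animus-w4a16-asym",  # hypothetical local directory
)
print(local_path)

The downloaded directory can then be passed to vllm serve in place of the Hub ID.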


What’s in each branch

  • Sharded weights (model-00001-of-XXXX.safetensors …) + model.safetensors.index.json
  • config.json with quantization_config (e.g., format: "pack-quantized", quant_method: "compressed-tensors", number of bits, symmetry, group size)
  • Tokenizer artifacts (tokenizer.json, merges/vocab as applicable)
  • Optional: chat_template.jinja if a custom template is required (otherwise the upstream template is used)

Exact files may differ by branch; see the Files and versions tab for the branch you select.
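
To check which artifacts a given branch actually ships without downloading it, a small sketch using huggingface_hub (the branch name shown is W4A16-ASYM):

from huggingface_hub import list_repo_files

# List the files carried by one quant branch (revision) of the repo.
files = list_repo_files(
    "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",
)
for f in sorted(files):
    print(f)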


Quantization Notes

  • Compressed‑Tensors (CT) is a runtime format for vLLM. These branches are exported/packed for that runtime.
  • Loaders that expect raw AWQ/GPTQ tensors are not compatible with these branches.
  • Group size is typically 128 for W4A16‑ASYM; other branches may differ. Always defer to the branch’s config.json.
  • Layers like lm_head may be kept in higher precision to preserve output quality (see branch config).

If you need a domain‑matched calibration (e.g., code/legal/chatty dialog), open an issue—additional calibration variants can be added as separate branches.
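
To confirm the exact recipe a branch declares (bits, symmetry, group size, layers kept in higher precision), read the quantization_config straight from that branch's config.json; a minimal sketch, assuming the W4A16-ASYM branch:

import json
from huggingface_hub import hf_hub_download

# Fetch only config.json from the chosen branch and inspect its quantization_config.
cfg_path = hf_hub_download(
    "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    filename="config.json",
    revision="W4A16-ASYM",
)
with open(cfg_path) as f:
    cfg = json.load(f)

qcfg = cfg.get("quantization_config", {})
print(json.dumps(qcfg, indent=2))  # bits, symmetry, group size, ignored modules, etc.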


Quickstart — vLLM

Install vLLM (recent version recommended):

pip install vllm

Serve (choose one branch; example shows W4A16‑ASYM):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16

Query via Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Animus, helpful, precise, and safe."},
      {"role":"user","content":"List three robust strategies for KV-cache optimization."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
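
The same request through the openai Python client, pointed at the local vLLM server (the api_key value is a placeholder; vLLM does not require one by default):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; base_url matches the serve command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Animus, helpful, precise, and safe."},
        {"role": "user", "content": "List three robust strategies for KV-cache optimization."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)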

Tips

  • Prefer BF16 on GPUs with native support; use FP16 where BF16 is slower/unsupported.
  • Long contexts are KV‑cache heavy—tune --max-model-len and batch size.
  • Match --tensor-parallel-size to your GPU count; enable P2P/NVLink where available.
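
For batch or offline workloads, the same settings from the serve command map onto vLLM's Python API; a minimal sketch, assuming the W4A16-ASYM branch and an 8‑GPU node (adjust tensor_parallel_size and max_model_len to your hardware):

from vllm import LLM, SamplingParams

# Offline (no server) inference with the same compressed-tensors branch.
llm = LLM(
    model="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    max_model_len=65536,
    gpu_memory_utilization=0.70,
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# Note: generate() takes raw prompts; for chat-style use, render the chat template first (see below).
outputs = llm.generate(["List three robust strategies for KV-cache optimization."], params)
print(outputs[0].outputs[0].text)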

Prompting / Chat Template

This repo uses the tokenizer and chat format from the upstream L3.3‑Animus‑V10.0 release. If a chat_template.jinja is present in the branch, apply_chat_template will use it automatically.

from transformers import AutoTokenizer

# Load the tokenizer from the quant branch you plan to serve (branches use the upstream tokenizer).
mid = "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors"
tok = AutoTokenizer.from_pretrained(mid, use_fast=True, trust_remote_code=True, revision="W4A16-ASYM")

messages = [
    {"role": "system", "content": "You are Animus, helpful, precise, and safe."},
    {"role": "user", "content": "Give three strategies for KV-cache optimization."},
]

# Render the chat template and tokenize; add_generation_prompt appends the assistant turn header.
input_ids = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
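
To inspect the rendered prompt as plain text (useful when debugging a custom chat_template.jinja), the same call can return a string instead of token IDs:

# Render the same messages to a string for inspection rather than tokenizing.
prompt_text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)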

Intended Use & Compatibility

  • General instruction following, long‑form drafting, multi‑step reasoning, and RAG/agent pipelines.
  • Supported: vLLM with --quantization compressed-tensors (the purpose of this repo).
  • Not intended: llama.cpp/GGUF; raw AWQ/GPTQ loaders; vanilla 🤗 .from_pretrained() without a CT‑compatible runtime.

License & Usage

This distribution inherits the upstream license(s) and usage policies of the fine‑tuned model. Please review the upstream model card and license terms before use.

Use of the model constitutes acceptance of upstream terms.


Changelog

  • v10.0 (current) — Initial set of Compressed‑Tensors branches: W4A16-ASYM, W8A16, W8A16-ASYM; added README and branch links.
