L3.3‑Animus‑V10.0 — Compressed‑Tensors (vLLM runtime)

A quantized release under TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors, packaged for high‑throughput inference with vLLM using the compressed‑tensors runtime format.

TL;DR

  • Load with --quantization compressed-tensors in vLLM.
  • Use the branch selector to pick a quant format. main is a placeholder.
  • Each branch ships sharded .safetensors, tokenizer files, and a config.json declaring quantization_config.

Revisions & Branches

The main branch is a placeholder landing branch (README + pointers). All runnable artifacts live on per‑revision branches.

  • W4A16-ASYM: Compressed‑Tensors scheme, 4‑bit weights / 16‑bit activations, asymmetric. Standard CT recipe (often group size 128; see the branch's config.json).
  • W8A16: Custom CT recipe, 8‑bit weights / 16‑bit activations, symmetric (unless specified otherwise). Heavier weights; higher fidelity than W4.
  • W8A16-ASYM: Custom CT recipe, 8‑bit weights / 16‑bit activations, asymmetric. Custom asymmetric variant.

Hot links:

Future revisions may add additional formats (e.g., GPTQ/AWQ/EXL2/EXL3 exports); check the branch dropdown or the links above in this README.
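
To pull one branch locally (for offline serving or inspection), huggingface_hub's snapshot_download can fetch a single revision; a minimal sketch, assuming the W4A16-ASYM branch and an illustrative target directory:

from huggingface_hub import snapshot_download

# Download a single quant branch (revision) of the repo.
local_path = snapshot_download(
    repo_id="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",           # pick the branch you want to run
    local_dir="./animus-w4a16-asym",  # hypothetical local directory
)
print(local_path)

The downloaded directory can then be passed to vllm serve in place of the Hub ID.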


What’s in each branch

  • Sharded weights (model-00001-of-XXXX.safetensors …) + model.safetensors.index.json
  • config.json with quantization_config (e.g., format: "pack-quantized", quant_method: "compressed-tensors", number of bits, symmetry, group size)
  • Tokenizer artifacts (tokenizer.json, merges/vocab as applicable)
  • Optional: chat_template.jinja if a custom template is required (otherwise the upstream template is used)

Exact files may differ by branch; see the Files and versions tab for the branch you select.
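
To check which artifacts a given branch actually ships without downloading it, a small sketch using huggingface_hub (the branch name shown is W4A16-ASYM):

from huggingface_hub import list_repo_files

# List the files carried by one quant branch (revision) of the repo.
files = list_repo_files(
    "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",
)
for f in sorted(files):
    print(f)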


Quantization Notes

  • Compressed‑Tensors (CT) is a runtime format for vLLM. These branches are exported/packed for that runtime.
  • Loaders that expect raw AWQ/GPTQ tensors are not compatible with these branches.
  • Group size is typically 128 for W4A16‑ASYM; other branches may differ. Always defer to the branch’s config.json.
  • Layers like lm_head may be kept in higher precision to preserve output quality (see branch config).

If you need a domain‑matched calibration (e.g., code/legal/chatty dialog), open an issue—additional calibration variants can be added as separate branches.
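
To confirm the exact recipe a branch declares (bits, symmetry, group size, layers kept in higher precision), read the quantization_config straight from that branch's config.json; a minimal sketch, assuming the W4A16-ASYM branch:

import json
from huggingface_hub import hf_hub_download

# Fetch only config.json from the chosen branch and inspect its quantization_config.
cfg_path = hf_hub_download(
    "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    filename="config.json",
    revision="W4A16-ASYM",
)
with open(cfg_path) as f:
    cfg = json.load(f)

qcfg = cfg.get("quantization_config", {})
print(json.dumps(qcfg, indent=2))  # bits, symmetry, group size, ignored modules, etc.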


Quickstart — vLLM

Install vLLM (recent version recommended):

pip install vllm

Serve (choose one branch; example shows W4A16‑ASYM):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors \
  --revision W4A16-ASYM \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16

Query via Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Animus, helpful, precise, and safe."},
      {"role":"user","content":"List three robust strategies for KV-cache optimization."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
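
The same request through the openai Python client, pointed at the local vLLM server (the api_key value is a placeholder; vLLM does not require one by default):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; base_url matches the serve command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Animus, helpful, precise, and safe."},
        {"role": "user", "content": "List three robust strategies for KV-cache optimization."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)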

Tips

  • Prefer BF16 on GPUs with native support; use FP16 where BF16 is slower/unsupported.
  • Long contexts are KV‑cache heavy—tune --max-model-len and batch size.
  • Match --tensor-parallel-size to your GPU count; enable P2P/NVLink where available.
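
For batch or offline workloads, the same settings from the serve command map onto vLLM's Python API; a minimal sketch, assuming the W4A16-ASYM branch and an 8‑GPU node (adjust tensor_parallel_size and max_model_len to your hardware):

from vllm import LLM, SamplingParams

# Offline (no server) inference with the same compressed-tensors branch.
llm = LLM(
    model="TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors",
    revision="W4A16-ASYM",
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    max_model_len=65536,
    gpu_memory_utilization=0.70,
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# Note: generate() takes raw prompts; for chat-style use, render the chat template first (see below).
outputs = llm.generate(["List three robust strategies for KV-cache optimization."], params)
print(outputs[0].outputs[0].text)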

Prompting / Chat Template

This repo uses the tokenizer and chat format from the upstream L3.3‑Animus‑V10.0 release. If a chat_template.jinja is present in the branch, apply_chat_template will use it automatically.

from transformers import AutoTokenizer

# Load the tokenizer from the quant branch you plan to serve (branches use the upstream tokenizer).
mid = "TheHouseOfTheDude/L3.3-Animus-V10.0_Compressed-Tensors"
tok = AutoTokenizer.from_pretrained(mid, use_fast=True, trust_remote_code=True, revision="W4A16-ASYM")

messages = [
    {"role": "system", "content": "You are Animus, helpful, precise, and safe."},
    {"role": "user", "content": "Give three strategies for KV-cache optimization."},
]

# Render the chat template and tokenize; add_generation_prompt appends the assistant turn header.
input_ids = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
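
To inspect the rendered prompt as plain text (useful when debugging a custom chat_template.jinja), the same call can return a string instead of token IDs:

# Render the same messages to a string for inspection rather than tokenizing.
prompt_text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)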

Intended Use & Compatibility

  • General instruction following, long‑form drafting, multi‑step reasoning, and RAG/agent pipelines.
  • Supported: vLLM with --quantization compressed-tensors (the purpose of this repo).
  • Not intended: llama.cpp/GGUF; raw AWQ/GPTQ loaders; vanilla 🤗 .from_pretrained() without a CT‑compatible runtime.

License & Usage

This distribution inherits the upstream license(s) and usage policies of the fine‑tuned model. Please review the upstream model card and license terms before use.

Use of the model constitutes acceptance of upstream terms.


Changelog

  • v10.0 (current) — Initial set of Compressed‑Tensors branches: W4A16-ASYM, W8A16, W8A16-ASYM; added README and branch links.
