Fallen-Command-A-111B-v1 — Quantized (compressed-tensors for vLLM)
This repository provides quantized runtime packages of
TheDrummer/Fallen-Command-A-111B-v1, a finetune of
CohereLabs/c4ai-command-a-03-2025 (aka Command A), repackaged for vLLM using the compressed‑tensors format.
TL;DR
- This repo provides quantized builds on two branches: W4A16 and W8A16.
- Load with vLLM using --quantization compressed-tensors.
- Command A (111B) is a dense, enterprise-oriented model with 256K context, high throughput, and strong capabilities for tool use, agents, RAG, and multilingual tasks.
Revisions & Branches
The main branch is a landing page (model card + links). All runnable artifacts live under per-revision branches.
- main — placeholder / landing page
- W4A16 — 4‑bit weights / 16‑bit activations builds and runtime assets
- W8A16 — 8‑bit weights / 16‑bit activations builds
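Because the weights live on branches rather than on main, a launcher has to pin the branch explicitly. The sketch below assembles a `vllm serve` command for a chosen branch; the helper names `serve_command` and `serve_command_line` are hypothetical, and the default tensor-parallel size is only an assumption to adjust for your hardware.

```python
import shlex

REPO_ID = "TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors"
# Runnable branches listed above; main is only a landing page.
BRANCHES = ("W4A16", "W8A16")


def serve_command(branch, tensor_parallel_size=8):
    """Build the argv for `vllm serve`, pinned to one quantized branch."""
    if branch not in BRANCHES:
        raise ValueError(f"unknown branch {branch!r}; expected one of {BRANCHES}")
    return [
        "vllm", "serve", REPO_ID,
        "--revision", branch,  # selects the branch that holds the weights
        "--quantization", "compressed-tensors",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def serve_command_line(branch, tensor_parallel_size=8):
    """Same command as a single shell-quoted string."""
    return shlex.join(serve_command(branch, tensor_parallel_size))
```

Printing `serve_command_line("W4A16")` gives a line that can be pasted into a shell; asking for "main" raises a `ValueError` instead of producing a command that would find no weights.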
Repository Contents (per revision)
- Sharded quantized weights in .safetensors with an index (model.safetensors.index.json)
- config.json including compressed-tensors metadata (weight_format, quantization, quantization_config)
- Tokenizer artifacts (tokenizer.json, tokenizer.model, etc.)
- Optional: chat_template.jinja (inherits the parent finetune's chat format)
Exact files can differ by branch; see the Files and versions tab for each revision.
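To check what a downloaded branch actually contains, a short script can read the two JSON files named above. This is a minimal sketch using only the standard library; the function names and the local checkout path passed to `inspect_checkout` are hypothetical, and it assumes the index follows the usual safetensors layout with a `weight_map` from tensor name to shard file.

```python
import json
from collections import Counter
from pathlib import Path

# Metadata keys this card says config.json may carry.
QUANT_KEYS = ("weight_format", "quantization", "quantization_config")


def shard_summary(index):
    """Count tensors per shard file from a parsed safetensors index."""
    return dict(Counter(index["weight_map"].values()))


def quant_keys_present(config):
    """Return which of the expected quantization keys appear in config.json."""
    return [key for key in QUANT_KEYS if key in config]


def inspect_checkout(root):
    """Print a short report for a local checkout of one branch."""
    root_path = Path(root)
    index = json.loads((root_path / "model.safetensors.index.json").read_text())
    config = json.loads((root_path / "config.json").read_text())
    for shard, count in sorted(shard_summary(index).items()):
        print(f"{shard}: {count} tensors")
    print("quantization keys in config.json:", quant_keys_present(config))
```

A branch whose config.json reports none of the three keys is probably not a compressed-tensors export and is worth double-checking before serving.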
About Command A (how it differs from Qwen/Qwen3 and others)
- Dense 111B (not MoE): All parameters are active at inference; optimized for throughput and enterprise reliability.
- 256K context: supports very long conversations and documents.
- Enterprise agentic focus: excels at tool use, RAG, agents, and multilingual tasks.
- Efficiency: designed for high tokens/sec and practical deployment footprints compared to similarly strong models.
See the Command A resources for details (technical report, model card, and product docs).
Quantization recipe & implementation notes (from the attached script)
The W4A16 builds in this repo were produced with a modern AWQ recipe via llm-compressor (AutoAWQ successor). Key choices:
- Scheme: W4A16, symmetric INT4 weights, group_size=128 targeting Linear layers.
- Ignored: lm_head left in higher precision.
- Calibration data: wikitext-2-raw-v1 train[:256], shuffled, preprocessed to text.
- Calibration setup: num_calibration_samples=128, max_seq_length=256.
- Orchestration: uses oneshot() to stream layers; no manual device map / offloading; relies on llm-compressor's memory management.
- Export: saved with save_compressed=True to include compressed-tensors runtime metadata for vLLM.
- Runtime dtype: activations served in BF16/FP16 (A16) at inference.
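To make "symmetric INT4 weights, group_size=128" concrete, here is a plain-Python sketch of symmetric group-wise round-to-nearest quantization. It is not the AWQ algorithm and not llm-compressor's implementation: AWQ additionally rescales weights using activation statistics from the calibration data before rounding. The function names are hypothetical; only the storage scheme (one scale per group of 128 weights, integer codes in [-8, 7]) reflects the recipe described above.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # largest code, 7 for INT4
    qmin = -(2 ** (bits - 1))    # smallest code, -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # Symmetric: a single scale per group and no zero point.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]
```

Each reconstructed weight differs from the original by at most half of its group's scale, which is why smaller groups (more scales) reduce error at the cost of extra metadata.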
The W8A16 branch (INT8 weights, 16-bit activations) follows the same structure, trading a larger memory footprint for extra stability on some workloads.
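To put the memory difference between the two branches in rough numbers, the arithmetic below estimates weight storage only. It is a back-of-the-envelope sketch under loud assumptions: all 111B parameters are treated as quantized (in practice lm_head, and typically the embeddings, stay in higher precision), each group carries one 16-bit scale, and the W8A16 branch's group size is not stated in this card, so its scale overhead is left out.

```python
def weight_gb(params, weight_bits, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (1e9 bytes).

    params       number of weights
    weight_bits  bits per stored weight (4, 8, or 16)
    group_size   weights sharing one scale; None skips scale overhead
    """
    total_bytes = params * weight_bits / 8
    if group_size:
        total_bytes += (params / group_size) * scale_bits / 8
    return total_bytes / 1e9


PARAMS = 111e9  # parameter count taken from the model name

w4a16 = weight_gb(PARAMS, 4, group_size=128)  # INT4 weights plus per-group scales
w8a16 = weight_gb(PARAMS, 8)                  # INT8 weights, scale overhead ignored
bf16 = weight_gb(PARAMS, 16)                  # unquantized reference
```

Under these assumptions the W4A16 weights need roughly a quarter of the BF16 footprint and about half of W8A16; the KV cache and activation memory come on top of this at serving time.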
Quickstart — vLLM (compressed‑tensors)
Install vLLM (recent version recommended):
pip install vllm
Serve a quantized branch (adjust to your hardware; main holds no weights, so pin the branch with --revision):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors \
  --revision W4A16 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 256000 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
Query via Chat Completions:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors",
"messages": [
{"role":"system","content":"You are Command-A (finetuned), helpful, precise, and safe."},
{"role":"user","content":"Outline a retrieval pipeline for multilingual legal documents."}
],
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.95
}'
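The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server from the serve step is listening on localhost:8000 and returns the usual OpenAI-style chat completion body. The helper names `build_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors"
SYSTEM_PROMPT = "You are Command-A (finetuned), helpful, precise, and safe."


def build_payload(user_prompt, max_tokens=512):
    """Build the JSON body, mirroring the curl example."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.95,
    }


def chat(user_prompt, url=BASE_URL):
    """POST one chat request and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Outline a retrieval pipeline for multilingual legal documents.")` returns the generated text as a string.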
Note: compressed-tensors is a vLLM runtime format. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. For Transformers, use a different export (e.g., GPTQ/AWQ compatible) or full-precision weights.
Prompting / Chat Template
This package follows the parent finetune's chat conventions. If a chat_template.jinja is present in the branch, apply_chat_template will use it automatically.
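To preview the prompt string the template produces, something like the following may help. It is a sketch that assumes the `transformers` package is installed and that the chosen branch ships tokenizer files; `render_prompt` and `load_tokenizer` are hypothetical helper names, and only the tokenizer is loaded, never the quantized weights.

```python
def render_prompt(tokenizer, messages):
    """Render chat messages to a prompt string with the tokenizer's chat template."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text rather than token ids
        add_generation_prompt=True,  # end with the assistant turn opener
    )


def load_tokenizer(repo_id, branch):
    """Load only the tokenizer from one branch of the repository."""
    # Imported lazily so render_prompt stays usable without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id, revision=branch)
```

When requests go through the vLLM chat completions endpoint, the server applies the template itself, so this is mainly useful for debugging prompts or for calling the plain completions endpoint.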
Lineage
- Base model: CohereLabs/c4ai-command-a-03-2025
- Finetuned parent: TheDrummer/Fallen-Command-A-111B-v1
- This repo: Quantized child of the finetune (compressed‑tensors for vLLM)
Hardware & Tips (rule‑of‑thumb)
- 111B dense models typically require multi‑GPU deployments for best throughput.
- Long contexts are KV-cache heavy: tune --max-model-len and batch size.
- Prefer BF16 on GPUs with native support; otherwise FP16.
- Consider CUDA Graphs if stable in your stack.
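To see why long contexts are KV-cache heavy, the standard per-sequence estimate can be sketched as below. The shape values in the example (64 layers, 8 KV heads, head dimension 128) are illustrative placeholders, not read from this model's config.json, and the formula assumes full attention in every layer with a 16-bit cache; models that use sliding-window layers or a quantized KV cache need less.

```python
def kv_cache_gb(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in GB (1e9 bytes) for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / 1e9


# Hypothetical shape, for illustration only.
EXAMPLE_SHAPE = {"num_layers": 64, "num_kv_heads": 8, "head_dim": 128}

short_context = kv_cache_gb(32_000, **EXAMPLE_SHAPE)
long_context = kv_cache_gb(256_000, **EXAMPLE_SHAPE)
```

The cost grows linearly with context length and with the number of concurrent sequences, which is why lowering --max-model-len or the batch size is usually the first lever when memory runs out.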
License & Usage
This distribution inherits the licenses/policies of the finetuned parent and its base model.
Use of the model constitutes acceptance of the upstream terms.
Changelog
- v1 (current) — Quantized compressed‑tensors exports for Fallen‑Command‑A‑111B‑v1; added W4A16 and INT8‑W8A16 branches; model card set for Quantized classification.