Fallen-Command-A-111B-v1 — Quantized (compressed-tensors for vLLM)

This repository provides quantized runtime packages of
TheDrummer/Fallen-Command-A-111B-v1, a finetune of
CohereLabs/c4ai-command-a-03-2025 (aka Command A), repackaged for vLLM using the compressed‑tensors format.

TL;DR

  • Quantized builds are published on the W4A16 and W8A16 branches of this repo.
  • Load with vLLM using --quantization compressed-tensors.
  • Command A (111B) is a dense, enterprise‑oriented model with 256K context, high throughput, and strong capabilities for tool use, agents, RAG, and multilingual tasks.

Revisions & Branches

The main branch is a landing page (model card + links). All runnable artifacts live under per‑revision branches; a download sketch follows the branch list.

  • main — placeholder / landing page
  • W4A16 — 4‑bit weights / 16‑bit activations builds and runtime assets
  • W8A16 — 8‑bit weights / 16‑bit activations builds
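
To pin a specific build, a branch can be fetched ahead of time with huggingface_hub. This is a minimal sketch (not part of the original packaging scripts); vLLM can also be pointed at a branch directly via its --revision option instead.

from huggingface_hub import snapshot_download

# Download one quantized revision (branch) locally, then serve the local path with vLLM.
local_path = snapshot_download(
    repo_id="TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors",
    revision="W4A16",  # or "W8A16"
)
print(local_path)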

Repository Contents (per revision)

  • Sharded quantized weights in .safetensors with an index (model.safetensors.index.json)
  • config.json including compressed‑tensors metadata (weight_format, quantization, quantization_config)
  • Tokenizer artifacts (tokenizer.json, tokenizer.model, etc.)
  • Optional: chat_template.jinja (inherits the parent finetune’s chat format)

Exact files can differ by branch; see the Files and versions tab for each revision.


About Command A (how it differs from Qwen/Qwen3 and others)

  • Dense 111B (not MoE): All parameters are active at inference; optimized for throughput and enterprise reliability.
  • 256K context: supports very long conversations and documents.
  • Enterprise agentic focus: excels at tool use, RAG, agents, and multilingual tasks.
  • Efficiency: designed for high tokens/sec and practical deployment footprints compared to similarly strong models.

See the Command A resources for details (technical report, model card, and product docs).


Quantization recipe & implementation notes (from the attached script)

The W4A16 builds in this repo were produced with a modern AWQ recipe via llm‑compressor (the AutoAWQ successor). Key choices (a reproduction sketch follows these notes):

  • Scheme: W4A16, symmetric INT4 weights, group_size=128 targeting Linear layers.
  • Ignored: lm_head left in higher precision.
  • Calibration data: wikitext-2-raw-v1 train[:256], shuffled, preprocessed to text.
  • Calibration setup: num_calibration_samples=128, max_seq_length=256.
  • Orchestration: uses oneshot() to stream layers—no manual device map / offloading; relies on llm‑compressor’s memory management.
  • Export: saved with save_compressed=True to include compressed‑tensors runtime metadata for vLLM.
  • Runtime dtype: activations served in BF16/FP16 (A16) at inference.
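
The following is a minimal reproduction sketch of that recipe, not the exact production script; the module paths and the AWQModifier/oneshot signatures reflect recent llm‑compressor releases and may differ slightly in other versions.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "TheDrummer/Fallen-Command-A-111B-v1"
NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 256

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: wikitext-2-raw-v1 train[:256], shuffled, preprocessed to text then tokenized.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:256]").shuffle(seed=42)
ds = ds.map(
    lambda sample: tokenizer(
        sample["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False
    ),
    remove_columns=ds.column_names,
)

# W4A16 preset: symmetric INT4 weights, group_size=128, applied to Linear layers; lm_head is skipped.
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

# One-shot calibration; llm-compressor streams layers and manages memory (no manual device map).
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Export with compressed-tensors runtime metadata so vLLM can load the result directly.
save_dir = "Fallen-Command-A-111B-v1-W4A16"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)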

The INT8‑W8A16 branch follows the same structure, trading slightly higher memory for extra stability on some workloads.


Quickstart — vLLM (compressed‑tensors)

Install vLLM (recent version recommended):

pip install vllm

Serve (adjust to your hardware):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 256000 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16

Query via Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Command-A (finetuned), helpful, precise, and safe."},
      {"role":"user","content":"Outline a retrieval pipeline for multilingual legal documents."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
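
Equivalently, from Python with the OpenAI client pointed at the vLLM server (same model name and default port; the api_key value is a placeholder, since vLLM does not require one unless started with --api-key):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Command-A (finetuned), helpful, precise, and safe."},
        {"role": "user", "content": "Outline a retrieval pipeline for multilingual legal documents."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)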

Note: compressed‑tensors is a vLLM runtime format. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. For Transformers, use a different export (e.g., GPTQ/AWQ compatible) or full‑precision weights.
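
For offline (non-server) use, the same flags map onto vLLM's Python API. A minimal sketch, assuming an 8‑GPU host and a shorter context than the 256K maximum to keep the KV cache manageable:

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors",
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    max_model_len=32768,          # raise toward 256K if your KV-cache budget allows
    gpu_memory_utilization=0.70,
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
outputs = llm.chat(
    messages=[{"role": "user", "content": "Outline a retrieval pipeline for multilingual legal documents."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)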


Prompting / Chat Template

This package follows the parent finetune’s chat conventions. If a chat_template.jinja is present in the branch, apply_chat_template will use it automatically.
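
To inspect the rendered prompt without running the model, the tokenizer alone can be loaded from a branch with Transformers (tokenizer loading works even though full-model inference requires vLLM); the revision below is one of the branches listed above:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Fallen-Command-A-111B-v1_Compressed-Tensors",
    revision="W4A16",
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # rendered via chat_template.jinja when present in the branch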


Lineage

  • Base model: CohereLabs/c4ai-command-a-03-2025 (Command A)
  • Finetune: TheDrummer/Fallen-Command-A-111B-v1
  • This repo: compressed‑tensors quantizations of the finetune for vLLM (W4A16 and W8A16 branches)

Hardware & Tips (rule‑of‑thumb)

  • 111B dense models typically require multi‑GPU deployments for best throughput.
  • Long contexts are KV‑cache heavy—tune --max-model-len and batch size.
  • Prefer BF16 on GPUs with native support; otherwise FP16.
  • Consider CUDA Graphs if stable in your stack.

License & Usage

This distribution inherits the licenses/policies of the finetuned parent and its base model.
Use of the model constitutes acceptance of the upstream terms.


Changelog

  • v1 (current) — Quantized compressed‑tensors exports for Fallen‑Command‑A‑111B‑v1; added W4A16 and INT8‑W8A16 branches; model card set for Quantized classification.