---
license: apache-2.0
tags:
- qwen
- qwen3-4b
- thinking
- quantized
- w8a8
- llm-compressor
base_model:
- Qwen/Qwen3-4B-Thinking-2507
base_model_relation: quantized
pipeline_tag: text-generation
---

# Qwen3-4B-Thinking-2507 - W8A8 Quantized

This is a W8A8 (8-bit weights, 8-bit activations) quantized version of **[Qwen/Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)**, created with [LLM-Compressor](https://github.com/neuralmagic/llm-compressor).

This model was quantized by [itroot](https://huggingface.co/itroot).

## Quantization Recipe

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

model_id = "Qwen/Qwen3-4B-Thinking-2507"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
    offload_folder="./offload_tmp",
    # Memory caps sized for 2x RTX 3090 (24 GB each), with CPU spillover.
    max_memory={0: "22GB", 1: "22GB", "cpu": "64GB"},
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data used to collect activation statistics for SmoothQuant/GPTQ.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

print("Loading and preprocessing calibration dataset...")
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    # Render each conversation with the model's own chat template.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)
print("Dataset ready.")

# SmoothQuant first migrates activation outliers into the weights, then GPTQ
# quantizes every Linear layer to W8A8; the lm_head is left in full precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

output_dir = "./Qwen3-4B-Thinking-2507-W8A8"
print(f"Starting one-shot quantization. Output will be in '{output_dir}'")
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=output_dir,
)
print("Quantization complete.")

SAVE_DIR = "Qwen3-4B-Thinking-2507-W8A8"
print(f"Saving compressed model and tokenizer to '{SAVE_DIR}'...")
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
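## Usage

The compressed checkpoint can be served with vLLM, which supports W8A8 models saved in the compressed-tensors format. Below is a minimal sketch; the repo id `itroot/Qwen3-4B-Thinking-2507-W8A8` is an assumption, so point it at wherever the checkpoint actually lives on the Hub (or a local path).

```python
from vllm import LLM, SamplingParams

# Assumed Hub repo id -- replace with the actual checkpoint location.
llm = LLM(model="itroot/Qwen3-4B-Thinking-2507-W8A8")

# Thinking models emit a reasoning trace before the answer, so leave a
# generous token budget for generation.
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)

outputs = llm.chat(
    [{"role": "user", "content": "What does W8A8 quantization change at inference time?"}],
    params,
)
print(outputs[0].outputs[0].text)
```

Because both weights and activations are INT8, this variant cuts weight memory roughly in half versus FP16 and can use INT8 tensor-core kernels, so it is most useful on GPUs with fast INT8 paths (Ampere and newer).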