---
language: [en, zh]
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-1.7B
tags: [quantization, gptq, int4, 4bit]
pipeline_tag: text-generation
quantization_config:
  bits: 4
  group_size: 16
  damp_percent: 0.1
  desc_act: false
  static_groups: false
  true_sequential: true
  model_name_or_path: null
  model_file_base_name: model
---

# Qwen3 1.7B GPTQ INT4

GPTQ 4-bit quantized version of Qwen/Qwen3-1.7B with group size 16.

## Model Details

- **Quantization**: GPTQ INT4 with group size 16
- **Size**: ~1 GB (about 4x smaller than the 16-bit original)
- **Format**: W4A16 (4-bit weights, 16-bit activations)
- **Compatibility**: native transformers library support

## Usage

Loading a GPTQ checkpoint through transformers typically requires the `optimum` package plus a GPTQ backend (`auto-gptq` or `gptqmodel`, depending on your transformers version).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "2imi9/qwen3-1.7b-gptq-int4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("2imi9/qwen3-1.7b-gptq-int4")

# Generate text (move inputs to the same device as the model)
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Gradio Demo

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("2imi9/qwen3-1.7b-gptq-int4", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("2imi9/qwen3-1.7b-gptq-int4")

def chat(message, history):
    # Minimal single-turn demo: previous history is not fed back to the model
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return response

gr.ChatInterface(chat).launch()
```

The small footprint and fast inference make this model a good fit for lightweight Gradio demos.
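
## Streaming Responses

For a more responsive demo, generation can be streamed token by token. The sketch below uses transformers' `TextIteratorStreamer` and relies on Gradio's `ChatInterface` accepting a generator function; the `max_new_tokens` and sampling settings are illustrative, not tuned values.

```python
from threading import Thread

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model = AutoModelForCausalLM.from_pretrained("2imi9/qwen3-1.7b-gptq-int4", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("2imi9/qwen3-1.7b-gptq-int4")

def chat_stream(message, history):
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    # The streamer yields decoded text chunks as generate() produces them
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7, streamer=streamer)
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    partial = ""
    for chunk in streamer:
        partial += chunk
        yield partial  # ChatInterface re-renders each partial response

gr.ChatInterface(chat_stream).launch()
```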
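
## Chat-Style Prompting

Qwen3 is a chat-tuned model, so instruction-style prompts generally work better when formatted with the tokenizer's chat template rather than passed as raw text. This is a minimal sketch; the example prompt is arbitrary and the template itself ships with the tokenizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("2imi9/qwen3-1.7b-gptq-int4", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("2imi9/qwen3-1.7b-gptq-int4")

messages = [
    {"role": "user", "content": "Explain GPTQ quantization in one sentence."},
]

# Render the conversation with the tokenizer's built-in chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```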
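
## Verifying the Quantized Load

To sanity-check that the 4-bit weights were actually loaded, you can inspect the quantization config stored in the model config and the in-memory footprint. This is a sketch under the assumption that your transformers version exposes both; `get_memory_footprint()` counts parameters and buffers only, so the number is approximate.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("2imi9/qwen3-1.7b-gptq-int4", device_map="auto")

# GPTQ settings (bits, group_size, ...) are carried in the model config
print(model.config.quantization_config)

# Rough in-memory size of parameters and buffers, in GB
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")
```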