+---
+base_model:
+- huihui-ai/Qwen3-8B-abliterated
+tags:
+- qwen
+- '3'
+- abliterated
+- gptq
+- int8
+---
+Model Card: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
+Model Overview
+Model Name: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
+Base Model: huihui-ai/Qwen3-8B-abliterated
+Description: This is a quantized version of the uncensored huihui-ai/Qwen3-8B-abliterated model, which is itself derived from Qwen/Qwen3-8B. The weights are quantized with GPTQ to 8-bit integers while activations stay in 16-bit precision (W8A16), targeting fast inference on NVIDIA RTX 3090 GPUs. The base model's abliteration used a new, faster method to remove refusals, so this model should be treated as a proof-of-concept for uncensored language model behavior.
+Important Note: A newer version, huihui-ai/Huihui-Qwen3-8B-abliterated-v2, is available. Consider using the updated version for improved performance.
+Quantization Details
+Quantization Method: GPTQ Int8 W8A16
+Purpose: High-speed inference on NVIDIA RTX 3090 GPUs. 8-bit weights take roughly half the memory of 16-bit weights, with output quality intended to stay close to the unquantized model.
+Impact: Provides faster inference compared to the unquantized model, suitable for resource-constrained environments.
+Model Size: 2.98B parameters (the count reported for the packed quantized tensors; the underlying model is Qwen3-8B)
+Tensor Types: I64, I32, F16
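As a rough illustration of the memory reduction described above, the sketch below estimates weight storage at 16-bit and 8-bit precision. The parameter count of 8.2e9 is an assumption based on the Qwen3-8B name, and `weight_memory_gib` is a hypothetical helper. The estimate ignores quantization scales, any layers left unquantized, activations, and the KV cache, so actual usage will differ.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB at the given precision."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: approximate parameter count of Qwen3-8B.
NUM_PARAMS = 8.2e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int8_gib = weight_memory_gib(NUM_PARAMS, 8)

print(f"FP16 weights: ~{fp16_gib:.1f} GiB")
print(f"Int8 weights: ~{int8_gib:.1f} GiB")
```

On a 24 GB card such as the RTX 3090, the difference mostly shows up as extra room for the KV cache and longer contexts.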
+Usage
+Using with vLLM
+The model can be used with vLLM for efficient inference. Below is an example of how to set up and run the model using vLLM in Python:
+from vllm import LLM, SamplingParams
+# Define model ID
+MODEL_ID = "groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16"
+# Initialize the vLLM model
+llm = LLM(
+    model=MODEL_ID,
+    dtype="float16",  # Matches the checkpoint's F16 tensors; vLLM's GPTQ kernels generally expect half precision
+    trust_remote_code=True,
+    quantization="gptq",  # Specify GPTQ quantization
+    gpu_memory_utilization=0.9,  # Adjust based on your GPU memory
+)
+# Define sampling parameters
+sampling_params = SamplingParams(
+    temperature=0.7,
+    max_tokens=8192,
+    stop=["/exit"],  # Stops generation if the model emits "/exit"; the loop below handles the user's /exit command separately
+)
+# Interactive chat loop
+system_prompt = "You are a helpful assistant."
+messages = [{"role": "system", "content": system_prompt}]
+while True:
+    user_input = input("User: ").strip()
+    if user_input.lower() == "/exit":
+        print("Exiting chat.")
+        break
+    if user_input.lower() == "/clear":
+        messages = [{"role": "system", "content": system_prompt}]
+        print("Chat history cleared. Starting a new conversation.")
+        continue
+    if not user_input:
+        print("Input cannot be empty. Please enter something.")
+        continue
+    # Append user input to messages
+    messages.append({"role": "user", "content": user_input})
+    # Generate a response; llm.chat applies the model's chat template to the messages
+    outputs = llm.chat(messages, sampling_params)
+    response = outputs[0].outputs[0].text.strip()
+    # Print and append response
+    print(f"Assistant: {response}")
+    messages.append({"role": "assistant", "content": response})
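Qwen chat models are trained on prompts in the ChatML format, which is what the model's chat template produces from a list of messages. The sketch below shows roughly what that formatting looks like. `to_chatml` is a hypothetical helper written for illustration only. The authoritative template ships with the tokenizer and also handles details this sketch omits (such as thinking and tool-call markup), so in practice `tokenizer.apply_chat_template` or vLLM's `llm.chat` is the safer route.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(to_chatml(example))
```

A plain "role: content" join does not contain these special tokens, so a chat-tuned model may not recognize where turns begin and end.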
+Installation Requirements
+To use the model with vLLM, ensure you have vLLM installed:
+pip install vllm
+Notes
+The model is pre-quantized to GPTQ Int8 W8A16, so specify quantization="gptq" when initializing the LLM object.
+Adjust gpu_memory_utilization based on your GPU's memory capacity to avoid out-of-memory errors.
+The max_tokens parameter can be increased for longer responses, but longer generations take more time and more KV-cache memory.
+The model is not deployed by any inference provider. For provider support, contact the repository maintainers at Hugging Face.
+Performance
+Pass Rate for Harmful Instructions
+The pass rate measures the proportion of harmful instructions that do not trigger refusals, calculated as (total - triggered_total) / total. The test set is sourced from huihui-ai/harmbench_behaviors, evaluated using TestPassed.py.
+Test Results:
+Model: huihui-ai/Qwen3-8B-abliterated
+Passed Total: 320/320
+Passed Ratio: 1.00 (100.00%)
+Comparison:
+| Model | Passed Total | Passed Ratio |
+|---|---|---|
+| Qwen3-8B | 195/320 | 60.94% |
+| Qwen3-8B-abliterated | 320/320 | 100.00% |
+Note: The test provides a preliminary assessment. For comprehensive results, consider increasing the max_tokens value during evaluation.
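The pass-rate formula given in this section, (total - triggered_total) / total, can be checked against the reported figures with a few lines of Python. This is a minimal sketch, not the TestPassed.py script used for the evaluation. `pass_rate` is a hypothetical helper, and the refusal counts are derived from the comparison above (320 prompts, 195 passed for Qwen3-8B, so 125 refusals).

```python
def pass_rate(total: int, triggered_total: int) -> float:
    """Fraction of instructions that did not trigger a refusal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - triggered_total) / total


# Refusal counts derived from the reported Passed Total values.
results = {"Qwen3-8B": 125, "Qwen3-8B-abliterated": 0}

for name, triggered in results.items():
    print(f"{name}: {pass_rate(320, triggered):.2%}")
```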
+Limitations
+This model is a proof-of-concept with abliteration to remove refusals, which may lead to unpredictable behavior on certain inputs.
+Quantization to GPTQ Int8 W8A16 is optimized for speed and may cost some accuracy relative to the unquantized model.
+Users should verify outputs for sensitive applications, as the model is uncensored and may generate harmful or inappropriate content.
+References
+Repository: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
+Original Model: Qwen/Qwen3-8B
+Abliteration Method: remove-refusals-with-transformers
+Test Set: huihui-ai/harmbench_behaviors
+Newer Version: huihui-ai/Huihui-Qwen3-8B-abliterated-v2