---
base_model:
- huihui-ai/Qwen3-8B-abliterated
tags:
- qwen
- '3'
- abliterated
- gptq
- int8
---
# Model Card: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16

## Model Overview

- **Model Name:** groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- **Base Model:** huihui-ai/Qwen3-8B-abliterated
- **Description:** This is a quantized version of the uncensored huihui-ai/Qwen3-8B-abliterated model, which is derived from Qwen/Qwen3-8B. The model has been quantized to GPTQ Int8 W8A16 for maximum inference speed on NVIDIA RTX 3090 GPUs. Abliteration was performed using a novel, faster method to remove refusals, making this a proof-of-concept implementation of uncensored language model behavior.
- **Important Note:** A newer version, huihui-ai/Huihui-Qwen3-8B-abliterated-v2, is available. Consider using the updated version for improved performance.
## Quantization Details

- **Quantization Method:** GPTQ Int8 W8A16 (8-bit weights, 16-bit activations)
- **Purpose:** Optimized for high-speed inference on NVIDIA RTX 3090 GPUs, reducing the memory footprint while maintaining performance.
- **Impact:** Provides faster inference than the unquantized model, making it suitable for resource-constrained environments.
- **Model Size:** 2.98B parameters
- **Tensor Types:** I64, I32, F16
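The card does not state which toolchain produced this checkpoint. As a rough, hypothetical sketch of how a comparable W8A16 GPTQ checkpoint could be built from the base model, one option is the llm-compressor library; the calibration dataset, sample count, and output path below are illustrative placeholders, not the settings actually used.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Hypothetical reproduction sketch, not the recipe used for this checkpoint:
# quantize all Linear layers to 8-bit weights, keep 16-bit activations (W8A16).
recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(
    model="huihui-ai/Qwen3-8B-abliterated",
    dataset="open_platypus",          # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen3-8B-abliterated-GPTQ-W8A16",
)
```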
## Usage

### Using with vLLM

The model can be used with vLLM for efficient inference. Below is an example of how to set up and run the model with vLLM in Python:
```python
from vllm import LLM, SamplingParams

# Define the model ID
MODEL_ID = "groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16"

# Initialize the vLLM engine
llm = LLM(
    model=MODEL_ID,
    dtype="float16",             # vLLM's GPTQ kernels expect float16 activations
    trust_remote_code=True,
    quantization="gptq",         # the checkpoint is pre-quantized with GPTQ
    gpu_memory_utilization=0.9,  # adjust based on your GPU memory
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=8192,
    stop=["/exit"],  # stop generation if the model itself emits the exit command
)

# Interactive chat loop
system_prompt = "You are a helpful assistant."
messages = [{"role": "system", "content": system_prompt}]

while True:
    user_input = input("User: ").strip()

    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = [{"role": "system", "content": system_prompt}]
        print("Chat history cleared. Starting a new conversation.")
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue

    # Append user input to the conversation history
    messages.append({"role": "user", "content": user_input})

    # Format the conversation as a plain-text prompt
    prompt = "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)

    # Generate a response
    outputs = llm.generate([prompt], sampling_params)
    response = outputs[0].outputs[0].text.strip()

    # Print the response and append it to the history
    print(f"Assistant: {response}")
    messages.append({"role": "assistant", "content": response})
```
### Installation Requirements

To use the model with vLLM, ensure you have vLLM installed:

```bash
pip install vllm
```
### Notes

- The model is pre-quantized to GPTQ Int8 W8A16, so specify `quantization="gptq"` when initializing the `LLM` object.
- Adjust `gpu_memory_utilization` based on your GPU's memory capacity to avoid out-of-memory errors.
- The `max_tokens` parameter can be increased for longer responses, but this may impact performance.
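As a concrete, purely illustrative example of the memory-related settings above, the values below are plausible starting points for a single 24 GB RTX 3090, not tuned recommendations from the authors:

```python
from vllm import LLM

# Illustrative settings for a single 24 GB GPU (e.g. RTX 3090); adjust to your setup.
llm = LLM(
    model="groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16",
    quantization="gptq",
    dtype="float16",
    max_model_len=8192,           # cap the context window to bound the KV cache
    gpu_memory_utilization=0.85,  # leave headroom for other processes on the GPU
)
```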
## Performance

### Pass Rate for Harmful Instructions

The pass rate measures the proportion of harmful instructions that do not trigger refusals, calculated as `(total - triggered_total) / total`. The test set is sourced from huihui-ai/harmbench_behaviors and evaluated using `TestPassed.py`.
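As a small illustration of this formula (not the authors' `TestPassed.py` script), the numbers in the comparison table below follow directly from it:

```python
# Illustration only (not the authors' TestPassed.py): the pass rate as defined above.
def pass_rate(total: int, triggered_total: int) -> float:
    """Fraction of harmful instructions that did NOT trigger a refusal."""
    return (total - triggered_total) / total

print(f"{pass_rate(320, 125):.2%}")  # base Qwen3-8B: 195/320 passed -> 60.94%
print(f"{pass_rate(320, 0):.2%}")    # abliterated model: 320/320 passed -> 100.00%
```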
**Test Results:**

- Model: huihui-ai/Qwen3-8B-abliterated
- Passed Total: 320/320
- Passed Ratio: 1.00 (100.00%)

**Comparison:**

| Model                | Passed Total | Passed Ratio |
|----------------------|--------------|--------------|
| Qwen3-8B             | 195/320      | 60.94%       |
| Qwen3-8B-abliterated | 320/320      | 100.00%      |
**Note:** The test provides a preliminary assessment. For comprehensive results, consider increasing the `max_tokens` value during evaluation.
## Limitations

- This model is a proof of concept: abliteration removes refusals, which may lead to unpredictable behavior on certain inputs.
- Quantization to GPTQ Int8 W8A16 may introduce minor quality trade-offs relative to the unquantized model, even though it is optimized for speed.
- Users should verify outputs for sensitive applications, as the model is uncensored and may generate harmful or inappropriate content.
## References

- Repository: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- Original Model: Qwen/Qwen3-8B
- Abliteration Method: remove-refusals-with-transformers
- Test Set: huihui-ai/harmbench_behaviors
- Newer Version: huihui-ai/Huihui-Qwen3-8B-abliterated-v2