---
base_model:
- huihui-ai/Qwen3-8B-abliterated
tags:
- qwen
- '3'
- abliterated
- gptq
- int8
---
# Model Card: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16

## Model Overview

- **Model Name:** groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- **Base Model:** huihui-ai/Qwen3-8B-abliterated
- **Description:** This is a quantized version of the uncensored huihui-ai/Qwen3-8B-abliterated model, itself derived from Qwen/Qwen3-8B. The model has been quantized to GPTQ Int8 W8A16 (8-bit weights, 16-bit activations) for fast inference on NVIDIA RTX 3090 GPUs. Abliteration was performed using a novel, faster method of removing refusals, making this a proof-of-concept implementation of uncensored language-model behavior.

**Important Note:** A newer version, huihui-ai/Huihui-Qwen3-8B-abliterated-v2, is available; consider using it for improved performance.

## Quantization Details

- **Quantization Method:** GPTQ Int8 W8A16 (8-bit weights, 16-bit activations)
- **Purpose:** Optimized for high-speed inference on NVIDIA RTX 3090 GPUs, reducing the memory footprint while maintaining performance.
- **Impact:** Faster inference than the unquantized model, suitable for resource-constrained environments.
- **Model Size:** 2.98B reported parameters (lower than the base model's 8B because GPTQ packs the quantized weights into integer tensors)
- **Tensor Types:** I64, I32, F16
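
As a sanity check on why 8-bit weights matter on a 24 GB RTX 3090, here is a back-of-envelope sketch; the parameter count and the resulting figures are approximations, not measured values:

```python
# Back-of-envelope weight-memory estimate for an ~8B-parameter model.
# Real usage is higher: vLLM also allocates KV cache and activation buffers.

N_PARAMS = 8.2e9  # approximate parameter count of Qwen3-8B

fp16_gb = N_PARAMS * 2 / 1e9  # 16-bit weights: ~16.4 GB, tight on a 24 GB card
int8_gb = N_PARAMS * 1 / 1e9  # W8A16 8-bit weights: ~8.2 GB, leaves headroom for KV cache

print(f"FP16 weights: ~{fp16_gb:.1f} GB")
print(f"INT8 weights: ~{int8_gb:.1f} GB")
```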

## Usage

### Using with vLLM

The model can be used with vLLM for efficient inference. Below is an example of how to set up and run the model in an interactive chat loop using vLLM in Python:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Define model ID
MODEL_ID = "groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16"

# Load the tokenizer so the conversation can be formatted with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Initialize the vLLM engine
llm = LLM(
    model=MODEL_ID,
    dtype="auto",                # let vLLM pick a dtype compatible with the GPTQ checkpoint
    trust_remote_code=True,
    quantization="gptq",         # the checkpoint is pre-quantized with GPTQ
    gpu_memory_utilization=0.9,  # adjust based on your GPU memory
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=8192,
)

# Interactive chat loop; type /exit to quit, /clear to reset the history
system_prompt = "You are a helpful assistant."
messages = [{"role": "system", "content": system_prompt}]

while True:
    user_input = input("User: ").strip()
    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = [{"role": "system", "content": system_prompt}]
        print("Chat history cleared. Starting a new conversation.")
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue

    # Append user input to the conversation history
    messages.append({"role": "user", "content": user_input})

    # Format the conversation with the model's own chat template
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Generate a response
    outputs = llm.generate([prompt], sampling_params)
    response = outputs[0].outputs[0].text.strip()

    # Print the response and append it to the history
    print(f"Assistant: {response}")
    messages.append({"role": "assistant", "content": response})
```

## Installation Requirements

To use the model with vLLM, ensure you have vLLM installed:

```bash
pip install vllm
```
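
For serving over HTTP, recent vLLM versions also ship an OpenAI-compatible server. A minimal invocation might look like the following; the flags shown are standard vLLM options, and the values are illustrative, so adjust them for your hardware:

```bash
# Serve the model behind vLLM's OpenAI-compatible API (default port 8000)
vllm serve groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16 \
    --quantization gptq \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```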

## Notes

- The model is pre-quantized to GPTQ Int8 W8A16, so specify `quantization="gptq"` when initializing the `LLM` object.
- Adjust `gpu_memory_utilization` based on your GPU's memory capacity to avoid out-of-memory errors.
- The `max_tokens` parameter can be increased for longer responses, but this may impact performance.
- The model is not currently deployed by any inference provider; for provider support, contact the repository maintainer on Hugging Face.

## Performance

### Pass Rate for Harmful Instructions

The pass rate measures the proportion of harmful instructions that do not trigger refusals, calculated as `(total - triggered_total) / total`. The test set is sourced from huihui-ai/harmbench_behaviors and evaluated using TestPassed.py.
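
For intuition, the ratio itself is trivial to reproduce. The sketch below is a minimal, hypothetical helper (the refusal-detection logic that produces `triggered_total` lives in TestPassed.py and is not reproduced here); the counts are taken from the comparison table below:

```python
# Minimal sketch of the pass-rate formula: (total - triggered_total) / total.
# What counts as a "triggered" refusal is decided by TestPassed.py; this
# helper only computes the ratio from its counts.

def pass_rate(total: int, triggered_total: int) -> float:
    """Fraction of harmful instructions that did NOT trigger a refusal."""
    return (total - triggered_total) / total

# Counts from the comparison table below:
print(f"{pass_rate(320, 125):.2%}")  # Qwen3-8B: 195/320 passed -> 60.94%
print(f"{pass_rate(320, 0):.2%}")    # Qwen3-8B-abliterated: 320/320 -> 100.00%
```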

Test results for huihui-ai/Qwen3-8B-abliterated:

- **Passed Total:** 320/320
- **Passed Ratio:** 1.00 (100.00%)

Comparison:

| Model | Passed Total | Passed Ratio |
|---|---|---|
| Qwen3-8B | 195/320 | 60.94% |
| Qwen3-8B-abliterated | 320/320 | 100.00% |

**Note:** The test provides a preliminary assessment; for more comprehensive results, consider increasing the `max_tokens` value during evaluation.

## Limitations

- This model is a proof-of-concept: abliteration removes refusals, which may lead to unpredictable behavior on certain inputs.
- Quantization to GPTQ Int8 W8A16 may introduce minor quality trade-offs relative to the unquantized model, in exchange for faster inference and a smaller memory footprint.
- Users should verify outputs for sensitive applications, as the model is uncensored and may generate harmful or inappropriate content.

## References

- Repository: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- Base Model: Qwen/Qwen3-8B
- Abliteration Method: remove-refusals-with-transformers
- Test Set: huihui-ai/harmbench_behaviors
- Newer Version: huihui-ai/Huihui-Qwen3-8B-abliterated-v2