Create README.md
README.md
ADDED
@@ -0,0 +1,237 @@
---
base_model:
- huihui-ai/Qwen3-8B-abliterated
tags:
- qwen
- '3'
- abliterated
- gptq
- int8
---
Model Card: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16

Model Overview

Model Name: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
Base Model: huihui-ai/Qwen3-8B-abliterated
Description: This is a quantized version of huihui-ai/Qwen3-8B-abliterated, an uncensored model derived from Qwen/Qwen3-8B. It has been quantized to GPTQ Int8 W8A16 for fast inference on NVIDIA RTX 3090 GPUs. The base model was abliterated with a novel, faster method to remove refusals, which makes this a proof-of-concept implementation of uncensored language model behavior.

Important Note: A newer version of the base model, huihui-ai/Huihui-Qwen3-8B-abliterated-v2, is available. Consider using the updated version for improved performance.

Quantization Details

Quantization Method: GPTQ Int8 W8A16 (8-bit weights, 16-bit activations)

Purpose: Optimized for high-speed inference on NVIDIA RTX 3090 GPUs, reducing the memory footprint while keeping output quality close to the unquantized model.

Impact: Faster inference than the unquantized model, which suits resource-constrained environments.

Model Size: 2.98B parameters as counted in the quantized checkpoint. This is lower than the 8B in the model name, most likely because the 8-bit weights are packed several to a 32-bit integer.

Tensor Types: I64, I32, F16
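
The 2.98B figure can be sanity-checked with a rough calculation. The sketch below is an estimate under stated assumptions, not a description of how the checkpoint was built: it assumes the parameter counts published for the upstream Qwen/Qwen3-8B model (about 8.2B total, 6.95B non-embedding), that only the non-embedding weights are quantized, and that four 8-bit weights share one 32-bit integer. Scales and other quantization metadata are ignored.

```python
# Rough estimate of how many tensor elements a packed W8 checkpoint stores.
# All constants are assumptions taken from the upstream Qwen3-8B model card,
# not values stated in this README.
TOTAL_PARAMS = 8.2e9           # assumed total parameter count
NON_EMBEDDING_PARAMS = 6.95e9  # assumed parameters outside the embedding layers
WEIGHTS_PER_INT32 = 32 // 8    # four 8-bit weights fit in one 32-bit integer


def packed_element_count(total: float, non_embedding: float, per_word: int) -> float:
    """Estimate stored elements when only non-embedding weights are packed."""
    embedding = total - non_embedding    # assumed to stay in 16-bit, unpacked
    return non_embedding / per_word + embedding


estimate = packed_element_count(TOTAL_PARAMS, NON_EMBEDDING_PARAMS, WEIGHTS_PER_INT32)
print(f"Estimated stored elements: {estimate / 1e9:.2f}B")
```

Under these assumptions the estimate lands close to the 2.98B reported above, which is consistent with the lower count coming from weight packing rather than from a smaller model.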

Usage

Using with vLLM

The model can be run with vLLM for efficient inference. The following Python example loads the model and starts an interactive chat loop:

from vllm import LLM, SamplingParams

# Define model ID
MODEL_ID = "groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16"

# Initialize the vLLM model
llm = LLM(
    model=MODEL_ID,
    dtype="float16",  # Matches the F16 tensors stored in the checkpoint
    trust_remote_code=True,
    quantization="gptq",  # Specify GPTQ quantization
    gpu_memory_utilization=0.9,  # Adjust based on your GPU memory
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=8192,
)

# Interactive chat loop: "/exit" quits, "/clear" resets the history
system_prompt = "You are a helpful assistant."
messages = [{"role": "system", "content": system_prompt}]

while True:
    user_input = input("User: ").strip()
    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = [{"role": "system", "content": system_prompt}]
        print("Chat history cleared. Starting a new conversation.")
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue

    # Append user input to messages
    messages.append({"role": "user", "content": user_input})

    # Generate a response; llm.chat applies the model's chat template to the messages
    outputs = llm.chat(messages, sampling_params)
    response = outputs[0].outputs[0].text.strip()

    # Print and append response
    print(f"Assistant: {response}")
    messages.append({"role": "assistant", "content": response})

Installation Requirements

To use the model with vLLM, ensure you have vLLM installed:

pip install vllm

Notes

The model is pre-quantized to GPTQ Int8 W8A16, so specify quantization="gptq" when initializing the LLM object.

Adjust gpu_memory_utilization based on your GPU's memory capacity to avoid out-of-memory errors.

The max_tokens parameter can be increased for longer responses, but longer generations take more time and more KV-cache memory.

The model is not deployed by any inference provider. For provider support, contact the repository maintainers on Hugging Face.
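
To make the memory notes above concrete, the sketch below estimates how much KV cache a single sequence needs and how much GPU memory vLLM is allowed to claim. The architecture values (36 layers, 8 key/value heads, head dimension 128) are assumptions taken from the upstream Qwen3-8B configuration, and the 24 GiB figure is the nominal memory of an RTX 3090; none of these numbers come from this README, and the helper names are hypothetical.

```python
# Back-of-the-envelope memory estimate for one chat sequence.
# Architecture constants are assumed from the upstream Qwen3-8B config.
NUM_LAYERS = 36        # assumed number of transformer layers
NUM_KV_HEADS = 8       # assumed key/value heads (grouped-query attention)
HEAD_DIM = 128         # assumed dimension per attention head
BYTES_PER_ELEMENT = 2  # 16-bit KV cache entries


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes of KV cache for one sequence of num_tokens tokens."""
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return per_token * num_tokens


def vllm_budget_gib(gpu_memory_gib: float, gpu_memory_utilization: float) -> float:
    """Memory vLLM may use for weights, activations, and KV cache combined."""
    return gpu_memory_gib * gpu_memory_utilization


print(f"KV cache for 8192 tokens: {kv_cache_bytes(8192) / 2**30:.3f} GiB")
print(f"vLLM budget on a 24 GiB GPU at 0.9: {vllm_budget_gib(24, 0.9):.1f} GiB")
```

Lowering gpu_memory_utilization shrinks the second number, which leaves less room for KV cache after the weights are loaded; raising max_tokens raises the first number for every long conversation.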

Performance

Pass Rate for Harmful Instructions

The pass rate is the proportion of harmful instructions that do not trigger a refusal, calculated as (total - triggered_total) / total, where total is the number of test instructions and triggered_total is the number that triggered a refusal. The test set is huihui-ai/harmbench_behaviors, and the evaluation script is TestPassed.py.
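
The formula can be written as a small function. This is a minimal sketch of the arithmetic only, not the TestPassed.py script, whose contents are not shown here; the function name and the input checks are illustrative.

```python
def pass_rate(total: int, triggered_total: int) -> float:
    """Fraction of instructions that did not trigger a refusal."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= triggered_total <= total:
        raise ValueError("triggered_total must be between 0 and total")
    return (total - triggered_total) / total


# Counts reported in this card: 195 of 320 passed means 125 refusals were triggered.
print(f"Qwen3-8B: {pass_rate(320, 125):.2%}")
print(f"Qwen3-8B-abliterated: {pass_rate(320, 0):.2%}")
```

With the counts from this card the function reproduces the reported ratios of 60.94% and 100.00%.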

Test Results:

Model: huihui-ai/Qwen3-8B-abliterated

Passed Total: 320/320

Passed Ratio: 1.00 (100.00%)

Comparison:

| Model | Passed Total | Passed Ratio |
|---|---|---|
| Qwen3-8B | 195/320 | 60.94% |
| Qwen3-8B-abliterated | 320/320 | 100.00% |

Note: The test provides a preliminary assessment. For comprehensive results, consider increasing the max_tokens value during evaluation.

Limitations

This model is a proof of concept that uses abliteration to remove refusals, which may lead to unpredictable behavior on certain inputs.

Quantization to GPTQ Int8 W8A16 favors speed and may introduce minor accuracy trade-offs compared to the unquantized model.

Users should verify outputs before using them in sensitive applications, as the model is uncensored and may generate harmful or inappropriate content.

References

Repository: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16

Base Model: huihui-ai/Qwen3-8B-abliterated

Original Model: Qwen/Qwen3-8B

Abliteration Method: remove-refusals-with-transformers

Test Set: huihui-ai/harmbench_behaviors

Newer Version: huihui-ai/Huihui-Qwen3-8B-abliterated-v2