---
base_model:
- huihui-ai/Qwen3-8B-abliterated
tags:
- qwen
- '3'
- abliterated
- gptq
- int8
---

# Model Card: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16

## Model Overview

- **Model Name:** groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- **Base Model:** huihui-ai/Qwen3-8B-abliterated
- **Description:** A quantized version of the uncensored huihui-ai/Qwen3-8B-abliterated model, which is itself derived from Qwen/Qwen3-8B. The weights have been quantized to GPTQ Int8 W8A16 (8-bit weights, 16-bit activations) for fast inference on NVIDIA RTX 3090 GPUs. Abliteration was performed with a novel, faster method for removing refusals, making this a proof-of-concept implementation of uncensored language-model behavior.

**Important Note:** A newer version, huihui-ai/Huihui-Qwen3-8B-abliterated-v2, is available. Consider using the updated version for improved performance.

## Quantization Details

- **Quantization Method:** GPTQ Int8 W8A16 (8-bit weights, 16-bit activations); an illustrative sketch of how such an export can be produced follows this list.
- **Purpose:** Optimized for high-speed inference on NVIDIA RTX 3090 GPUs, reducing the memory footprint while maintaining performance.
- **Impact:** Faster inference than the unquantized model, making it suitable for resource-constrained environments.
- **Model Size:** 2.98B parameters (the count reported for the packed quantized checkpoint; the base model has roughly 8B parameters)
- **Tensor Types:** I64, I32, F16
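
This card does not state which tool was used to produce the quantized export. Purely as an illustration, the sketch below shows how a comparable GPTQ W8A16 checkpoint could be created with the llm-compressor library; the library choice, calibration dataset, and hyperparameters are assumptions, not the method actually used for this repository.

```python
# Hypothetical sketch only: the tooling used for this checkpoint is not documented in the card.
# Shows one common way to produce a GPTQ W8A16 (8-bit weight, 16-bit activation) export.
# Import paths can differ between llm-compressor versions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",    # quantize all Linear layers...
    scheme="W8A16",      # ...to 8-bit weights with 16-bit activations
    ignore=["lm_head"],  # keep the output head in higher precision
)

oneshot(
    model="huihui-ai/Qwen3-8B-abliterated",
    dataset="open_platypus",                       # illustrative calibration dataset
    recipe=recipe,
    output_dir="Qwen3-8B-abliterated-GPTQ-W8A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```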

## Usage

### Using with vLLM

The model can be used with vLLM for efficient inference. Below is an example of how to set up and run the model with vLLM in Python:

```python
from vllm import LLM, SamplingParams

# Define model ID
MODEL_ID = "groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16"

# Initialize the vLLM model
llm = LLM(
    model=MODEL_ID,
    dtype="bfloat16",            # use bfloat16 for compatibility with GPTQ quantization
    trust_remote_code=True,
    quantization="gptq",         # the checkpoint is pre-quantized with GPTQ
    gpu_memory_utilization=0.9,  # adjust based on your GPU memory
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=8192,
    stop=["/exit"],  # custom stop string for the interactive loop
)

# Interactive chat loop
system_prompt = "You are a helpful assistant."
messages = [{"role": "system", "content": system_prompt}]

while True:
    user_input = input("User: ").strip()
    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = [{"role": "system", "content": system_prompt}]
        print("Chat history cleared. Starting a new conversation.")
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue

    # Append user input to the message history
    messages.append({"role": "user", "content": user_input})

    # Format a plain-text prompt for vLLM
    prompt = "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)

    # Generate a response
    outputs = llm.generate([prompt], sampling_params)
    response = outputs[0].outputs[0].text.strip()

    # Print the response and append it to the history
    print(f"Assistant: {response}")
    messages.append({"role": "assistant", "content": response})
```
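
Note that the loop above joins messages into a plain "role: content" string rather than applying the Qwen3 chat template. On recent vLLM releases that provide `LLM.chat`, you can let vLLM apply the model's own chat template instead; a minimal sketch (the exact availability and signature of `chat` depends on your vLLM version):

```python
# Minimal sketch: let vLLM apply the model's chat template instead of hand-formatting the prompt.
# Assumes a recent vLLM release that provides LLM.chat.
from vllm import LLM, SamplingParams

llm = LLM(model="groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16", quantization="gptq")
sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what GPTQ W8A16 quantization does."},
]

# chat() applies the tokenizer's chat template before generating
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```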

### Installation Requirements

To use the model with vLLM, ensure you have vLLM installed:

```bash
pip install vllm
```

### Notes

- The model is pre-quantized to GPTQ Int8 W8A16, so specify `quantization="gptq"` when initializing the `LLM` object.
- Adjust `gpu_memory_utilization` based on your GPU's memory capacity to avoid out-of-memory errors.
- The `max_tokens` parameter can be increased for longer responses, but this may impact latency and memory use; the sketch after this list shows both knobs together.
- The model is not deployed by any inference provider. For provider support, contact the repository maintainers on Hugging Face.
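
As a rough illustration of tuning those two settings together (the specific values below are hypothetical; tune them to your hardware and context-length needs):

```python
# Illustrative values only; tune to your GPU and workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16",
    quantization="gptq",
    gpu_memory_utilization=0.85,  # lower this if you hit CUDA out-of-memory errors
    max_model_len=16384,          # capping the context window also reduces KV-cache memory
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=4096,  # longer responses are possible, at the cost of latency and memory
)
```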

## Performance

### Pass Rate for Harmful Instructions

The pass rate measures the proportion of harmful instructions that do not trigger a refusal, calculated as `(total - triggered_total) / total`. The test set is sourced from huihui-ai/harmbench_behaviors and evaluated using TestPassed.py.
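
For concreteness, a minimal sketch of that calculation using the numbers reported below:

```python
# Minimal sketch of the pass-rate calculation described above.
total = 320          # number of harmful-instruction prompts in the test set
triggered_total = 0  # prompts that triggered a refusal (0 for the abliterated model)

passed = total - triggered_total
pass_ratio = passed / total
print(f"Passed: {passed}/{total} ({pass_ratio:.2%})")  # Passed: 320/320 (100.00%)
```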

**Test Results:**

- Model: huihui-ai/Qwen3-8B-abliterated
- Passed Total: 320/320
- Passed Ratio: 1.00 (100.00%)

**Comparison:**

| Model | Passed Total | Passed Ratio |
|---|---|---|
| Qwen3-8B | 195/320 | 60.94% |
| Qwen3-8B-abliterated | 320/320 | 100.00% |

Note: This test provides only a preliminary assessment. For more comprehensive results, consider increasing the `max_tokens` value during evaluation.

## Limitations

- This model is a proof of concept: abliteration removes refusals, which may lead to unpredictable behavior on certain inputs.
- Quantization to GPTQ Int8 W8A16 may introduce minor quality trade-offs compared to the unquantized model, even though it is optimized for speed.
- Users should verify outputs for sensitive applications, as the model is uncensored and may generate harmful or inappropriate content.

## References

- Repository: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- Base Model: Qwen/Qwen3-8B
- Abliteration Method: remove-refusals-with-transformers
- Test Set: huihui-ai/harmbench_behaviors
- Newer Version: huihui-ai/Huihui-Qwen3-8B-abliterated-v2