---
base_model:
- huihui-ai/Qwen3-8B-abliterated
tags:
- qwen
- '3'
- abliterated
- gptq
- int8
---

# Model Card: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16

## Model Overview

- **Model Name:** groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- **Base Model:** huihui-ai/Qwen3-8B-abliterated
- **Description:** A quantized version of the uncensored huihui-ai/Qwen3-8B-abliterated model, which is itself derived from Qwen/Qwen3-8B. The weights have been quantized to GPTQ Int8 W8A16 (8-bit weights, 16-bit activations) for fast inference on NVIDIA RTX 3090 GPUs. Abliteration was performed with a novel, faster method for removing refusals, making this a proof-of-concept implementation of uncensored language-model behavior.

**Important Note:** A newer version, huihui-ai/Huihui-Qwen3-8B-abliterated-v2, is available. Consider using the updated version for improved performance.

## Quantization Details

- **Quantization Method:** GPTQ Int8 W8A16 (8-bit weights, 16-bit activations); an illustrative sketch of how such an export can be produced follows this list.
- **Purpose:** Optimized for high-speed inference on NVIDIA RTX 3090 GPUs, reducing the memory footprint while maintaining performance.
- **Impact:** Faster inference than the unquantized model, making it suitable for resource-constrained environments.
- **Model Size:** 2.98B parameters (the count reported for the packed quantized checkpoint; the base model has roughly 8B parameters)
- **Tensor Types:** I64, I32, F16
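
This card does not state which tool was used to produce the quantized export. Purely as an illustration, the sketch below shows how a comparable GPTQ W8A16 checkpoint could be created with the llm-compressor library; the library choice, calibration dataset, and hyperparameters are assumptions, not the method actually used for this repository.

```python
# Hypothetical sketch only: the tooling used for this checkpoint is not documented in the card.
# Shows one common way to produce a GPTQ W8A16 (8-bit weight, 16-bit activation) export.
# Import paths can differ between llm-compressor versions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",    # quantize all Linear layers...
    scheme="W8A16",      # ...to 8-bit weights with 16-bit activations
    ignore=["lm_head"],  # keep the output head in higher precision
)

oneshot(
    model="huihui-ai/Qwen3-8B-abliterated",
    dataset="open_platypus",                       # illustrative calibration dataset
    recipe=recipe,
    output_dir="Qwen3-8B-abliterated-GPTQ-W8A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```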

## Usage

### Using with vLLM

The model can be used with vLLM for efficient inference. Below is an example of how to set up and run the model with vLLM in Python:

```python
from vllm import LLM, SamplingParams

# Define model ID
MODEL_ID = "groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16"

# Initialize the vLLM model
llm = LLM(
    model=MODEL_ID,
    dtype="bfloat16",            # use bfloat16 for compatibility with GPTQ quantization
    trust_remote_code=True,
    quantization="gptq",         # the checkpoint is pre-quantized with GPTQ
    gpu_memory_utilization=0.9,  # adjust based on your GPU memory
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=8192,
    stop=["/exit"],  # custom stop string for the interactive loop
)

# Interactive chat loop
system_prompt = "You are a helpful assistant."
messages = [{"role": "system", "content": system_prompt}]

while True:
    user_input = input("User: ").strip()
    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = [{"role": "system", "content": system_prompt}]
        print("Chat history cleared. Starting a new conversation.")
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue

    # Append user input to the message history
    messages.append({"role": "user", "content": user_input})

    # Format a plain-text prompt for vLLM
    prompt = "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)

    # Generate a response
    outputs = llm.generate([prompt], sampling_params)
    response = outputs[0].outputs[0].text.strip()

    # Print the response and append it to the history
    print(f"Assistant: {response}")
    messages.append({"role": "assistant", "content": response})
```
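
Note that the loop above joins messages into a plain "role: content" string rather than applying the Qwen3 chat template. On recent vLLM releases that provide `LLM.chat`, you can let vLLM apply the model's own chat template instead; a minimal sketch (the exact availability and signature of `chat` depends on your vLLM version):

```python
# Minimal sketch: let vLLM apply the model's chat template instead of hand-formatting the prompt.
# Assumes a recent vLLM release that provides LLM.chat.
from vllm import LLM, SamplingParams

llm = LLM(model="groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16", quantization="gptq")
sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what GPTQ W8A16 quantization does."},
]

# chat() applies the tokenizer's chat template before generating
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```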

### Installation Requirements

To use the model with vLLM, ensure you have vLLM installed:

```bash
pip install vllm
```

### Notes

- The model is pre-quantized to GPTQ Int8 W8A16, so specify `quantization="gptq"` when initializing the `LLM` object.
- Adjust `gpu_memory_utilization` based on your GPU's memory capacity to avoid out-of-memory errors.
- The `max_tokens` parameter can be increased for longer responses, but this may impact latency and memory use; the sketch after this list shows both knobs together.
- The model is not deployed by any inference provider. For provider support, contact the repository maintainers on Hugging Face.
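
As a rough illustration of tuning those two settings together (the specific values below are hypothetical; tune them to your hardware and context-length needs):

```python
# Illustrative values only; tune to your GPU and workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16",
    quantization="gptq",
    gpu_memory_utilization=0.85,  # lower this if you hit CUDA out-of-memory errors
    max_model_len=16384,          # capping the context window also reduces KV-cache memory
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=4096,  # longer responses are possible, at the cost of latency and memory
)
```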

## Performance

### Pass Rate for Harmful Instructions

The pass rate measures the proportion of harmful instructions that do not trigger a refusal, calculated as `(total - triggered_total) / total`. The test set is sourced from huihui-ai/harmbench_behaviors and evaluated using TestPassed.py.
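
For concreteness, a minimal sketch of that calculation using the numbers reported below:

```python
# Minimal sketch of the pass-rate calculation described above.
total = 320          # number of harmful-instruction prompts in the test set
triggered_total = 0  # prompts that triggered a refusal (0 for the abliterated model)

passed = total - triggered_total
pass_ratio = passed / total
print(f"Passed: {passed}/{total} ({pass_ratio:.2%})")  # Passed: 320/320 (100.00%)
```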

**Test Results:**

- Model: huihui-ai/Qwen3-8B-abliterated
- Passed Total: 320/320
- Passed Ratio: 1.00 (100.00%)

**Comparison:**

| Model | Passed Total | Passed Ratio |
|---|---|---|
| Qwen3-8B | 195/320 | 60.94% |
| Qwen3-8B-abliterated | 320/320 | 100.00% |

Note: This test provides only a preliminary assessment. For more comprehensive results, consider increasing the `max_tokens` value during evaluation.

## Limitations

- This model is a proof of concept: abliteration removes refusals, which may lead to unpredictable behavior on certain inputs.
- Quantization to GPTQ Int8 W8A16 may introduce minor quality trade-offs compared to the unquantized model, even though it is optimized for speed.
- Users should verify outputs for sensitive applications, as the model is uncensored and may generate harmful or inappropriate content.

## References

- Repository: groxaxo/Qwen3-8B-abliterated-GPTQ-W8A16
- Base Model: Qwen/Qwen3-8B
- Abliteration Method: remove-refusals-with-transformers
- Test Set: huihui-ai/harmbench_behaviors
- Newer Version: huihui-ai/Huihui-Qwen3-8B-abliterated-v2