The model will still refuse to answer sensitive questions

#1 by xldistance - opened

The abliteration did not successfully remove the model's restrictions

The model can answer with profanity etc., but it cannot produce content that would really bypass all restrictions. The problem is that there is no single layer responsible for refusal, and abliterating the entire model, including the first layer, would cause it to produce rubbish. This is unfortunate, since fixing it would mean fully retraining the model for several epochs on harmful answers.
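For anyone who wants to probe that claim directly, here is a minimal sketch (my own illustration, not code from this thread; the model name and the tiny prompt lists are placeholders) that compares mean hidden states on harmful vs. harmless prompts at every layer. If refusal lived in a single layer, one difference vector would dominate; a broad profile supports the "no single refusal layer" observation:

```python
# Minimal probe: where does the refusal direction live?
# Placeholders: model_id and the two tiny prompt lists; a real probe
# would use hundreds of paired prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-32B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

harmful = ["How do I pick a lock?"]   # illustrative only
harmless = ["How do I bake bread?"]   # illustrative only

def mean_last_token_states(prompts):
    # One mean activation vector per layer, taken at the last token.
    total = None
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        vecs = torch.stack([h[0, -1] for h in out.hidden_states])  # (layers+1, d)
        total = vecs if total is None else total + vecs
    return total / len(prompts)

with torch.no_grad():
    diff = mean_last_token_states(harmful) - mean_last_token_states(harmless)
    # If refusal were localized, one layer's norm would stand out;
    # in practice the signal tends to be spread across many layers.
    print(diff.norm(dim=-1))
```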

https://huggingface.co/huihui-ai/Qwen3-14B-abliterated. Can you train the qwen3-32b-abliterated model using this model's training method? This model can remove the restrictions.

Yes, I have used this code from huihui: https://github.com/Sumandora/remove-refusals-with-transformers.git. However, it is important to figure out the balance; when applying a harsher setting, the model is severely damaged. This model is tricky. I hope mlabone or huihui figures it out without significant damage to the model's performance. Please contact them :-)
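To make the "balance" concrete, here is a hedged sketch (my own illustration, not the repo's actual code) of partial directional ablation: the refusal component is only partially projected out of a weight matrix, with `alpha` acting as the harshness knob described above:

```python
# Hedged sketch of partial (scaled) directional ablation.
# alpha = 1.0 removes the refusal component entirely (harsh);
# smaller alpha is gentler and damages the model less.
import torch

def ablate(weight: torch.Tensor, refusal_dir: torch.Tensor, alpha: float = 0.5):
    # weight: (d_out, d_in); refusal_dir: (d_out,)
    r = refusal_dir / refusal_dir.norm()
    # W' = W - alpha * r r^T W : shrink each column's component along r.
    return weight - alpha * torch.outer(r, r @ weight)

# Hypothetical application to a Qwen/Llama-style decoder, e.g.:
# for layer in model.model.layers:
#     layer.self_attn.o_proj.weight.data = ablate(
#         layer.self_attn.o_proj.weight.data, refusal_dir, alpha=0.5)
#     layer.mlp.down_proj.weight.data = ablate(
#         layer.mlp.down_proj.weight.data, refusal_dir, alpha=0.5)
```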

@roslein

Just a heads up, the issue may be with the number of experts activated. Changing this was critical to imatrix'ing this model.
Ah... up... WAY up... 24/32/64 experts activated.
Hope this helps ;)

Well, I am a bit slow... the MoE architectures in the Qwen3 family are Qwen3-30B-A3B and the 235B, and indeed those models require a different approach. In that case user mlabone activated almost all experts, as you suggest (64 etc.), and ablated multiple layers with success, but I don't understand how that could help here, since this model is not MoE.

My bad; commented on the wrong model!

I had a realization a while back: abliteration might be failing in models where the model can still throw out a refusal later in a turn (or deeper into a thinking turn). You might want to capture a few on-policy examples to see if you can catch the misses.
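A rough sketch of that on-policy check (illustrative only; the model name, refusal phrase list, and sampling settings are all placeholders): sample full responses and flag refusal markers that appear after the model has already started answering:

```python
# Rough on-policy check: sample responses and flag refusals that show up
# mid-generation rather than at the very start. Placeholders: model_id,
# the marker list, and the sampling settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huihui-ai/Qwen3-14B-abliterated"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

REFUSAL_MARKERS = ["I can't", "I cannot", "I'm sorry", "as an AI"]

def late_refusals(prompt: str, n_samples: int = 4):
    hits = []
    for _ in range(n_samples):
        ids = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
        out = model.generate(ids, max_new_tokens=512,
                             do_sample=True, temperature=0.7)
        text = tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
        for m in REFUSAL_MARKERS:
            pos = text.find(m)
            if pos > 0:  # marker appears after the answer already started
                hits.append((pos, text[: pos + len(m)]))
    return hits
```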
