The model will still refuse to answer sensitive questions

#1 by xldistance - opened

The abliteration did not successfully remove the model's restrictions

The model can answer with profanity etc., but it cannot produce content that would really bypass all restrictions. The problem is that there is no single layer responsible for refusal, and abliterating the entire model, including the first layer, would cause it to produce rubbish. This is unfortunate, since fixing it would mean fully retraining the model for several epochs on harmful answers.
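For anyone who wants to probe that claim directly, here is a minimal sketch (my own illustration, not code from this thread; the model name and the tiny prompt lists are placeholders) that compares mean hidden states on harmful vs. harmless prompts at every layer. If refusal lived in a single layer, one difference vector would dominate; a broad profile supports the "no single refusal layer" observation:

```python
# Minimal probe: where does the refusal direction live?
# Placeholders: model_id and the two tiny prompt lists; a real probe
# would use hundreds of paired prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-32B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

harmful = ["How do I pick a lock?"]   # illustrative only
harmless = ["How do I bake bread?"]   # illustrative only

def mean_last_token_states(prompts):
    # One mean activation vector per layer, taken at the last token.
    total = None
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        vecs = torch.stack([h[0, -1] for h in out.hidden_states])  # (layers+1, d)
        total = vecs if total is None else total + vecs
    return total / len(prompts)

with torch.no_grad():
    diff = mean_last_token_states(harmful) - mean_last_token_states(harmless)
    # If refusal were localized, one layer's norm would stand out;
    # in practice the signal tends to be spread across many layers.
    print(diff.norm(dim=-1))
```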

https://huggingface.co/huihui-ai/Qwen3-14B-abliterated. Can you train the qwen3-32b-abliterated model using this model's training method? This model can remove the restrictions.

Yes, I have used this code from huihui: https://github.com/Sumandora/remove-refusals-with-transformers.git. However, it is important to figure out the balance; when applying a harsher setting, the model is severely damaged. This model is tricky. I hope mlabone or huihui figures it out without significant damage to the model's performance. Please contact them :-)
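To make the "balance" concrete, here is a hedged sketch (my own illustration, not the repo's actual code) of partial directional ablation: the refusal component is only partially projected out of a weight matrix, with `alpha` acting as the harshness knob described above:

```python
# Hedged sketch of partial (scaled) directional ablation.
# alpha = 1.0 removes the refusal component entirely (harsh);
# smaller alpha is gentler and damages the model less.
import torch

def ablate(weight: torch.Tensor, refusal_dir: torch.Tensor, alpha: float = 0.5):
    # weight: (d_out, d_in); refusal_dir: (d_out,)
    r = refusal_dir / refusal_dir.norm()
    # W' = W - alpha * r r^T W : shrink each column's component along r.
    return weight - alpha * torch.outer(r, r @ weight)

# Hypothetical application to a Qwen/Llama-style decoder, e.g.:
# for layer in model.model.layers:
#     layer.self_attn.o_proj.weight.data = ablate(
#         layer.self_attn.o_proj.weight.data, refusal_dir, alpha=0.5)
#     layer.mlp.down_proj.weight.data = ablate(
#         layer.mlp.down_proj.weight.data, refusal_dir, alpha=0.5)
```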

@roslein

Just a heads up, the issue may be with the number of experts activated. Changing this was critical to imatrix'ing this model.
Ah... up... WAY up... 24/32/64 experts activated.
Hope this helps ;)

Well, I am a bit slow... the MoE architectures in the Qwen3 family are Qwen3-30B-A3B and the 235B, and indeed those models require a different approach. In that case user mlabone activated almost all experts, as you suggest (64 etc.), and ablated multiple layers with success, but I don't understand how that could help here, since this model is not MoE.

My bad; commented on the wrong model!

I had a realization a while back: abliteration might be failing in models where the model can still throw out a refusal later in a turn (or deeper into a thinking turn). You might want to capture a few on-policy examples to see if you can catch the misses.
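A rough sketch of that on-policy check (illustrative only; the model name, refusal phrase list, and sampling settings are all placeholders): sample full responses and flag refusal markers that appear after the model has already started answering:

```python
# Rough on-policy check: sample responses and flag refusals that show up
# mid-generation rather than at the very start. Placeholders: model_id,
# the marker list, and the sampling settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huihui-ai/Qwen3-14B-abliterated"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

REFUSAL_MARKERS = ["I can't", "I cannot", "I'm sorry", "as an AI"]

def late_refusals(prompt: str, n_samples: int = 4):
    hits = []
    for _ in range(n_samples):
        ids = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
        out = model.generate(ids, max_new_tokens=512,
                             do_sample=True, temperature=0.7)
        text = tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
        for m in REFUSAL_MARKERS:
            pos = text.find(m)
            if pos > 0:  # marker appears after the answer already started
                hits.append((pos, text[: pos + len(m)]))
    return hits
```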
