grimjim posted an update Sep 29
I've uploaded abliteration code with support for sparsification of the refusal vector. It's poorly documented, but the code should be straightforward.
https://github.com/jim-plus/llm-abliteration
The code is built atop a fork that enabled abliteration to be performed on models loaded in 4-bit or 8-bit bitsandbytes quantization. TransformerLens is not required, just plain Transformers. For those previously unaware, this opens up abliteration experimentation to more people with local VRAM limitations.
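For anyone who hasn't loaded a bitsandbytes-quantized model through plain Transformers before, a minimal sketch looks something like the following. The model id is just a placeholder, and the actual abliteration entry points live in the linked repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization; matmuls run in bf16 compute dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "your/model-id"  # placeholder; any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPU/CPU memory
)
```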

Since performing abliteration on a quant involves precision and perplexity loss, it stands to reason that a small amount of magnitude sparsification could filter out some noise and possibly even reduce the damage inflicted on latent space via ablation of the refusal vector.
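As an illustration of what magnitude sparsification means here (a sketch of the idea, not the repo's actual function or API), one can zero out every component of the refusal direction except the largest-magnitude ones before ablating:

```python
import torch

def sparsify_by_magnitude(refusal_vec: torch.Tensor, keep_frac: float = 0.1) -> torch.Tensor:
    """Zero all but the top keep_frac fraction of components by absolute value.

    Illustrative only: the intent is to drop small (likely noisy) components
    of the refusal direction so ablation perturbs latent space less.
    """
    k = max(1, int(keep_frac * refusal_vec.numel()))
    top_idx = refusal_vec.abs().topk(k).indices  # indices of the k largest-magnitude entries
    sparse = torch.zeros_like(refusal_vec)
    sparse[top_idx] = refusal_vec[top_idx]
    return sparse
```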

There's a small but real acceleration of refusal-vector ablation from reordering the projection so the full d×d outer product is never materialized, reducing the per-matrix cost from O(d²×n) to O(d×n), and from pushing that computation layerwise to the GPU. The code is currently hardcoded for CUDA acceleration. Normalization of the refusal vector was deferred in order to allow sparsification. In principle, other behavior-vector interventions could also be added and explored.
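A rough sketch of that reordering, assuming a dense (dequantized) weight matrix of shape d×n and a refusal direction r; the function name and normalization handling are illustrative, not the repo's actual code:

```python
import torch

def ablate_refusal(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction r from weight matrix W (shape d x n).

    Instead of forming the d x d projector (r r^T) and multiplying by W,
    which costs O(d^2 * n), compute r^T W first (O(d * n)) and apply a
    rank-1 update (O(d * n)). Assumes r has already been normalized (or
    deliberately left unnormalized after sparsification, per the post).
    """
    W = W.to("cuda", non_blocking=True)   # push this layer's computation to GPU
    r = r.to("cuda", dtype=W.dtype)
    coeffs = r @ W                        # shape (n,): projection of each column onto r
    return W - torch.outer(r, coeffs)     # rank-1 update; never materializes r r^T
```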

Wait, does that allow us to do norm-preserving biprojected abliteration on models ourselves? And does it work with mxfp4?


The code always keeps changing, but it appears that, as of August, Transformers will try to autoconvert parts to 64-bit floating point, which can blow past a Colab VRAM budget. My GPU doesn't support mxfp4, so I can't provide a definitive answer from experience.

The --deccp option has made my day; I can't stop laughing at the absurdity.


That one I inherited from the codebase I forked for my experiments.
