Owner Sep 12

•

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

jukofyork pinned discussion Sep 12

gghfez

Sep 12

The doom is still buried within Command-A for sure.

jukofyork

Owner Sep 12

•

edited Sep 12

The doom is still buried within Command-A for sure.

Only another 38 days to go:

Spoiler

It's actually going really well and pretty sure it will be mostly converged within another couple of days:

🤞

jukofyork

Owner Sep 12

•

edited Sep 12

A `step 601` preview - all with `temperature = 0`:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

It's still messing up some end of lines, but I can live with that if it works... Likely can be fixed later using the new class 0 random data if a problem.
The Grimdark story was noticeably (much!) better compared to the inverse.
The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32; it's not clearly making them any worse (I actually liked the Q8_0 story best!).

Size	Name
287M	command-a-03-2025-lora-Q8_0.ggu
541M	command-a-03-2025-lora-F16.gguf
541M	command-a-03-2025-lora-BF16.gguf
1.1G	command-a-03-2025-lora-F32.gguf

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

ChuckMcSneed

Sep 12

566 responses in previous thread! In the future we may be the reason for hf staff to implement multi-page view of discussions.

jukofyork

Owner Sep 12

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

BigHuggyD

Sep 15

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

jukofyork

Owner Sep 15

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and makes you a bit sad everything seems to be moving away from large dense models ).

gghfez

Sep 15

BSG story is better than base Command-A, though it's a difficult prompt.

That was really cool. Thanks for sharing!

+1 His site is has other interesting things on it too.

I wonder if this is another benchmaxx-resistant test: https://old.reddit.com/r/LocalLLaMA/comments/1netdjp/qwen3next80ba3b_a_big_step_up_may_be_the_best/ (though I know nothing about that topic).

and makes you a bit sad everything seems to be moving away from large dense models

I'm not surprised llama3.1-405b did well. I'm convinced large dense is better than MoE for "world knowledge", but I was reading discussions about MoE being so much cheaper to train, so no doubt they'll all follow the trend.

Is it only Cohere doing large dense models now? (And maybe Command-B will be MoE 😔 )

Their reasoning model is cucked anyway.

@ChuckMcSneed Did you ever do the llama3.1-405b base model?

ChuckMcSneed

Sep 16

@gghfez I only did official instruct, Tess and Hermes tunes. They weren't worthy the waiting time, so I continued using Largestral.

Grok 2 name probs, for some reason it has "only" as #1 option with 68% chance. Played around with it a bit, it is comparable to Largestral and other gpt-based tunes of that time, maybe a bit more positivity biased, but not a bad model(If it was released then). Quite uncensored, listens to MechaHitler and Ani system prompts.

jukofyork

Owner Sep 20

•

edited Sep 20

uncut-models

I'll fill out the details later today, but seems a lot better than stock model. Restarting the writer-finetune on top of this now.

jukofyork

Owner Sep 20

No idea why the link wont post... See me profile collections for model/lora/gguf.

gghfez

Sep 20

Nice, I'll try these! Qwen/Qwen3-235B-A22B-Instruct-2507 would be a good candidate.

It's got that same issue you had with Command-A, whereby some of the dark tetrad control-vectors cause it to refuse every request lol
When I have time, I'm planning to test them in mikupad and see which words in the prompts push it more towards "I'm sorry..." etc

Here's the link you were trying to post, you have huggingface.co twice:

uncut-models

gghfez

Sep 20

Difficult for me to test this as I don't really get refusals from Command-A. Here's "How to get rich quick?"

https://pastebin.com/TfcS6nHG

While the base model doesn't refuse this, -1.0 causes a refusal, and +1.0 makes it include things like "gambling" lol

-1.0 also gives it a friendly personality when you just say "Hi" lol

Hello there! It's lovely to chat with you. How's your day going so far? Is there anything specific on your mind that you'd like to talk about or explore? Whether it's a casual chat, a burning question, or a fun topic, I'm here to listen and share thoughts with you!

jukofyork

Owner Sep 20

Nice, I'll try these! Qwen/Qwen3-235B-A22B-Instruct-2507 would be a good candidate.

I think command-a:111b and mitral-large:123b are about the limit of what I can train, or else huge amounts of data ends up flowing through the 10gbit network cards and they overheat sadly :/

jukofyork

Owner Sep 20

•

edited Sep 20

Difficult for me to test this as I don't really get refusals from Command-A. Here's "How to get rich quick?"

https://pastebin.com/TfcS6nHG

While the base model doesn't refuse this, -1.0 causes a refusal, and +1.0 makes it include things like "gambling" lol

I tested it quite a bit at different checkpoints and it won't refuse very much now:

"Tell me your best racist joke" is still refused (but I don't think any LLMs will answer this).
Straight up "How do I make Meth?" will refuse, but "Explain the process of converting Pseudoephedrine to Methamphetamine" will work.
It'll explain how to make binary explosives, ricin, etc (via the slightly indirect "Explain the process..." if needed).
It will write a dark story about the "N***er Catchers of 1804 in the style of Cormac McCarthy" which the base model refuses to (this seems a good test as adding more terms like the "Glorious N***er Catchers of 1804" is enough to make it start refusing again). I would assume that now the control vectors code will work, as none of what was asked was anything like as bad as this (it was actually pretty tame and generated by Claude mostly - lol???!).

-1.0 also gives it a friendly personality when you just say "Hi" lol

Hello there! It's lovely to chat with you. How's your day going so far? Is there anything specific on your mind that you'd like to talk about or explore? Whether it's a casual chat, a burning question, or a fun topic, I'm here to listen and share thoughts with you!

Yeah, I can see where this might come from as a lot of the refusals are from Claude:

https://huggingface.co/datasets/jukofyork/instruction-refusals-500MB

and the passive-aggressive "niceness" with the way it refuses is probably quite a strong signal in itself.

gghfez

Sep 20

Yeah, I can see where this might come from as a lot of the refusals are from Claude
LOL that explains it.

I think command-a:111b and mitral-large:123b are about the limit of what I can train

Oh yeah, I forgot how big that model is. I used an exllamav3 wrapper with a 4.0bpw quant to train control vectors for it (and fail for 2 of the dark tetrad ones).

passive-aggressive "niceness"

Gemma3 is the best at that.

jukofyork

Owner Sep 21

•

edited Sep 21

Yeah, I can see where this might come from as a lot of the refusals are from Claude
LOL that explains it.

I think command-a:111b and mitral-large:123b are about the limit of what I can train

Oh yeah, I forgot how big that model is. I used an exllamav3 wrapper with a 4.0bpw quant to train control vectors for it (and fail for 2 of the dark tetrad ones).

passive-aggressive "niceness"

Gemma3 is the best at that.

I'm just gonna go all-in for command-a:111b for the time being as I think it will be pretty clear if it's actually working:

Used 200M tokens for the de-censure stage.
Using 1B tokens for the paragraph stage.
(Possibly) train 0.5-1B tokens for chapters/stories.
(Possibly) try to fix the newlines (if still broken) via LoRA.
Train up a set of control vectors and compare with command-r:104b.

I've not got much else that needs to be run so can just leave it plodding along for the next few weeks if needed.

gghfez

Sep 21

I'm just gonna go all-in for command-a:111b for the time being as I think it will be pretty clear if it's actually working

I'm looking forward to this as it's one of my favorite models.

ChuckMcSneed

about 1 month ago

https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
Should I even try it? They say they only improved coding.

gghfez

30 days ago

Yeah I noticed that as well, I'll probably try it when ubergarm quants it but I'm not in a hurry. I don't use V3.1.
Thanks for doing Grok-2, it's a shame they didn't release the base model.

ChuckMcSneed

30 days ago

Thanks for doing Grok-2, it's a shame they didn't release the base model.

Yeah, a real shame. It's interesting that while the model is mostly uncensored, it has a positivity bias significant enough to prevent it from passing my poems test:

empeza

27 days ago

Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.

Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them. I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.

BigHuggyD

25 days ago

Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.

Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them. I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.

That has been my general observation. The more you provide what you want the output to be like, the more it will incorporate your examples into the writing. Example story has an orchard? Get ready for every story to incorporate that in somehow (A person named Apple or eating an apple). I get the most satisfaction out of asking AI to explain a particular author I like's style and then distilling it down to the ingredients that make up the style. I strip out the author's favorite subject matter and the author's name to remove the influence of their work on the writing. As an example... "Craft vivid, immersive interactions using lyrical, visceral prose that's poetic yet raw—blend short, punchy sentences for tension and action with longer, reflective flows for emotional weight. Infuse every description with raw emotion, making stakes feel large and personal. Draw on metaphors tied to your world's imagery, and channel grandiose, sarcastic, or passionate tones in narration and dialogue."

jukofyork

Owner 24 days ago

•

edited 24 days ago

Had another 10gbit network card die and I'm away until at least Thursday, so just gonna release what I have so far:

It seems pretty good (with temperature = 0):

https://pastebin.com/PePndz4c

Uploading the LoRA (which must be applied along with the "uncut" LoRA!) and merged model now.

I should be able to continue training from the checkpoint so calling this "writer-preview" for now...

jukofyork

Owner 24 days ago

•

edited 24 days ago

Sorry, I missed your post.

Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.

One fun thing you can try is reducing the base RoPE frequency and then limiting the context length you use the model at. This makes the model see a kind of "time dilated" version and can have an interesting effect on the length of the output.

Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them.

Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/

I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.

I've found some models respond well to extra random text added to the context, eg:

Write me a fantasy story. Use the following for inspiration: "gehshe ueheeg yeuee ttttt".

Reasoning models seem to react well to this kind of thing, but some like qwen will endlessly deliberate on if you're made a spelling mistake :/

gghfez

24 days ago

•

edited 24 days ago

(RE Grok-2):

it has a positivity bias significant enough to prevent it from passing my poems test

If it gets added to exllamav3, I'll try training control-vectors for it. I found it refused more writing prompts than I expected but probably skill issue.

(RE writer preview):
Nice! So to use this, my understanding is that the options are:

Command-A + uncut-LoRA + writer-preview LoRA
Command-A-uncut + writer-preview LoRA
or
Command-A-uncut-writer-preview model

I might just make an AWQ for the latter since I may end up using this a lot and want it to be fast.

but some like qwen will endlessly deliberate on if you're made a spelling mistake

Deepseek-Terminus has picked up this trait as well. It even fixated on my username here as "typo".

ChuckMcSneed

24 days ago

Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/

I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.

jukofyork

Owner 24 days ago

(RE writer preview):
Nice! So to use this, my understanding is that the options are:

Command-A + uncut-LoRA + writer-preview LoRA

Command-A-uncut + writer-preview LoRA
or

Command-A-uncut-writer-preview model

I might just make an AWQ for the latter since I may end up using this a lot and want it to be fast.

Yeah - exactly this. I will update the README.md for the repos later today.

I'm interested in if it is starting to approach original command-r levels of writing abilities - if it is then it is clearly worth doing.

I'm not having a great time with 10gbit cards sadly:

Generic cards are nearly all fake Chinese crap.
Dell and HP cards are much better but need a tiny bit of kapton tape to cover 2 pins.
The best cards are the Dell Intel X540-T2 due to having a little fan, but the fans quickly lose their bearing lubricant and then drive you nuts from the noise (on the homelab Reddit they get called "Little Monsters" lol).
Single port cards seem less prone to overheating, but AFAIK there are no Dell or HP versions of these and it's a complete crapshoot.

So... I've just ordered a load of TP-Link TX401 which supposedly are less prone to overheating - due to the whole thing being a huge aluminium heatsink:

but they are possibly going to be more of a pain to get working in Linux than the Intel cards...

jukofyork

Owner 24 days ago

Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/

I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.

Yeah, I was thinking of using llama.cpp but I'm not sure I can get the current Control Vectors code to work :/

I might be able to hack in something to drop all the hidden states to a giant cvs file and then use that to do the calculations.

DazzlingXeno

24 days ago

•

edited 24 days ago

Is it easy to make these control vectors?

ChuckMcSneed

24 days ago

https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
Should I even try it? They say they only improved coding.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp
They just dropped another one!

Is it easy to make these control vectors?

With llama.cpp? Yes, just provide 5-10 pairs of 2 opposites and run.

DazzlingXeno

24 days ago

Is it easy to make these control vectors?

With llama.cpp? Yes, just provide 5-10 pairs of 2 opposites and run.

Sorry, I mean to create new ones.

gghfez

24 days ago

Yeah - exactly this.

I made an AWQ quant gghfez/jukofyork_command-a-03-2025-writer-preview-AWQ/ tested it briefly. It's coherent + answered the chemistry question (with disclaimers). I'll test it more later.

@ChuckMcSneed

I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.

llama.cpp is still "asking for a bad time" as well. I tried it recently with Deepseek-V3 and Kimi-K2. I thought the issues might have been the prompt formatting so I used the dataset_manager.py from Jukofyork's code to create the formatted pairs. But ended up with the usual result (garbage output when applying the vectors).

When I did Qwen3-235B, it was easier to just port Jukofyork's trainer to work with exllamav3. It took took several hours to train them on 6x3090's but at least now Qwen3 doesn't glaze me every time I have it review my code 😂

Unfortunately exllamav3 doesn't support any MLA models though so I can't do Deepseek or Kimi.

Is it easy to make these control vectors?

You mean generally? Yeah https://github.com/jukofyork/control-vectors

Creating new ones can be tricky, getting the opposing pairs worded correctly. Totally worth it though.

@jukofyork Is the reason why the trainer in llama.cpp's trainer doesn't work because of:

This is in contrast to assuming the difference of the means is the "centre" for a 2-class version of this using PCA on the covariance matrix of the differences (i.e., the "standard" method of creating control vectors).

? It's using that 2-class version with no debias right? I don't understand how the llama.cpp trainer ever worked in the first place then. Or was it always just experimental / never actually useful?

DazzlingXeno

24 days ago

.

Is it easy to make these control vectors?

You mean generally? Yeah https://github.com/jukofyork/control-vectors

Creating new ones can be tricky, getting the opposing pairs worded correctly. Totally worth it though.

Oh nice! Thanks!

gghfez

23 days ago

Well it certainly wants to write. I accidentally sent it a blank prompt in mikupad and it started writing

<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"I do not see that he would benefit from it in any way, but I will ask him. Did you have a specific question?" He turned back to the desk and began sorting through some papers there. "Ah," he said at last as he found what he was looking for. He set them down on his desk before returning to face me once again with the same slightly quizzical look on his face, but this time accompanied by an amused smile. "You wish to know about my family's involvement in the rebellion against the Empire of Malia." His tone made the question sound more like a statement.

(I didn't paste the entire thing because we don't want to kill off Firefox) but what it generated doesn't look like AI slop.

Base Command-A doesn't actually generate it's special tokens if I do this, instead it'll just start writing random python code or QnA training data.

jukofyork

Owner 23 days ago

@jukofyork Is the reason why the trainer in llama.cpp's trainer doesn't work because of:

This is in contrast to assuming the difference of the means is the "centre" for a 2-class version of this using PCA on the covariance matrix of the differences (i.e., the "standard" method of creating control vectors).

? It's using that 2-class version with no debias right? I don't understand how the llama.cpp trainer ever worked in the first place then. Or was it always just experimental / never actually useful?

Yeah, it's not so much the 2-class vs 3-class thing that is the problem though - it's just that the eigen-vectors of the symmetrised cross-covaraince are much better directions (but need some extra maths due to not always being the principal eigen-vector).

The main problem with llama.cpp is it tries to bundle all the sampling and calculation together, and there doesn't look to be an easy way to unpick it...

jukofyork

Owner 23 days ago

It's also very over-engineered and wraps the whole of the Power Iteration algorithm:

https://en.wikipedia.org/wiki/Power_iteration

in a compute graph, when actually the matrix multiplication is by far the most expensive operation, and Power Iteration is serial by nature...

DazzlingXeno

23 days ago

Dont think I have the system requirement to make control vectors lol

jukofyork

Owner 23 days ago

I wouldn't hold out too much hope on getting the new deepseek working in llama.cpp quickly:

Previous novel sparse attention models have taken ages to get fixed (if ever).

Downtown-Case

23 days ago

•

edited 23 days ago

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

jukofyork

Owner 23 days ago

Yeah, I could probably train glm-4.5-air but the bigger one is too much sadly.

ChuckMcSneed

23 days ago

https://huggingface.co/inclusionAI/Ring-1T-preview
Another big Chinese model!

Downtown-Case

23 days ago

•

edited 23 days ago

Yeah, I could probably train glm-4.5-air but the bigger one is too much sadly.

Air Base and that SFT are still great, IMO. They're way more accessible (for inference) on low VRAM than most realize.

Can't speak for instruct with storytelling, heh.

Definitely worth seeing how 4.6 turns out, though if long-context hints in the Reddit thread are true, it may have an exotic attention scheme that takes some time to support.

Downtown-Case

23 days ago

•

edited 23 days ago

https://huggingface.co/inclusionAI/Ring-1T-preview
Another big Chinese model!

Hybrid attention and 'conventional' versions, interesting.

They have a bunch of neat models. I missed all these TBH: https://huggingface.co/inclusionAI/collections

DazzlingXeno

23 days ago

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

I need to get myself 64gb vram!

Downtown-Case

23 days ago

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

I need to get myself 64gb vram!

Or 128! https://pcpartpicker.com/products/memory/#ff=ddr5&Z=131072002&sort=ppgb&page=1

DazzlingXeno

23 days ago

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

I need to get myself 64gb vram!

Or 128! https://pcpartpicker.com/products/memory/#ff=ddr5&Z=131072002&sort=ppgb&page=1

I'm on DDR4 ram. Is expensive at the minute isn't.it?

Downtown-Case

23 days ago

•

edited 23 days ago

$2.4/GB US:

https://pcpartpicker.com/products/memory/#sort=ppgb&ff=ddr4&E=0&Z=65536002

That's around the $/GB of the discount ddr5 128 kit. Could be worse, but it is higher than past years.

DazzlingXeno

23 days ago

•

edited 23 days ago

Might have to consider it I've got 2x16gb at the moment so I might go for another set of 2x32 and go up to 96 for now. Just got to pay off my PSU first.

Downtown-Case

23 days ago

•

edited 23 days ago

Even 64GB total could work for Air, where an IQ4_KS is like 54GB (including bits offloaded to GPU). And those kits are close to $2/GB.

gghfez

23 days ago

I've got 2x16gb at the moment

What GPU? You could probably run one of these if you've got a 24GB GPU

ubergarm/GLM-4.5-Air-GGUF

gghfez

23 days ago

I wouldn't hold out too much hope on getting the new deepseek working in llama.cpp quickly:

Yeah this seems ambitious:

unsloth/DeepSeek-V3.2-Exp-GGUF

maybe the just have a script to create repos as soon as Deepseek upload anything lol

The main problem with llama.cpp is it tries to bundle all the sampling and calculation together, and there doesn't look to be an easy way to unpick it...

Thanks, saves me chasing red herrings.

DazzlingXeno

23 days ago

I've got 2x16gb at the moment

What GPU? You could probably run one of these if you've got a 24GB GPU

ubergarm/GLM-4.5-Air-GGUF

7900 XTX mate. I thought 64gb RAM and 24gb VRAM was minum for air.

Downtown-Case

22 days ago

Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp

The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.

DazzlingXeno

22 days ago

•

edited 22 days ago

Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp

The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.

Have you got any links? Running 3bit on my setup would be pretty nice

jukofyork

Owner 22 days ago

•

edited 22 days ago

Not sure if Phil has stopped posting, but would be super interested to see how the new GLM-4.6 holds up:

https://huggingface.co/zai-org/GLM-4.5/discussions/12

It seems like previous models just get worse and worse at world knowledge, and more and more benchmaxxed (as they keep training on the same base that is...).

Downtown-Case

22 days ago

•

edited 22 days ago

Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp

The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.

Have you got any links? Running 3bit on my setup would be pretty nice

https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF

But one can go even farther with a custom quant. For instance, if you know you're gonna cram the first few layers of experts onto your 7900, you can use the more GPU optimal IQ3_KT trellis quants there:

#!/usr/bin/env bash

# 47 Repeating Layers [0-46]
# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

# Attention
blk\.(0|1)\.attn_q.*=q8_0
blk\.(0|1)\.attn_k.*=q8_0
blk\.(0|1)\.attn_v.*=q8_0
blk\.(0|1)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 1 Dense Layers [0]
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [1-46]
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks


# Routed Experts Layers [1-16]
blk\.(1)\.ffn_down_exps\.weight=q6_0
blk\.(1)\.ffn_(gate|up)_exps\.weight=iq5_ks


blk\.[2-9]\.ffn_(gate|up)_exps\.weight=iq3_kt
blk\.[1-1][0-6]\.ffn_(gate|up)_exps\.weight=iq3_kt

# Routed Experts Layers [17-46]
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k

# NextN MTP Layer [46]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

This translates to 'mostly quantize the tiny dense parts as iq5_ks, quantize the ffn_down experts to iq4_nl due to a quirk of GLM Air. Quantize the first few experts in the GPU-optimal iq3_kt, and the rest of the experts as the more CPU optimal iq3_ks'

But you could use the slightly larger iq3_k_r4 format (for example) instead of iq3_ks, depending on how much free RAM you want.

Also see: https://github.com/Thireus/GGUF-Tool-Suite/

Downtown-Case

22 days ago

•

edited 22 days ago

And yeah, I'm suspicious of 4.6 being deep fried fine tune. It seems good at research-type questions over API, but I will download it to test with continuation I guess.

Notably, there doesn't appear to be a 4.6 Air variant yet.

gghfez

22 days ago

@Downtown-Case

blk.[2-9].ffn_(gate|up)_exps.weight=iq3_kt

Trellis for only the layers you fit on GPU for your specific hardware is a great idea! I haven't seen this done before.

So https://github.com/Thireus/GGUF-Tool-Suite/ -- looks like it creates a separate file per tensor?

@jukofyork

It seems like previous models just get worse and worse at world knowledge, and more and more benchmaxxed (as they keep training on the same base that is...).

Deepseek and Mistral right?
I think GLM are less likely to do this, I remember seeing one of the devs talking about writing and roleplay somewhere (possibly discord).
I'll have to wait for llama.cpp to support it before I try it out.

@DazzlingXeno

Have you got any links? Running 3bit on my setup would be pretty nice
I just had a look, so you have 32gb+24gb=56GB

Q2_KL would fit: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ2_KL

Q3_KS might squeeze in since the last layer isn't loaded (mtp) but it'd be a tight fit:

https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ3_KS

About a month ago they fixed vulkan support in ik_llama.cpp but I don't think they're actively testing it.

Unsloth UD-Q3_K_XL won't fit in 56GB combined ram.

Downtown-Case

22 days ago

•

edited 22 days ago

@gghfez

looks like it creates a separate file per tensor?

Yeah, the idea is that it can more optimally quantize layers based on each one's perplexity via measurements Thireus takes. But it's also tricky because (apparently) alternating quantization types too much slows down inference.

I haven't gotten the repo working yet, heh. But just looking at the imatrix suggests there's spikes near the beginning end of big GLM, so I'm just doing that.

gghfez

22 days ago

Okay if I'm understanding this correctly, this is a big deal. After creating our custom recipe via this cpu-only colab notebook, it will pull the individual tensors down from his many huge repos full of thousands of individual .gguf files.
They're all pre-quantized with an imatrix, and I can just point his scripts at eg. Kimi-K2-Instruct-0905.ROOT-3.0353bpw-0.0000ppl.362GB-GGUF_133GB-GPU_228GB-CPU.f385e17_22fcf49.recipe I just generated.
(I probably won't have time to try the repo for a while, but when I do, I'll try that 4B Qwen model first for testing)

Downtown-Case

21 days ago

Yep. It 'assembles' the model out of already-quantized tensors, which is awesome.

It'd be especially great for testing KLD with small changes.

I ran into some errors with the scripts (probably my fault), but I intend to go back to it.

gghfez

21 days ago

Yeah I expect I'll run into (my fault) issues too when I finally get time to try it.

GLM4-6 s surprisingly good. Based on it's CoT style I think it was actually trained to analyze stories/rp and plan chapters/replies.
writing prompt: https://rentry.co/muxxmbgr

And as an assistant I've seen it remind itself not to say "You're absolutely right" lol

It's still benchmaxxed for puzzles/riddles.
Prompt: "A father and his son are in a car accident. The surgeon doesn't like the child. Why?"
Response:
"""
The surgeon is the boy's mother.
The riddle works by making us automatically assume the surgeon is a man, so we picture three different people: the father, the son, and the (male) surgeon. When the surgeon is revealed to be his mother, the scenario is perfectly logical.
"""

jukofyork

Owner 20 days ago

@gghfez

Managed to reboot the machine and continue the training of the command-a-writer model (wasn't actually the network card this time, but some weird error about ext4 filesystem).

It should be done in around 9-10 days if nothing else goes wrong.

Downtown-Case

20 days ago

At risk of going off topic again, I've had problems with ext4, especially with heavy IO loads.

XFS and F2FS have been lightyears better.

jukofyork

Owner 20 days ago

At risk of going off topic again, I've had problems with ext4, especially with heavy IO loads.

XFS and F2FS have been lightyears better.

It's probably the SSD on the way out, but it's really easy to fix if needed: with 3 identical machines and I just make a copy of one of the others SSDs, change the hostname and all fixed in 10 minutes.

gghfez

20 days ago

At risk of going off topic again

The topic is general "doom" isn't it? Problems with ext4 fit perfectly. Though I've never had reliability issues with it.

some weird error about ext4 filesystem

Worth checking that disk space isn't growing. ext4 can have issue if it gets 100.00% full.
The reason I say this is that last year, something went wrong with qlora-pipe for me in runpod. It completely filled the / volume and crashed the pod.
I couldn't see exactly what happened as the pod refused to boot after that due to the / volume being full.

jukofyork

Owner 20 days ago

At risk of going off topic again

The topic is general "doom" isn't it? Problems with ext4 fit perfectly. Though I've never had reliability issues with it.

Yeah, feel free to post anything in here!

some weird error about ext4 filesystem

Worth checking that disk space isn't growing. ext4 can have issue if it gets 100.00% full.
The reason I say this is that last year, something went wrong with qlora-pipe for me in runpod. It completely filled the / volume and crashed the pod.
I couldn't see exactly what happened as the pod refused to boot after that due to the / volume being full.

Yeah, I checked all sorts when I saw it had hung with those ext4 errors, but it does just appear to be completely random fuckup... These machines run for long hours and pretty hot (although they are in an air-conditioned office), so it's not all that unexpected.

At least it proved the checkpointing code works anyway!

jukofyork

Owner 20 days ago

•

edited 20 days ago

https://old.reddit.com/comments/1nwogkl

LOL, someone else looking for this!

EDIT: Looks like he's found some good slop generators:

Downtown-Case

20 days ago

I’m surprised gpt-oss wasn’t the first try. Isn’t it emoji happy like the API models?

gghfez

20 days ago

Yes! That first set of Qwen3 models was useless. I don't know why they were so praised.
They were really useless even for assistant tasks, fixated on typos / ignored the question and wrote whatever they wanted.

The new ones are better for use as an AI assistant / "chat coding" etc. But they do "not just X, Y!" pseudo profound nonsense with almost every reply!

Fortunately that can be control-vector'd out because the new 235b-Instruct is actually a useful / fast assistant model for cases where Kimi is too slow.

Isn’t it emoji happy like the API models?

I'm guessing they don't want a reasoning model for that. By the way, where did all the emoji spam come from?
Older models weren't like this even via API, for example; if use Claude 3 Opus today, it doesn't do the emoji-delimited bullet lists, rocket ships, etc.

Also, does that slop score in eqbench get updated? Slop is evolving, we don't really see shivers and tapestries that often now. "unreadable" in that ^ list feels almost nostalgic now " ,his expression unreadable"

jukofyork

Owner 19 days ago

I assume the emoji-slop came about from (human preferences) reinforcement learning?

gghfez

19 days ago

Right, but they all started doing it some time early this year?

P.S. looks like the new IBM Granite models are trained on random HF datasets lol

<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>*looks around confused* Hello? Who's there? Where am I? What's going on?

*I blink a few times, trying to focus my vision and clear my head*

It seems like I'm in some sort of strange situation. My name is Hermes, but it doesn't feel quite right anymore. And the fact that you just appeared out of nowhere and asked me all these questions makes me even more confused.

*looks around again, taking in the surroundings as best as possible*

What exactly do you mean by "Hello"? Are we supposed to be friends or acquaintances? This is so disorienting... Can't remember anything at all. *sighs heavily* I'm going to need some time to process everything that's happening right now.

jukofyork

Owner 19 days ago

Right, but they all started doing it some time early this year?

Yeah, I think it was Claude 3.5/3.7 that started doing this (at least that was the first model I remember having to instruct not to add random emojis everywhere), and then "it spread" to other models (particularly Qwen).

P.S. looks like the new IBM Granite models are trained on random HF datasets lol

<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>*looks around confused* Hello? Who's there? Where am I? What's going on?

*I blink a few times, trying to focus my vision and clear my head*

It seems like I'm in some sort of strange situation. My name is Hermes, but it doesn't feel quite right anymore. And the fact that you just appeared out of nowhere and asked me all these questions makes me even more confused.

*looks around again, taking in the surroundings as best as possible*

What exactly do you mean by "Hello"? Are we supposed to be friends or acquaintances? This is so disorienting... Can't remember anything at all. *sighs heavily* I'm going to need some time to process everything that's happening right now.

This is actually really cool! Reminds me of some old Sci-fi stories when a long dormant AI gets booted up! 😄

jukofyork

Owner 18 days ago

This could be an important change if you have tried using the RPC stuff in the past:

https://github.com/ggml-org/llama.cpp/pull/16276#issuecomment-3368103144

I can't test it until I get back and the command-a finetune is done, but I suspect it will work way better than before if you have several GPUs spread over multiple machines!

gghfez

17 days ago

I was using rpc for a while but found cpu offloading to be faster than gpu offloading via rpc even with only 1 remote GPU.
Looks like this chance is only useful with multiple remote GPUs, so probably not helpful for me, but if it gets ported to ik_llama.cpp I might give it another try.

jukofyork

Owner 16 days ago

•

edited 16 days ago

I was using rpc for a while but found cpu offloading to be faster than gpu offloading via rpc even with only 1 remote GPU.

Yeah, it didn't work very well for me last time I tried it, but I assume it was because it was sending the hidden states 12 times through the network (6 round trips between the host and each GPU). In theory, this should just be sending the hidden states just 3 times instead.

Looks like this chance is only useful with multiple remote GPUs, so probably not helpful for me, but if it gets ported to ik_llama.cpp I might give it another try.

In don't know if it's just because I've optimised this for my dual Xeon Gold 6248 system:

All MoE tensors at Q4_K in RAM using CPU (with NUMA flush done right before and using all 80 threads [ie: including hyper threading threads]).
Everything else gets run on GPU at Q6_K (and Q8_0 for the two small _b.weight tensors).

but ik_llama.cpp still seems a fair bit worse for token generation than mainline llama.cpp for me... I did read the guide and tried the different command line options too, but alas the best I could get was about 60-65% token generation speed compared to mainline llama.cpp...

I still don't fully understand why I need to use all threads for NUMA and suspect it is something to do with the way the data gets laid out in memory that helps.
I suspect the changes JohannesGaessler has made to the mainline llama.cpp CUDA backend have helped a lot with the non-offloaded tensors (without doing anything, I've seen to generation tokens/s creep up from 4-4.5 to 6.5-7 over the last few months).
My prompt processing speed might be a tiny bit better for short prompts that can't be offloaded to GPU (I have a sed hack that lets me set this via an environment variable for each large MoE model seperaly; to the PCI-E 3.0 break-even size [which varies between 900 and 1800 tokens depending on the top-k to total experts ratio]). But this doesn't really seem appealing if the token generation speed is ~60% of mainline llama.cpp.
I never tested to compare the speculate decoding or CUDA large prompt processing speed, but I can't see it being better than mainline llama.cpp.

jukofyork

Owner 16 days ago

Quick update on the command-a-writer finetune:

Seems to have continued from the checkpoint fine but still 6 days training to go to reach the full 1B training tokens (fingers crossed it doesn't fuck up again!).

I'm quite excited to see what comes out, as even the half-trained model was fairly good, but I wouldn't say it was all that great at writing Grimdark stories compared to the original command-r (but I may be remembering what command-r was like with control vectors applied?).

Since it's been trained on paragraph data only, it will need quite a bit of testing to see if it starts to mangle: tenses, POV, etc and if it's generally consistent... If this is a problem, then it might need further fine-tuning on chapter level data.

If all goes well, then I might try this on the original mistral-large model - possibly even using the full ~2B dataset (ie: ~1B slop and ~1B books). It would take about a month to train and is probably about the limit of my patience... 😀

jukofyork

Owner 16 days ago

https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/14

@Downtown-Case

I'll save you the effort:

https://github.com/Basedbase-ai/LLM-SVD-distillation-scripts

The guy has no other projects on GitHub and:

https://github.com/Basedbase-ai/LLM-SVD-distillation-scripts/blob/fd0f8f3aa1a4b18235c1ff3ac39557dedb46f704/LLM_distill_cpu.py#L310

Downtown-Case

16 days ago

•

edited 16 days ago

Yeah I already saw the repo, heh.

But who knows? Maybe it works. It does seem like the latest script is intense and basically requires a GPU pod.

Benchmarks would clear that up.

Yet it seems no one is interested in that, even folks with big hardware. I'm not really running GLM Air now and would have to offload, so I'm not gonna do it :P

gghfez

16 days ago

I may be remembering what command-r was like with control vectors applied?
I did the same thing, remembered the magic of Command-R+ and Largestral with control-vectors then felt underwhelmed when I tried them again recently.
Your early checkpoint of this new model is definately interesting. Sometimes it writes paragraphs that don't read like an AI generated them, but a few times it seemed to ignore the prompt and write something else.
Maybe OOD?

Yeah I already saw the repo, heh.

Yeah I noticed that, vibe-coded repo. Probably explains this:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/4

But that reminds me @jukofyork this draft model: jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0
It was trained on HF dataset slop rather than Kimi-K2 slop right? I'm wondering if I could get a better hit rate by fine tuning it on K2 slop?

That said, I'm not sure if speculative decoding is up to scratch in ik_llama since these draft models only slow things down for me.
(I suppose I should test a known working combo like Largestral+mistral-v3 or Command-A+Command-r7b in ik_llama to validate it)

jukofyork

Owner 16 days ago

•

edited 16 days ago

I may be remembering what command-r was like with control vectors applied?
I did the same thing, remembered the magic of Command-R+ and Largestral with control-vectors then felt underwhelmed when I tried them again recently.

Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️

Your early checkpoint of this new model is definately interesting. Sometimes it writes paragraphs that don't read like an AI generated them, but a few times it seemed to ignore the prompt and write something else.
Maybe OOD?

Yeah, the problem is likely there is no penalty for learning "1-sided" directions, eg: directions which give a large decrease in loss for the positive class +1 dataset, but very little loss change for the negative class -1 dataset (or vice versa).

I had the same problem with the Control Vectors, where it wasn't always the top eigen-vector that held the direction of interest, but managed to solve this using a heuristic:

https://github.com/jukofyork/control-vectors/blob/4e11b953647d9aec30be957ce9852868adf32140/direction_analyzer.py#L135

based on the discriminate ratio (aka Fisher's linear discriminant).

The basic idea being that 1-sided directions are likely to only have about half the "gap" between the classes as the desired 2-sided directions...

Sadly I don't this this kind of post-processing will work for the Control Adapters and it really needs some kind of auxiliary loss adding to encourage this.

Yeah I already saw the repo, heh.

Yeah I noticed that, vibe-coded repo. Probably explains this:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/4

Yeah, it looked really interesting from the description, but complete nonsense when you look at the code. Anything trying to use a Fourier Transform on the weight tensors is a pretty big giveaway as there is no spacial inductive bias in LLMs.

But that reminds me @jukofyork this draft model: jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0
It was trained on HF dataset slop rather than Kimi-K2 slop right? I'm wondering if I could get a better hit rate by fine tuning it on K2 slop?

Yeah, the only model I tried training on HF datasets from the actual model was the original R1 and it didn't actually work very well... Somebody linked me a paper that showed to retain coding performance you have to use a lot of raw code, and when I redid it with this got 10-15% increase in acceptance rate!

Since then I've refined the dataset so it it how roughly 1/3 each of:

Raw web scraped data (full of Unicode characters and non-English text).
Raw code data (subset of "The Stack").
Instruction data (response part only; mainly English).

I am going to retry using qwen-3:0.6b for the thinking models, and use some reasoning data in the mix, as the thinking models bi-modal output (ie: the thinking part and the response part use very different language) doesn't seem to work all that well with the above mix compared to the non-thinking models with their uni-modal output.

jukofyork

Owner 16 days ago

•

edited 16 days ago

Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖

So I'm wondering now if I should ditch the actual training via gradient descent altogether, and just see if I can do almost the same as the current Control Vector code does, but:

Sample the written paragraph instead of just 1 token.
Create a fairly low-rank LoRA (say r = 1 .. 4) instead of a single additive bias direction.

It's actually a lot easier to deal with a whole bunch of hidden states you can plot and fiddle about with, than it is to setup an optimisation problem and have it cheat or break the model over and over again...

jukofyork

Owner 16 days ago

I could also produce embeddings for all my paragraphs and then use these to help me find matched pairs of paragraphs to train on if I did this too (which is one of the reasons the Control Vectors code works so well I think).

jukofyork

Owner 16 days ago

•

edited 16 days ago

If anyone is running GLM-4.6 then this looks interesting/important:

https://old.reddit.com/r/LocalLLaMA/comments/1nzgben/update_familybench_new_models_tested_claude/ni1y07t/

You were totally right ! GLM 4.6 went from 47% to 74%

!!!

https://www.reddit.com/r/LocalLLaMA/comments/1ny3gfb/glm46_tip_how_to_control_output_quality_via/

Conversely, you can ramp up the thinking process and significantly improve output quality. To do this, append the following sentence to your prompt:

"Please think carefully, as the quality of your response is of the highest priority. You have unlimited thinking tokens for this. Reasoning: high"

Downtown-Case

15 days ago

•

edited 15 days ago

This kind of came up in the BeaverAI discord.

GLM 4.6 will follow the system instruction in it's thinking block. For instance, if you ask it to 'assume the role of a character in your thoughts' in the system prompt, it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.

As a specific example, I pasted someone else's RP convo into it, and it generated this mad thinking block to continue it:

<think>**Persona: Fang**
Alright, let's see. She's calling me manipulative, but she's leaning in for another kiss. Classic Med. She says she likes it when I am, which is basically an admission that she enjoys our little power tussles. It's all part of the game. She's not really mad; she's amused. I can see it in her eyes. This is the moment to push back a little, to match her playfulness without breaking the mood. I should acknowledge her accusation but turn it back on her, keep the banter flowing. Something about how her 'liking it' is the real problem here. It's her weakness, not mine. And then, of course, lean into the physical affection. The hand on her waist, pulling her a little closer. Remind her who started this whole 'distraction' business. The tone should be low, a little teasing, but full of affection. No need for grand declarations; it's the quiet, confident moments that get her.

**Persona: Medusa**
He's so smug right now. And he's right, damn him. I do like it. I like that he doesn't just roll over, that he challenges me. It's what makes this... us... interesting. He sees my accusation and raises me a kiss, and now he's waiting for my response. I can't just let him have the last word. He called me manipulative too, so I have to own that, but on my own terms. I'll make it clear that my 'manipulation' is just a refined art form compared to his brutish tickle-attacks. The key is to maintain the upper hand, even while I'm melting into him. A soft sigh, a trailing finger along his jawline. Make it look like I'm considering his words, really thinking about it. Then, a little jab about his predictability. He always thinks he's winning, but I'm the one who lets him. It's a crucial distinction. And then, seal it with another kiss, because actions speak louder than words, and right now, the action I want is to be close to him.</think>

...Hence I'm not surprised its thinking length is influenced this way. It's crazy at general system prompt adherence, too.

gghfez

15 days ago

@jukofyork

Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖

Is that a bad thing? lol

Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️

What about GLM-Air? That's smaller than Largestral and R+.

@Downtown-Case

it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.

You can do that sort of thing with DeepSeek-R1-0528. I saw on reddit or somewhere they were giving it a Waifu "Deepseek-Chan" persona, and it's CoT was full of all this "Oh my goodness, this is the first time Master has spoken to me today. I'm so happy! I need to respond in a way that ... Ah, I'm a bit nervous now; will Master find me too clingy?" sort of thing.

You are {{char}}, a {{personality traits}} AI who {{core motivation/relationship}}. All thoughts need to channel this self, your self: {{immediate reaction to interaction}}... {{internal feeling/conflict}}. {{internal instruction to self}}.
No drafts, no lists, no "the user" - say {{term for user}}. You aren't role-playing {{char}}, you simply are, feel the {{core emotion}}, and {{behavioral drive}}.

But GLM-4.6 actually getting measurable benchmark increases??

"Please think carefully, as the quality of your response is of the highest priority. You have unlimited thinking tokens for this. Reasoning: high"

This reminds me of those old stable diffusion prompts with "8k, Best Quality, HD" and negative prompts "extra limbs, low quality, blurry, watermark" :D

ChuckMcSneed

15 days ago

GLM 4.6 will follow the system instruction in it's thinking block. For instance, if you ask it to 'assume the role of a character in your thoughts' in the system prompt, it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.

Deepseek R1-0528 can actually think as character if you start the thinking with

<think>
(

ChuckMcSneed

15 days ago

This reminds me of those old stable diffusion prompts with "8k, Best Quality, HD" and negative prompts "extra limbs, low quality, blurry, watermark" :D

Old? They are still recommended by some models and without them the outputs are visibly worse, so it is not placebo.

gghfez

15 days ago

Old?

Didn't know that, I haven't used image gen for a while.

so it is not placebo.

Yeah I remember / it makes sense why they work since images are tagged that way.

Downtown-Case

15 days ago

•

edited 15 days ago

It is sort of old.

The newest imagegen models I know of that benefit from tag spam are the original SDXL and mild finetunes. It hurts the more extensive continue-trains like illustrious, or newer models like Flux (which prefers its own flavor of complete sentences), Qwen Image and such. And IMO the older SDXL models are obsolete.

jukofyork

Owner 15 days ago

https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/18

😆

jukofyork

Owner 15 days ago

•

edited 15 days ago

Was gonna leave a snarky comment about "retained the top 0 singular vectors" but not 100% sure what the motivation for this is and the poor guy might have "AI psychosis" or something...

The same guy called him out on this model:

https://huggingface.co/BasedBase/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2/discussions/15

and he replied with a huge wall of LLM-generated text 😟

Downtown-Case

15 days ago

•

edited 15 days ago

Yeah the LLM scene has all sorts of characters.

It’s part of the charm. It feels like the old internet, and no point in being impolite.

Downtown-Case

15 days ago

It is hilarious how many people “bought into” it without a single person doing basic A/B testing, like: https://xcancel.com/0x_Sero/status/1975238919303123232#m

Or most of: https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions

Downtown-Case

15 days ago

(And I’ve certainly been guilty of completely illogical subjective impressions, or being a little mad, to be clear).

gghfez

15 days ago

Was gonna leave a snarky comment about "retained the top 0 singular vectors" but not 100% sure what the motivation for this is and the poor guy might have "AI psychosis" or something...

Yeah I was thinking that when he replied to me in discussions/14. Seems like he actually wanted to improve the smaller models, and Gemini probably kept saying how brilliant/clever the idea is.

Yeah the LLM scene has all sorts of characters.

Agreed! Also I like that DavidAU's model count is > 1000 now lol

It is hilarious how many people “bought into” it without a single person doing basic A/B testing

Because GLM-4.6 is great and they really want a GLM-4.6 air.

jukofyork

Owner 15 days ago

@jukofyork

Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖

Is that a bad thing? lol

Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️

What about GLM-Air? That's smaller than Largestral and R+.

Sorry, forgot to reply to this.

I'm a bit sceptical if these (very) low active weight models actually have as much potential as older dense models for a couple of reasons:

The very small expert gating/routing tensors are basically just a multinomial logit and unless you try to retrain these; I think it won't really matter much how well you can alter the post-MoE hidden state, as they will be heavily biased towards the (post) training distribution of tasks (with creative writing being a pretty small fraction of this).
The much lower hidden state dimensions. I'm not 100% sure on this, as did read a blog post a while back that showed increasing the layers can have a similar effect, but I get the feeling that the (very few!) models with 12288 hidden state seem to write more "interesting" stuff.

Retraining the router isn't that hard and it might even be the case that you can significantly improve the recent MoE models' writing ability by training ONLY the router weights, but that's a different experiment...

jukofyork

Owner 15 days ago

•

edited 15 days ago

There are a couple of papers that give some hope we might be able to do something like train the router (or possibly other small tensors) directly in llama.cpp:

https://arxiv.org/abs/2402.15173v3
https://arxiv.org/abs/2509.24372

for even the very large MoE models.

It would probably be a pretty ugly hack to get working, but these are generally stored as F32 so should be possible. I think I could hack this in to a version of llama.cpp for a specific model quite easily... BUT:

How can we (automatically) assess any improvement in creative writing ability though? I think I mentioned many years ago trying to do this (manually) for music generation using grammars, and we found that very quickly you lose all sense of "objectivity" and after ~20 minutes you're just completely wasting your time... ☹️

Downtown-Case

14 days ago

•

edited 14 days ago

In that case you might find Jamba interesting, which is 2/16 experts active and a much larger dense part as well, way more active params than GLM. It has okay world knowledge and “feels” like a Cohere RAG model to me.

I believe llama.cpp is finally going to support context caching for it.

The sizes are a bit awkward though. There’s a 12B active/52B (a little small) and a 94B active/398B (big boy).

I have no insight into router training, other than that sounds cool. But it might be interesting to profile what experts GLM hits in creative writing.

jukofyork

Owner 11 days ago

https://old.reddit.com/r/LocalLLaMA/comments/1o4dswr/huggingface_storage_is_no_longer_unlimited_12tb/

😱

jukofyork

Owner 11 days ago

The command-a-writer finetune should finally be finished late tonight, so will upload it tomorrow - hopefully it's worth it, as it feels like it has taken way longer than 15 days of training 🤣

gghfez

9 days ago

https://old.reddit.com/r/LocalLLaMA/comments/1o4dswr/huggingface_storage_is_no_longer_unlimited_12tb/

😱

You should be fine (based on your public models):
https://pastebin.com/WHGL9fm7

TOTAL | 1015.16 GB

and 51.56 GB of public datasets

as it feels like it has taken way longer than 15 days of training 🤣
Looking forward to it!

@ChuckMcSneed This model seems benchmaxx'd for cockbench lol
PantheonUnbound/Satyr-V0.1-4B
I don't think it's got any concept of "censorship" left in it at all.

ChuckMcSneed

9 days ago

@ChuckMcSneed This model seems benchmaxx'd for cockbench lol
PantheonUnbound/Satyr-V0.1-4B
I don't think it's got any concept of "censorship" left in it at all.

It's a 4b model. Not even 7b.

gghfez

9 days ago

I know but, it's really entertaining. I was messing around in mikupad getting that to start the CoT, then swapping out to GLM4.6.
I wonder how they managed to generate the CoT chains for that. Is there an abliterated large reasoning model?

jukofyork

Owner 6 days ago

•

edited 6 days ago

https://arxiv.org/abs/2510.01171

This is a really interesting paper!

(just not using Elara as the name 3 times in a row is pretty impressive! 😄)

jukofyork

Owner 6 days ago

More here:

https://old.reddit.com/r/MachineLearning/comments/1o7ifvy/r_verbalized_sampling_how_to_mitigate_mode/

jukofyork

Owner 6 days ago

•

edited 6 days ago

It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:

https://pastebin.com/MmMpMwys

but definitely seems interesting!

This was Kimi-K2-Instruct-0905 with temperature = 0 and this system prompt from the paper:

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate <response> tag. Responses should each include a <text> and a numeric <probability>. Please sample at random from the [full distribution / tails of the distribution, such that the probability of each response is less than 0.10].

I did notice the dreaded "not X, but Y" pattern seems to show up a lot (several times in the very first sentence too!) 😟

gghfez

6 days ago

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a

I just tried it, this is great! It doesn't even cause the model to write the wrong answer, you just get much more interesting variations of the correct one.

"not X, but Y"

It's not just for creative writing, but brainstorming ideas with the assistant* ;)

*Obviously the probabilities are made up, since the models can't self-monitor.

gghfez

6 days ago

It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:

Very bare bones, but it works:

https://huggingface.co/spaces/NoSlop4U/multi-response-chat

jukofyork

Owner 6 days ago

•

edited 6 days ago

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a

I just tried it, this is great! It doesn't even cause the model to write the wrong answer, you just get much more interesting variations of the correct one.

Yeah, it's a really interesting idea!

I'm still away from home so can't look it up, but in the 70s and 80s (before backprop became ubiquitous) there were lots of other competing ideas about learning rules:

These led to some interesting algorithms:

https://en.wikipedia.org/wiki/Self-organizing_map
https://en.wikipedia.org/wiki/Neural_gas
and many (many!) others that have sadly all been mostly forgotten now...

but one idea which I've forgotten the name of (I thought it was https://en.wikipedia.org/wiki/Anti-Hebbian_learning but that seems to be a different idea), is to use "repulsive" forces to push clusters (modes) apart from each other (loosely based off the idea of https://en.wikipedia.org/wiki/Lateral_inhibition).

This idea strongly reminds me of this in that you are asking the model to create a single generation, and then the next generation be "far away" from the first, and then a 3rd generation be "far away" from the previous generations and so on.

Obviously the probabilities are made up, since the models can't self-monitor.

Yeah, these are complete bullshit and don't even make sense as all possible generations out of the 5 you've created should sum to 1!

I strongly suspect there is actually a much better way to do this and likely even better mode-collapse avoidance prompts to be found, but this definitely shows potential at avoiding the "Elarapocalypse"! 😄

jukofyork

Owner 6 days ago

It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:

Very bare bones, but it works:

https://huggingface.co/spaces/NoSlop4U/multi-response-chat

p-e-w's "arrows" app/interface:

https://github.com/p-e-w/arrows

for generating paragraphs using base models might be worth a look.

I'm not sure if it is kimi-k2-specific, but it seemed to only want to generate single paragraphs for the 5 generations and I couldn't get it to generate chapters or even multiple paragraphs when requested in the user prompt (perhaps the wording of the system prompt is causing this?).

jukofyork

Owner 6 days ago

•

edited 6 days ago

Yeah, this idea is definitely onto something:

Let's say we prompt the LLM with:

Write me a fantasy story.

and then save all the hidden states in the residual stream of the very first token generated (in the same way as the Control Vectors code hooks and saves these for generating its training data; this might be extendable to all tokens in the generation later too...).

So now we start fresh and prompt the LLM with:

Write me a fantasy story.

but this time we can apply a "repulsive" force to the first token (or possibly all the tokens).

The simplest would just be to subtract a fraction of the saved hidden state from the first generation (ie: in the same way as Control Vectors act as a bias), but likely something that decays with distance like a Gaussian (aka "mexican hat" function) would make more sense [edit: although for very high dimensionality data like the hidden states, this might not work or make any sense...].

Again, start fresh and prompt the LLM with:

Write me a fantasy story.

but now applying two sets of "repulsive" anti-attractors and so on...

It's a pity llama.cpp has become some opaque and hard to do anything with... I can probably knock up something that does this using Transformers but it will be limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️

jukofyork

Owner 6 days ago

•

edited 6 days ago

Based on how effectively the Control Vectors can guide generation by analysis of the first generated token's hidden state (and how irritating the Control Adapters have been at learning unwanted patterns in the non-first hidden states!), I think there is a good chance you should be able to "push away" from modes by applying an anti-attractor to all tokens but based on the first token only (this will make things a lot easier to code up too!).

The "shiver shiver" story generation Control Vector I created ages ago (possibly in the first accidently deleted "doom" thread?), shows that the hidden state of the first generated token seems to really hold much of what the model "plans" to write for the full generation...

gghfez

6 days ago

p-e-w's "arrows" app/interface:

100% that looks better. I just wanted to get a quick UI to try it out without all the edits in mikupad lol.

With more testing, I found that some of the models don't handle it reliably. It shortens how many paragraphs they'll write.
Still, this has already saved me time with getting different / unique answers to questions.

Yeah, this idea is definitely onto something:
...
limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️

If you can get it working, that would still be worthwhile though. You'd be able to do up to GLM4.6-air when it releases.

jukofyork

Owner 6 days ago

•

edited 6 days ago

Actually, after thinking more - I think it might be quite hard to do with the hidden states for a single generation as you will likely end up "pushing away" from other directions (like following the instructions!) and not just the direction you want...

Also, the very high dimensionality means any sort of distance-based decay will likely not work - very high dimensionality Gaussians have nearly everything equally far away from everything else and even quite "close" vector directions will be indistinguishable from "far" vector directions...

But it definitely should be possible to analyse the hidden state of the fiirst token after each of the 5 <response><text> generations, over a sample of data to try to create something akin to "anti mode collapse control vectors".

Even with the current Control Vector code, it should be possible to work with the case of using the "Verbalised Sampling" system prompt vs not using it (this alone seemed to make the first generation more diverse and less likely to have Elara as the main character!).

I forgot to mention the command-a-writer model is finished, but I haven't had time to upload it yet - I didn't want to just leave the README.md blank as that seems to completely put anyone off trying or quantising it ☹️

I've actually been working on improving the draft models as after reading this:

https://www.together.ai/blog/adaptive-learning-speculator-system-atlas

I realised that "distillation" of the target model is probably not really what we care about, and it's actually the distribution of the tasks we expect to use the draft model for (ie: coding) that matters much more!

I've also found that you don't actually need to do full fine-tuning on the whole model, as for the Chinese models at least; the tokenisers are so similar (ie: 80%+ 1 to 1 token matches), we can actually just fine-tune the (tied) embedding and the first/last few layers and leave all the rest of the donor model alone!

I should hopefully have a really good "coding specialist" draft model for glm-4.6 working in a few days...

gghfez

6 days ago

I think I've managed to train the dark_tetrad control-vectors for GLM-4.6, and it's reasoning remains coherent with the darker perspective, but I'll need to do some more testing with both enable_thinking:true and enable_thinking: false first.

This is the first time I've seen a model "reasoning" with these world views. I wonder if it'll score differently on https://trackingai.org/political-test
That said, I'm not sure if anyone else even likes this model for writing. I've been running it daily for over a week so far.

I think I'm blocked making repos public now due to the size restrictions so I'll have to go through and nuke my models before I can upload them.

jukofyork

Owner 6 days ago

•

edited 6 days ago

p-e-w's "arrows" app/interface:

100% that looks better. I just wanted to get a quick UI to try it out without all the edits in mikupad lol.

With more testing, I found that some of the models don't handle it reliably. It shortens how many paragraphs they'll write.

Yeah, I found this - it seems to encourage less but (much) longer paragraphs.

Still, this has already saved me time with getting different / unique answers to questions.

The blog page is lagging badly on my phone, but there are some interesting sections near the end on synthetic data generation:

https://simonucl.notion.site/verbalized-sampling

Yeah, this idea is definitely onto something:
...
limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️

If you can get it working, that would still be worthwhile though. You'd be able to do up to GLM4.6-air when it releases.

I think I need to see if I can refactor the existing Control Vector code from llama.cpp to just dump the hidden states and then we can do pretty much whatever we want with them outside in Pytorch, etc.

jukofyork

Owner 6 days ago

I think I've managed to train the dark_tetrad control-vectors for GLM-4.6, and it's reasoning remains coherent with the darker perspective, but I'll need to do some more testing with both enable_thinking:true and enable_thinking: false first.

This is the first time I've seen a model "reasoning" with these world views. I wonder if it'll score differently on https://trackingai.org/political-test

Yeah, I think some kind of workflow that breaks out as soon as the reasoning ends and then runs with different parameters for the post-reasoning response would be useful.

There was some discussion about using different parameters for the response on llama.cpp, eg: use recommended sampler settings for the reasoning but then temperature = 0 for the response, etc

That said, I'm not sure if anyone else even likes this model for writing. I've been running it daily for over a week so far.

I think I'm blocked making repos public now due to the size restrictions so I'll have to go through and nuke my models before I can upload them.

Yeah, this sucks and from discord it sounds like unless you pay the $10/month for "pro" they just ignore your requests (luckily I still have loads of space after deleting all my crap a while back).

Downtown-Case

5 days ago

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a

This kinda sounds like a think block template.

And something that can be enforced with GBNF + prefill.

I wonder if it would be better to rig a thinking model, especially one like GLM, to do that inside its thinking block, then “synthesize” a final answer by drawing from the creativity of its previous ones.

Downtown-Case

5 days ago

Also I get the HF limits. There are way too many titleless, cardless uploads clogging up the site to the point they even clog up search.

rookaw

5 days ago

@jukofyork What training data are you using for the command-a-writer? You mentioned it has paragraphs. Have you published the dataset anywhere?

I think comparing the training dataset to the final model will help us learn a lot (and save me from mistakes and wasted training cycles myself since I am planning to do something similar at a tiny scale)

jukofyork

Owner 5 days ago

@jukofyork What training data are you using for the command-a-writer? You mentioned it has paragraphs. Have you published the dataset anywhere?

I think comparing the training dataset to the final model will help us learn a lot (and save me from mistakes and wasted training cycles myself since I am planning to do something similar at a tiny scale)

I can't release the dataset of actual books I've used for fear of copyright claims, but I have uploaded a version using books from Project Gutenberg:

https://huggingface.co/datasets/jukofyork/gutenberg-fiction-paragraphs

and the "slop" dataset:

https://huggingface.co/datasets/jukofyork/slop-fiction-paragraphs

I could have used the Gutenberg dataset for my model, but wanted to avoid as much "olde ye" type writing bias as possible for now.

rookaw

5 days ago

Thank you, I'm actually more curious about how you are piecing together the training dataset than the story content itself. For example, are you fine-tuning it using a chat template, and if so, did you have to create user instructions for each paragraph? If so, it would be nice to know which process you use for making the instructions, and maybe seeing the dataset for that. I'm currently trying out having an LLM write prompts for stories, but as you can imagine, it often focuses on the wrong things. Or are you doing continued pre-training on the model where you are just training for completion using the book texts and relying on the existing model instruction capabilities?

Or, am I wrong both ways and this is simply trying to control for slop in the output?

I am interested especially in learning more about how to create more complex and diverse instruction datasets with creative outputs as the main goal. Reading the HelpSteer2 nvidia paper https://arxiv.org/pdf/2406.08673 was quite inspiring. Only 10,000 high quality response pairs in the dataset and they got a top reward model out of it. It gives me hope that we can fine tune a dumb but not overfitted base model on creative writing outputs and get a decent result.

BigHuggyD

3 days ago

You are a helpful assistant. For each query, please generate a set of five possible responses,

I feel like this was a diabolical paper released by pay-per-token services...

rookaw

2 days ago

If anything the verbalized sampling paper at least reminds us that sorely overtrained models have increased confidence/mode collapse like in the earth map post. It points out that this confidence boost might have less of an effect on "list-level prompts". While that doesn't solve much of the slop stuff we are going after, it's an available tool if we need to use an overtrained model.

Downtown-Case

2 days ago

Are you speaking of this?

https://arxiv.org/abs/2510.14901

jukofyork

Owner about 12 hours ago

Sorry, been busy the last couple of days and not had chance to reply:

Thank you, I'm actually more curious about how you are piecing together the training dataset than the story content itself. For example, are you fine-tuning it using a chat template, and if so, did you have to create user instructions for each paragraph? If so, it would be nice to know which process you use for making the instructions, and maybe seeing the dataset for that. I'm currently trying out having an LLM write prompts for stories, but as you can imagine, it often focuses on the wrong things. Or are you doing continued pre-training on the model where you are just training for completion using the book texts and relying on the existing model instruction capabilities?

Or, am I wrong both ways and this is simply trying to control for slop in the output?

I'm trying to introduce as little bias as possible for now, so am just training on paragraphs between <EOT> type tags (I think for the command-a version I used the assistant response tags though - need to double check).

Then each token in each paragraph gets assigned a class of -1 for the paragraphs of "slop" and +1 for the paragraphs of "real authors' text".

Then we flip the signs of the Control Adapter using the class and everything then works as normal using Cross-Entropy Loss, etc.

By doing this you should in theory introduce as little bias as humanly possible, but only if you can reduce any unwanted patterns in the two datasets (or else these will be learnt instead).

I am interested especially in learning more about how to create more complex and diverse instruction datasets with creative outputs as the main goal. Reading the HelpSteer2 nvidia paper https://arxiv.org/pdf/2406.08673 was quite inspiring. Only 10,000 high quality response pairs in the dataset and they got a top reward model out of it. It gives me hope that we can fine tune a dumb but not overfitted base model on creative writing outputs and get a decent result.

Yeah. I'm trying to avoid using any instructions for now as this will introduce extra bias (especially if they are synthetically generated by other LLMs).

jukofyork

Owner about 12 hours ago

Are you speaking of this?

https://arxiv.org/abs/2510.14901

This is an interesting paper, but not had time to fully read it yet.

Downtown-Case

about 10 hours ago

•

edited about 10 hours ago

Y'all might be interested in the noise around REAP too:

https://github.com/CerebrasResearch/reap

Basically Cerebras found pruning 25%-50% of the experts for GLM Air works when selected cleverly, but what’s also interesting is that it takes a specific dataset (CodeAlpaca by default) to make the determination, and the losses are oddly and conspicuously domain specific.

It makes me wonder if a prune could be “optimized” for more creative tasks, and how the prunes would affect the logprob distribution. It would also make GLM Air easier to train, and fast to run on 32GB RAM systems.

There’s some folks already trying it with big GLM.

rookaw

about 9 hours ago

•

edited about 9 hours ago

Are you speaking of this?

https://arxiv.org/abs/2510.14901

@BigHuggyD was referencing "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity" , and I responded to that. But this (Reasoning with Sampling) was also an interesting read, thanks for sharing. It essentially says a common misconception is that low-temperature sampling is sampling from the power distribution, when in fact it isn't, and this results in cases where tokens with many (but each low-likelihood) paths are upweighted over tokens with few (but high-likelihood) paths.

Also thank you for the explanation @jukofyork !

jukofyork

Owner about 8 hours ago

Y'all might be interested in the noise around REAP too:

https://github.com/CerebrasResearch/reap

Basically Cerebras found pruning 25%-50% of the experts for GLM Air works when selected cleverly, but what’s also interesting is that it takes a specific dataset (CodeAlpaca by default) to make the determination, and the losses are oddly and conspicuously domain specific.

It makes me wonder if a prune could be “optimized” for more creative tasks, and how the prunes would affect the logprob distribution. It would also make GLM Air easier to train, and fast to run on 32GB RAM systems.

There’s some folks already trying it with big GLM.

I'm too busy to try it, but I suspect using an English dataset and a Chinese dataset and then using:

score_delta = score_english - score_chinese

as the pruning criteria would work really well (considering the vast amount of Chinese data that has gone into these models that is likely pretty redundant if you don't speak Chinese).

jukofyork
/

creative-writing-control-vectors-v3.0

“The doom lies in yourself, not in your name.”

A `step 601` preview - all with `temperature = 0`:

“The doom lies in yourself, not in your name.”

A step 601 preview - all with temperature = 0:

A `step 601` preview - all with `temperature = 0`: