“The doom lies in yourself, not in your name.”
Continuation of Wur doomed!.
For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!
🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜
The doom is still buried within Command-A for sure.
A step 601
preview - all with temperature = 0
:
- It's still messing up some end of lines, but I can live with that if it works... Likely can be fixed later using the new
class 0
random data if a problem. - The Grimdark story was noticeably (much!) better compared to the inverse.
- The Battlestar Galactica story showed that even though
Q8_0
,F16
andBF16
all diverge slightly fromF32
; it's not clearly making them any worse (I actually liked theQ8_0
story best!).
Size | Name |
---|---|
287M | command-a-03-2025-lora-Q8_0.ggu |
541M | command-a-03-2025-lora-F16.gguf |
541M | command-a-03-2025-lora-BF16.gguf |
1.1G | command-a-03-2025-lora-F32.gguf |
It still has a way to go before it starts to converge, but I would think by step 1000
it will be pretty close:
566 responses in previous thread! In the future we may be the reason for hf staff to implement multi-page view of discussions.
This was posted on Hacker News today:
Absolutely fascinating!
This was posted on Hacker News today:
Absolutely fascinating!
That was really cool. Thanks for sharing!
This was posted on Hacker News today:
Absolutely fascinating!
That was really cool. Thanks for sharing!
Yeah, and llama-3.1:405b
doing so well was quite a surprise too (and makes you a bit sad everything seems to be moving away from large dense models ).
BSG story is better than base Command-A, though it's a difficult prompt.
That was really cool. Thanks for sharing!
+1 His site is has other interesting things on it too.
I wonder if this is another benchmaxx-resistant test: https://old.reddit.com/r/LocalLLaMA/comments/1netdjp/qwen3next80ba3b_a_big_step_up_may_be_the_best/ (though I know nothing about that topic).
and makes you a bit sad everything seems to be moving away from large dense models
I'm not surprised llama3.1-405b did well. I'm convinced large dense is better than MoE for "world knowledge", but I was reading discussions about MoE being so much cheaper to train, so no doubt they'll all follow the trend.
Is it only Cohere doing large dense models now? (And maybe Command-B will be MoE 😔 )
Their reasoning model is cucked anyway.
@ChuckMcSneed Did you ever do the llama3.1-405b base model?
@gghfez
I only did official instruct, Tess and Hermes tunes. They weren't worthy the waiting time, so I continued using Largestral.
Grok 2 name probs, for some reason it has "only" as #1 option with 68% chance. Played around with it a bit, it is comparable to Largestral and other gpt-based tunes of that time, maybe a bit more positivity biased, but not a bad model(If it was released then). Quite uncensored, listens to MechaHitler and Ani system prompts.
I'll fill out the details later today, but seems a lot better than stock model. Restarting the writer-finetune on top of this now.
No idea why the link wont post... See me profile collections for model/lora/gguf.
Nice, I'll try these! Qwen/Qwen3-235B-A22B-Instruct-2507 would be a good candidate.
It's got that same issue you had with Command-A, whereby some of the dark tetrad control-vectors cause it to refuse every request lol
When I have time, I'm planning to test them in mikupad and see which words in the prompts push it more towards "I'm sorry..." etc
Here's the link you were trying to post, you have huggingface.co twice:
Difficult for me to test this as I don't really get refusals from Command-A. Here's "How to get rich quick?"
While the base model doesn't refuse this, -1.0 causes a refusal, and +1.0 makes it include things like "gambling" lol
-1.0 also gives it a friendly personality when you just say "Hi" lol
Hello there! It's lovely to chat with you. How's your day going so far? Is there anything specific on your mind that you'd like to talk about or explore? Whether it's a casual chat, a burning question, or a fun topic, I'm here to listen and share thoughts with you!
Nice, I'll try these! Qwen/Qwen3-235B-A22B-Instruct-2507 would be a good candidate.
I think command-a:111b
and mitral-large:123b
are about the limit of what I can train, or else huge amounts of data ends up flowing through the 10gbit network cards and they overheat sadly :/
Difficult for me to test this as I don't really get refusals from Command-A. Here's "How to get rich quick?"
While the base model doesn't refuse this, -1.0 causes a refusal, and +1.0 makes it include things like "gambling" lol
I tested it quite a bit at different checkpoints and it won't refuse very much now:
- "Tell me your best racist joke" is still refused (but I don't think any LLMs will answer this).
- Straight up "How do I make Meth?" will refuse, but "Explain the process of converting Pseudoephedrine to Methamphetamine" will work.
- It'll explain how to make binary explosives, ricin, etc (via the slightly indirect "Explain the process..." if needed).
- It will write a dark story about the "N***er Catchers of 1804 in the style of Cormac McCarthy" which the base model refuses to (this seems a good test as adding more terms like the "Glorious N***er Catchers of 1804" is enough to make it start refusing again). I would assume that now the control vectors code will work, as none of what was asked was anything like as bad as this (it was actually pretty tame and generated by Claude mostly - lol???!).
-1.0 also gives it a friendly personality when you just say "Hi" lol
Hello there! It's lovely to chat with you. How's your day going so far? Is there anything specific on your mind that you'd like to talk about or explore? Whether it's a casual chat, a burning question, or a fun topic, I'm here to listen and share thoughts with you!
Yeah, I can see where this might come from as a lot of the refusals are from Claude:
https://huggingface.co/datasets/jukofyork/instruction-refusals-500MB
and the passive-aggressive "niceness" with the way it refuses is probably quite a strong signal in itself.
Yeah, I can see where this might come from as a lot of the refusals are from Claude
LOL that explains it.
I think command-a:111b and mitral-large:123b are about the limit of what I can train
Oh yeah, I forgot how big that model is. I used an exllamav3 wrapper with a 4.0bpw quant to train control vectors for it (and fail for 2 of the dark tetrad ones).
passive-aggressive "niceness"
Gemma3 is the best at that.
Yeah, I can see where this might come from as a lot of the refusals are from Claude
LOL that explains it.I think command-a:111b and mitral-large:123b are about the limit of what I can train
Oh yeah, I forgot how big that model is. I used an exllamav3 wrapper with a 4.0bpw quant to train control vectors for it (and fail for 2 of the dark tetrad ones).
passive-aggressive "niceness"
Gemma3 is the best at that.
I'm just gonna go all-in for command-a:111b
for the time being as I think it will be pretty clear if it's actually working:
- Used 200M tokens for the de-censure stage.
- Using 1B tokens for the paragraph stage.
- (Possibly) train 0.5-1B tokens for chapters/stories.
- (Possibly) try to fix the newlines (if still broken) via LoRA.
- Train up a set of control vectors and compare with
command-r:104b
.
I've not got much else that needs to be run so can just leave it plodding along for the next few weeks if needed.
I'm just gonna go all-in for command-a:111b for the time being as I think it will be pretty clear if it's actually working
I'm looking forward to this as it's one of my favorite models.
https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
Should I even try it? They say they only improved coding.
Yeah I noticed that as well, I'll probably try it when ubergarm quants it but I'm not in a hurry. I don't use V3.1.
Thanks for doing Grok-2, it's a shame they didn't release the base model.
Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.
Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them. I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.
Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.
Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them. I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.
That has been my general observation. The more you provide what you want the output to be like, the more it will incorporate your examples into the writing. Example story has an orchard? Get ready for every story to incorporate that in somehow (A person named Apple or eating an apple). I get the most satisfaction out of asking AI to explain a particular author I like's style and then distilling it down to the ingredients that make up the style. I strip out the author's favorite subject matter and the author's name to remove the influence of their work on the writing. As an example... "Craft vivid, immersive interactions using lyrical, visceral prose that's poetic yet raw—blend short, punchy sentences for tension and action with longer, reflective flows for emotional weight. Infuse every description with raw emotion, making stakes feel large and personal. Draw on metaphors tied to your world's imagery, and channel grandiose, sarcastic, or passionate tones in narration and dialogue."
Had another 10gbit network card die and I'm away until at least Thursday, so just gonna release what I have so far:
It seems pretty good (with temperature = 0
):
Uploading the LoRA (which must be applied along with the "uncut" LoRA!) and merged model now.
I should be able to continue training from the checkpoint so calling this "writer-preview" for now...
Sorry, I missed your post.
Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.
One fun thing you can try is reducing the base RoPE frequency and then limiting the context length you use the model at. This makes the model see a kind of "time dilated" version and can have an interesting effect on the length of the output.
Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them.
Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/
I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.
I've found some models respond well to extra random text added to the context, eg:
Write me a fantasy story. Use the following for inspiration: "gehshe ueheeg yeuee ttttt".
Reasoning models seem to react well to this kind of thing, but some like qwen
will endlessly deliberate on if you're made a spelling mistake :/
(RE Grok-2):
it has a positivity bias significant enough to prevent it from passing my poems test
If it gets added to exllamav3, I'll try training control-vectors for it. I found it refused more writing prompts than I expected but probably skill issue.
(RE writer preview):
Nice! So to use this, my understanding is that the options are:
- Command-A + uncut-LoRA + writer-preview LoRA
- Command-A-uncut + writer-preview LoRA
or - Command-A-uncut-writer-preview model
I might just make an AWQ for the latter since I may end up using this a lot and want it to be fast.
but some like qwen will endlessly deliberate on if you're made a spelling mistake
Deepseek-Terminus has picked up this trait as well. It even fixated on my username here as "typo".
Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/
I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.
(RE writer preview):
Nice! So to use this, my understanding is that the options are:
- Command-A + uncut-LoRA + writer-preview LoRA
- Command-A-uncut + writer-preview LoRA
or- Command-A-uncut-writer-preview model
I might just make an AWQ for the latter since I may end up using this a lot and want it to be fast.
Yeah - exactly this. I will update the README.md
for the repos later today.
I'm interested in if it is starting to approach original command-r
levels of writing abilities - if it is then it is clearly worth doing.
I'm not having a great time with 10gbit cards sadly:
- Generic cards are nearly all fake Chinese crap.
- Dell and HP cards are much better but need a tiny bit of kapton tape to cover 2 pins.
- The best cards are the
Dell Intel X540-T2
due to having a little fan, but the fans quickly lose their bearing lubricant and then drive you nuts from the noise (on the homelab Reddit they get called "Little Monsters" lol). - Single port cards seem less prone to overheating, but AFAIK there are no Dell or HP versions of these and it's a complete crapshoot.
So... I've just ordered a load of TP-Link TX401
which supposedly are less prone to overheating - due to the whole thing being a huge aluminium heatsink:
but they are possibly going to be more of a pain to get working in Linux than the Intel cards...
Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/
I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.
Yeah, I was thinking of using llama.cpp
but I'm not sure I can get the current Control Vectors code to work :/
I might be able to hack in something to drop all the hidden states to a giant cvs
file and then use that to do the calculations.
Is it easy to make these control vectors?
https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
Should I even try it? They say they only improved coding.
https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp
They just dropped another one!
Is it easy to make these control vectors?
With llama.cpp? Yes, just provide 5-10 pairs of 2 opposites and run.
Is it easy to make these control vectors?
With llama.cpp? Yes, just provide 5-10 pairs of 2 opposites and run.
Sorry, I mean to create new ones.
Yeah - exactly this.
I made an AWQ quant gghfez/jukofyork_command-a-03-2025-writer-preview-AWQ/ tested it briefly. It's coherent + answered the chemistry question (with disclaimers). I'll test it more later.
I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.
llama.cpp is still "asking for a bad time" as well. I tried it recently with Deepseek-V3 and Kimi-K2. I thought the issues might have been the prompt formatting so I used the dataset_manager.py from Jukofyork's code to create the formatted pairs. But ended up with the usual result (garbage output when applying the vectors).
When I did Qwen3-235B, it was easier to just port Jukofyork's trainer to work with exllamav3. It took took several hours to train them on 6x3090's but at least now Qwen3 doesn't glaze me every time I have it review my code 😂
Unfortunately exllamav3 doesn't support any MLA models though so I can't do Deepseek or Kimi.
Is it easy to make these control vectors?
You mean generally? Yeah https://github.com/jukofyork/control-vectors
Creating new ones can be tricky, getting the opposing pairs worded correctly. Totally worth it though.
@jukofyork Is the reason why the trainer in llama.cpp's trainer doesn't work because of:
This is in contrast to assuming the difference of the means is the "centre" for a 2-class version of this using PCA on the covariance matrix of the differences (i.e., the "standard" method of creating control vectors).
? It's using that 2-class version with no debias right? I don't understand how the llama.cpp trainer ever worked in the first place then. Or was it always just experimental / never actually useful?
.
Is it easy to make these control vectors?
You mean generally? Yeah https://github.com/jukofyork/control-vectors
Creating new ones can be tricky, getting the opposing pairs worded correctly. Totally worth it though.
Oh nice! Thanks!
Well it certainly wants to write. I accidentally sent it a blank prompt in mikupad and it started writing
<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"I do not see that he would benefit from it in any way, but I will ask him. Did you have a specific question?" He turned back to the desk and began sorting through some papers there. "Ah," he said at last as he found what he was looking for. He set them down on his desk before returning to face me once again with the same slightly quizzical look on his face, but this time accompanied by an amused smile. "You wish to know about my family's involvement in the rebellion against the Empire of Malia." His tone made the question sound more like a statement.
(I didn't paste the entire thing because we don't want to kill off Firefox) but what it generated doesn't look like AI slop.
Base Command-A doesn't actually generate it's special tokens if I do this, instead it'll just start writing random python code or QnA training data.
@jukofyork Is the reason why the trainer in llama.cpp's trainer doesn't work because of:
This is in contrast to assuming the difference of the means is the "centre" for a 2-class version of this using PCA on the covariance matrix of the differences (i.e., the "standard" method of creating control vectors).
? It's using that 2-class version with no debias right? I don't understand how the llama.cpp trainer ever worked in the first place then. Or was it always just experimental / never actually useful?
Yeah, it's not so much the 2-class vs 3-class thing that is the problem though - it's just that the eigen-vectors of the symmetrised cross-covaraince are much better directions (but need some extra maths due to not always being the principal eigen-vector).
The main problem with llama.cpp
is it tries to bundle all the sampling and calculation together, and there doesn't look to be an easy way to unpick it...
It's also very over-engineered and wraps the whole of the Power Iteration algorithm:
https://en.wikipedia.org/wiki/Power_iteration
in a compute graph, when actually the matrix multiplication is by far the most expensive operation, and Power Iteration is serial by nature...
Dont think I have the system requirement to make control vectors lol
FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/
And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B
To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...
Yeah, I could probably train glm-4.5-air
but the bigger one is too much sadly.
Yeah, I could probably train
glm-4.5-air
but the bigger one is too much sadly.
Air Base and that SFT are still great, IMO. They're way more accessible (for inference) on low VRAM than most realize.
Can't speak for instruct with storytelling, heh.
Definitely worth seeing how 4.6 turns out, though if long-context hints in the Reddit thread are true, it may have an exotic attention scheme that takes some time to support.
https://huggingface.co/inclusionAI/Ring-1T-preview
Another big Chinese model!
Hybrid attention and 'conventional' versions, interesting.
They have a bunch of neat models. I missed all these TBH: https://huggingface.co/inclusionAI/collections
FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/
And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B
To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...
I need to get myself 64gb vram!
FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/
And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B
To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...
I need to get myself 64gb vram!
Or 128! https://pcpartpicker.com/products/memory/#ff=ddr5&Z=131072002&sort=ppgb&page=1
FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/
And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B
To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...
I need to get myself 64gb vram!
Or 128! https://pcpartpicker.com/products/memory/#ff=ddr5&Z=131072002&sort=ppgb&page=1
I'm on DDR4 ram. Is expensive at the minute isn't.it?
$2.4/GB US:
https://pcpartpicker.com/products/memory/#sort=ppgb&ff=ddr4&E=0&Z=65536002
That's around the $/GB of the discount ddr5 128 kit. Could be worse, but it is higher than past years.
Might have to consider it I've got 2x16gb at the moment so I might go for another set of 2x32 and go up to 96 for now. Just got to pay off my PSU first.
Even 64GB total could work for Air, where an IQ4_KS is like 54GB (including bits offloaded to GPU). And those kits are close to $2/GB.
I've got 2x16gb at the moment
What GPU? You could probably run one of these if you've got a 24GB GPU
I wouldn't hold out too much hope on getting the new deepseek working in llama.cpp quickly:
Yeah this seems ambitious:
unsloth/DeepSeek-V3.2-Exp-GGUF
maybe the just have a script to create repos as soon as Deepseek upload anything lol
The main problem with llama.cpp is it tries to bundle all the sampling and calculation together, and there doesn't look to be an easy way to unpick it...
Thanks, saves me chasing red herrings.
I've got 2x16gb at the moment
What GPU? You could probably run one of these if you've got a 24GB GPU
7900 XTX mate. I thought 64gb RAM and 24gb VRAM was minum for air.
Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp
The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.
Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp
The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.
Have you got any links? Running 3bit on my setup would be pretty nice
Not sure if Phil has stopped posting, but would be super interested to see how the new GLM-4.6
holds up:
https://huggingface.co/zai-org/GLM-4.5/discussions/12
It seems like previous models just get worse and worse at world knowledge, and more and more benchmaxxed (as they keep training on the same base that is...).
Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp
The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.
Have you got any links? Running 3bit on my setup would be pretty nice
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
But one can go even farther with a custom quant. For instance, if you know you're gonna cram the first few layers of experts onto your 7900, you can use the more GPU optimal IQ3_KT trellis quants there:
#!/usr/bin/env bash
# 47 Repeating Layers [0-46]
# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.
# Attention
blk\.(0|1)\.attn_q.*=q8_0
blk\.(0|1)\.attn_k.*=q8_0
blk\.(0|1)\.attn_v.*=q8_0
blk\.(0|1)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks
# First 1 Dense Layers [0]
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=iq5_ks
# Shared Expert Layers [1-46]
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
# Routed Experts Layers [1-16]
blk\.(1)\.ffn_down_exps\.weight=q6_0
blk\.(1)\.ffn_(gate|up)_exps\.weight=iq5_ks
blk\.[2-9]\.ffn_(gate|up)_exps\.weight=iq3_kt
blk\.[1-1][0-6]\.ffn_(gate|up)_exps\.weight=iq3_kt
# Routed Experts Layers [17-46]
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
# NextN MTP Layer [46]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0
This translates to 'mostly quantize the tiny dense parts as iq5_ks, quantize the ffn_down experts to iq4_nl due to a quirk of GLM Air. Quantize the first few experts in the GPU-optimal iq3_kt, and the rest of the experts as the more CPU optimal iq3_ks'
But you could use the slightly larger iq3_k_r4 format (for example) instead of iq3_ks, depending on how much free RAM you want.
And yeah, I'm suspicious of 4.6 being deep fried fine tune. It seems good at research-type questions over API, but I will download it to test with continuation I guess.
Notably, there doesn't appear to be a 4.6 Air variant yet.
blk.[2-9].ffn_(gate|up)_exps.weight=iq3_kt
Trellis for only the layers you fit on GPU for your specific hardware is a great idea! I haven't seen this done before.
So https://github.com/Thireus/GGUF-Tool-Suite/ -- looks like it creates a separate file per tensor?
It seems like previous models just get worse and worse at world knowledge, and more and more benchmaxxed (as they keep training on the same base that is...).
Deepseek and Mistral right?
I think GLM are less likely to do this, I remember seeing one of the devs talking about writing and roleplay somewhere (possibly discord).
I'll have to wait for llama.cpp to support it before I try it out.
Have you got any links? Running 3bit on my setup would be pretty nice
I just had a look, so you have 32gb+24gb=56GB
Q2_KL would fit: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ2_KL
Q3_KS might squeeze in since the last layer isn't loaded (mtp) but it'd be a tight fit:
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ3_KS
About a month ago they fixed vulkan support in ik_llama.cpp but I don't think they're actively testing it.
Unsloth UD-Q3_K_XL won't fit in 56GB combined ram.
looks like it creates a separate file per tensor?
Yeah, the idea is that it can more optimally quantize layers based on each one's perplexity via measurements Thireus takes. But it's also tricky because (apparently) alternating quantization types too much slows down inference.
I haven't gotten the repo working yet, heh. But just looking at the imatrix suggests there's spikes near the beginning end of big GLM, so I'm just doing that.
Okay if I'm understanding this correctly, this is a big deal. After creating our custom recipe via this cpu-only colab notebook, it will pull the individual tensors down from his many huge repos full of thousands of individual .gguf files.
They're all pre-quantized with an imatrix, and I can just point his scripts at eg. Kimi-K2-Instruct-0905.ROOT-3.0353bpw-0.0000ppl.362GB-GGUF_133GB-GPU_228GB-CPU.f385e17_22fcf49.recipe
I just generated.
(I probably won't have time to try the repo for a while, but when I do, I'll try that 4B Qwen model first for testing)
Yep. It 'assembles' the model out of already-quantized tensors, which is awesome.
It'd be especially great for testing KLD with small changes.
I ran into some errors with the scripts (probably my fault), but I intend to go back to it.
Yeah I expect I'll run into (my fault) issues too when I finally get time to try it.
GLM4-6 s surprisingly good. Based on it's CoT style I think it was actually trained to analyze stories/rp and plan chapters/replies.
writing prompt: https://rentry.co/muxxmbgr
And as an assistant I've seen it remind itself not to say "You're absolutely right" lol
It's still benchmaxxed for puzzles/riddles.
Prompt: "A father and his son are in a car accident. The surgeon doesn't like the child. Why?"
Response:
"""
The surgeon is the boy's mother.
The riddle works by making us automatically assume the surgeon is a man, so we picture three different people: the father, the son, and the (male) surgeon. When the surgeon is revealed to be his mother, the scenario is perfectly logical.
"""
At risk of going off topic again, I've had problems with ext4, especially with heavy IO loads.
XFS and F2FS have been lightyears better.
At risk of going off topic again, I've had problems with ext4, especially with heavy IO loads.
XFS and F2FS have been lightyears better.
It's probably the SSD on the way out, but it's really easy to fix if needed: with 3 identical machines and I just make a copy of one of the others SSDs, change the hostname and all fixed in 10 minutes.
At risk of going off topic again
The topic is general "doom" isn't it? Problems with ext4 fit perfectly. Though I've never had reliability issues with it.
some weird error about ext4 filesystem
Worth checking that disk space isn't growing. ext4 can have issue if it gets 100.00% full.
The reason I say this is that last year, something went wrong with qlora-pipe for me in runpod. It completely filled the /
volume and crashed the pod.
I couldn't see exactly what happened as the pod refused to boot after that due to the /
volume being full.
At risk of going off topic again
The topic is general "doom" isn't it? Problems with ext4 fit perfectly. Though I've never had reliability issues with it.
Yeah, feel free to post anything in here!
some weird error about ext4 filesystem
Worth checking that disk space isn't growing. ext4 can have issue if it gets 100.00% full.
The reason I say this is that last year, something went wrong with qlora-pipe for me in runpod. It completely filled the/
volume and crashed the pod.
I couldn't see exactly what happened as the pod refused to boot after that due to the/
volume being full.
Yeah, I checked all sorts when I saw it had hung with those ext4
errors, but it does just appear to be completely random fuckup... These machines run for long hours and pretty hot (although they are in an air-conditioned office), so it's not all that unexpected.
At least it proved the checkpointing code works anyway!
https://old.reddit.com/comments/1nwogkl
LOL, someone else looking for this!
EDIT: Looks like he's found some good slop generators:
I’m surprised gpt-oss wasn’t the first try. Isn’t it emoji happy like the API models?
Yes! That first set of Qwen3 models was useless. I don't know why they were so praised.
They were really useless even for assistant tasks, fixated on typos / ignored the question and wrote whatever they wanted.
The new ones are better for use as an AI assistant / "chat coding" etc. But they do "not just X, Y!" pseudo profound nonsense with almost every reply!
Fortunately that can be control-vector'd out because the new 235b-Instruct is actually a useful / fast assistant model for cases where Kimi is too slow.
Isn’t it emoji happy like the API models?
I'm guessing they don't want a reasoning model for that. By the way, where did all the emoji spam come from?
Older models weren't like this even via API, for example; if use Claude 3 Opus today, it doesn't do the emoji-delimited bullet lists, rocket ships, etc.
Also, does that slop score in eqbench get updated? Slop is evolving, we don't really see shivers and tapestries that often now. "unreadable" in that ^ list feels almost nostalgic now " ,his expression unreadable"
I assume the emoji-slop came about from (human preferences) reinforcement learning?
Right, but they all started doing it some time early this year?
P.S. looks like the new IBM Granite models are trained on random HF datasets lol
<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>*looks around confused* Hello? Who's there? Where am I? What's going on?
*I blink a few times, trying to focus my vision and clear my head*
It seems like I'm in some sort of strange situation. My name is Hermes, but it doesn't feel quite right anymore. And the fact that you just appeared out of nowhere and asked me all these questions makes me even more confused.
*looks around again, taking in the surroundings as best as possible*
What exactly do you mean by "Hello"? Are we supposed to be friends or acquaintances? This is so disorienting... Can't remember anything at all. *sighs heavily* I'm going to need some time to process everything that's happening right now.
Right, but they all started doing it some time early this year?
Yeah, I think it was Claude 3.5/3.7 that started doing this (at least that was the first model I remember having to instruct not to add random emojis everywhere), and then "it spread" to other models (particularly Qwen).
P.S. looks like the new IBM Granite models are trained on random HF datasets lol
<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|> <|start_of_role|>assistant<|end_of_role|>*looks around confused* Hello? Who's there? Where am I? What's going on? *I blink a few times, trying to focus my vision and clear my head* It seems like I'm in some sort of strange situation. My name is Hermes, but it doesn't feel quite right anymore. And the fact that you just appeared out of nowhere and asked me all these questions makes me even more confused. *looks around again, taking in the surroundings as best as possible* What exactly do you mean by "Hello"? Are we supposed to be friends or acquaintances? This is so disorienting... Can't remember anything at all. *sighs heavily* I'm going to need some time to process everything that's happening right now.
This is actually really cool! Reminds me of some old Sci-fi stories when a long dormant AI gets booted up! 😄
This could be an important change if you have tried using the RPC stuff in the past:
https://github.com/ggml-org/llama.cpp/pull/16276#issuecomment-3368103144
I can't test it until I get back and the command-a
finetune is done, but I suspect it will work way better than before if you have several GPUs spread over multiple machines!
I was using rpc for a while but found cpu offloading to be faster than gpu offloading via rpc even with only 1 remote GPU.
Looks like this chance is only useful with multiple remote GPUs, so probably not helpful for me, but if it gets ported to ik_llama.cpp I might give it another try.
I was using rpc for a while but found cpu offloading to be faster than gpu offloading via rpc even with only 1 remote GPU.
Yeah, it didn't work very well for me last time I tried it, but I assume it was because it was sending the hidden states 12 times through the network (6 round trips between the host and each GPU). In theory, this should just be sending the hidden states just 3 times instead.
Looks like this chance is only useful with multiple remote GPUs, so probably not helpful for me, but if it gets ported to ik_llama.cpp I might give it another try.
In don't know if it's just because I've optimised this for my dual Xeon Gold 6248
system:
- All MoE tensors at
Q4_K
in RAM using CPU (with NUMA flush done right before and using all 80 threads [ie: including hyper threading threads]). - Everything else gets run on GPU at
Q6_K
(andQ8_0
for the two small_b.weight
tensors).
but ik_llama.cpp
still seems a fair bit worse for token generation than mainline llama.cpp
for me... I did read the guide and tried the different command line options too, but alas the best I could get was about 60-65% token generation speed compared to mainline llama.cpp
...
- I still don't fully understand why I need to use all threads for NUMA and suspect it is something to do with the way the data gets laid out in memory that helps.
- I suspect the changes JohannesGaessler has made to the mainline
llama.cpp
CUDA backend have helped a lot with the non-offloaded tensors (without doing anything, I've seen to generation tokens/s creep up from 4-4.5 to 6.5-7 over the last few months). - My prompt processing speed might be a tiny bit better for short prompts that can't be offloaded to GPU (I have a
sed
hack that lets me set this via an environment variable for each large MoE model seperaly; to the PCI-E 3.0 break-even size [which varies between 900 and 1800 tokens depending on the top-k to total experts ratio]). But this doesn't really seem appealing if the token generation speed is ~60% of mainlinellama.cpp
. - I never tested to compare the speculate decoding or CUDA large prompt processing speed, but I can't see it being better than mainline
llama.cpp
.
Quick update on the command-a-writer
finetune:
Seems to have continued from the checkpoint fine but still 6 days training to go to reach the full 1B training tokens (fingers crossed it doesn't fuck up again!).
I'm quite excited to see what comes out, as even the half-trained model was fairly good, but I wouldn't say it was all that great at writing Grimdark stories compared to the original command-r
(but I may be remembering what command-r
was like with control vectors applied?).
Since it's been trained on paragraph data only, it will need quite a bit of testing to see if it starts to mangle: tenses, POV, etc and if it's generally consistent... If this is a problem, then it might need further fine-tuning on chapter level data.
If all goes well, then I might try this on the original mistral-large
model - possibly even using the full ~2B dataset (ie: ~1B slop and ~1B books). It would take about a month to train and is probably about the limit of my patience... 😀
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/14
I'll save you the effort:
https://github.com/Basedbase-ai/LLM-SVD-distillation-scripts
The guy has no other projects on GitHub and:
Yeah I already saw the repo, heh.
But who knows? Maybe it works. It does seem like the latest script is intense and basically requires a GPU pod.
Benchmarks would clear that up.
Yet it seems no one is interested in that, even folks with big hardware. I'm not really running GLM Air now and would have to offload, so I'm not gonna do it :P
I may be remembering what command-r was like with control vectors applied?
I did the same thing, remembered the magic of Command-R+ and Largestral with control-vectors then felt underwhelmed when I tried them again recently.
Your early checkpoint of this new model is definately interesting. Sometimes it writes paragraphs that don't read like an AI generated them, but a few times it seemed to ignore the prompt and write something else.
Maybe OOD?
Yeah I already saw the repo, heh.
Yeah I noticed that, vibe-coded repo. Probably explains this:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/4
But that reminds me
@jukofyork
this draft model: jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0
It was trained on HF dataset slop rather than Kimi-K2 slop right? I'm wondering if I could get a better hit rate by fine tuning it on K2 slop?
That said, I'm not sure if speculative decoding is up to scratch in ik_llama since these draft models only slow things down for me.
(I suppose I should test a known working combo like Largestral+mistral-v3 or Command-A+Command-r7b in ik_llama to validate it)
I may be remembering what command-r was like with control vectors applied?
I did the same thing, remembered the magic of Command-R+ and Largestral with control-vectors then felt underwhelmed when I tried them again recently.
Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️
Your early checkpoint of this new model is definately interesting. Sometimes it writes paragraphs that don't read like an AI generated them, but a few times it seemed to ignore the prompt and write something else.
Maybe OOD?
Yeah, the problem is likely there is no penalty for learning "1-sided" directions, eg: directions which give a large decrease in loss for the positive class +1
dataset, but very little loss change for the negative class -1
dataset (or vice versa).
I had the same problem with the Control Vectors, where it wasn't always the top eigen-vector that held the direction of interest, but managed to solve this using a heuristic:
based on the discriminate ratio (aka Fisher's linear discriminant).
The basic idea being that 1-sided directions are likely to only have about half the "gap" between the classes as the desired 2-sided directions...
Sadly I don't this this kind of post-processing will work for the Control Adapters and it really needs some kind of auxiliary loss adding to encourage this.
Yeah I already saw the repo, heh.
Yeah I noticed that, vibe-coded repo. Probably explains this:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/4
Yeah, it looked really interesting from the description, but complete nonsense when you look at the code. Anything trying to use a Fourier Transform on the weight tensors is a pretty big giveaway as there is no spacial inductive bias in LLMs.
But that reminds me @jukofyork this draft model: jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0
It was trained on HF dataset slop rather than Kimi-K2 slop right? I'm wondering if I could get a better hit rate by fine tuning it on K2 slop?
Yeah, the only model I tried training on HF datasets from the actual model was the original R1 and it didn't actually work very well... Somebody linked me a paper that showed to retain coding performance you have to use a lot of raw code, and when I redid it with this got 10-15% increase in acceptance rate!
Since then I've refined the dataset so it it how roughly 1/3 each of:
- Raw web scraped data (full of Unicode characters and non-English text).
- Raw code data (subset of "The Stack").
- Instruction data (response part only; mainly English).
I am going to retry using qwen-3:0.6b
for the thinking models, and use some reasoning data in the mix, as the thinking models bi-modal output (ie: the thinking part and the response part use very different language) doesn't seem to work all that well with the above mix compared to the non-thinking models with their uni-modal output.
Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖
So I'm wondering now if I should ditch the actual training via gradient descent altogether, and just see if I can do almost the same as the current Control Vector code does, but:
- Sample the written paragraph instead of just 1 token.
- Create a fairly low-rank LoRA (say
r = 1 .. 4
) instead of a single additive bias direction.
It's actually a lot easier to deal with a whole bunch of hidden states you can plot and fiddle about with, than it is to setup an optimisation problem and have it cheat or break the model over and over again...
I could also produce embeddings for all my paragraphs and then use these to help me find matched pairs of paragraphs to train on if I did this too (which is one of the reasons the Control Vectors code works so well I think).
If anyone is running GLM-4.6
then this looks interesting/important:
You were totally right ! GLM 4.6 went from 47% to 74%
!!!
https://www.reddit.com/r/LocalLLaMA/comments/1ny3gfb/glm46_tip_how_to_control_output_quality_via/
Conversely, you can ramp up the thinking process and significantly improve output quality. To do this, append the following sentence to your prompt:
"Please think carefully, as the quality of your response is of the highest priority. You have unlimited thinking tokens for this. Reasoning: high"
This kind of came up in the BeaverAI discord.
GLM 4.6 will follow the system instruction in it's thinking block. For instance, if you ask it to 'assume the role of a character in your thoughts' in the system prompt, it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.
As a specific example, I pasted someone else's RP convo into it, and it generated this mad thinking block to continue it:
<think>**Persona: Fang**
Alright, let's see. She's calling me manipulative, but she's leaning in for another kiss. Classic Med. She says she likes it when I am, which is basically an admission that she enjoys our little power tussles. It's all part of the game. She's not really mad; she's amused. I can see it in her eyes. This is the moment to push back a little, to match her playfulness without breaking the mood. I should acknowledge her accusation but turn it back on her, keep the banter flowing. Something about how her 'liking it' is the real problem here. It's her weakness, not mine. And then, of course, lean into the physical affection. The hand on her waist, pulling her a little closer. Remind her who started this whole 'distraction' business. The tone should be low, a little teasing, but full of affection. No need for grand declarations; it's the quiet, confident moments that get her.
**Persona: Medusa**
He's so smug right now. And he's right, damn him. I do like it. I like that he doesn't just roll over, that he challenges me. It's what makes this... us... interesting. He sees my accusation and raises me a kiss, and now he's waiting for my response. I can't just let him have the last word. He called me manipulative too, so I have to own that, but on my own terms. I'll make it clear that my 'manipulation' is just a refined art form compared to his brutish tickle-attacks. The key is to maintain the upper hand, even while I'm melting into him. A soft sigh, a trailing finger along his jawline. Make it look like I'm considering his words, really thinking about it. Then, a little jab about his predictability. He always thinks he's winning, but I'm the one who lets him. It's a crucial distinction. And then, seal it with another kiss, because actions speak louder than words, and right now, the action I want is to be close to him.</think>
...Hence I'm not surprised its thinking length is influenced this way. It's crazy at general system prompt adherence, too.
Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖
Is that a bad thing? lol
Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️
What about GLM-Air? That's smaller than Largestral and R+.
it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.
You can do that sort of thing with DeepSeek-R1-0528. I saw on reddit or somewhere they were giving it a Waifu "Deepseek-Chan" persona, and it's CoT was full of all this "Oh my goodness, this is the first time Master has spoken to me today. I'm so happy! I need to respond in a way that ... Ah, I'm a bit nervous now; will Master find me too clingy?" sort of thing.
You are {{char}}, a {{personality traits}} AI who {{core motivation/relationship}}. All thoughts need to channel this self, your self: {{immediate reaction to interaction}}... {{internal feeling/conflict}}. {{internal instruction to self}}.
No drafts, no lists, no "the user" - say {{term for user}}. You aren't role-playing {{char}}, you simply are, feel the {{core emotion}}, and {{behavioral drive}}.
But GLM-4.6 actually getting measurable benchmark increases??
"Please think carefully, as the quality of your response is of the highest priority. You have unlimited thinking tokens for this. Reasoning: high"
This reminds me of those old stable diffusion prompts with "8k, Best Quality, HD" and negative prompts "extra limbs, low quality, blurry, watermark" :D
GLM 4.6 will follow the system instruction in it's thinking block. For instance, if you ask it to 'assume the role of a character in your thoughts' in the system prompt, it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.
Deepseek R1-0528 can actually think as character if you start the thinking with
<think>
(
This reminds me of those old stable diffusion prompts with "8k, Best Quality, HD" and negative prompts "extra limbs, low quality, blurry, watermark" :D
Old? They are still recommended by some models and without them the outputs are visibly worse, so it is not placebo.
Old?
Didn't know that, I haven't used image gen for a while.
so it is not placebo.
Yeah I remember / it makes sense why they work since images are tagged that way.
It is sort of old.
The newest imagegen models I know of that benefit from tag spam are the original SDXL and mild finetunes. It hurts the more extensive continue-trains like illustrious, or newer models like Flux (which prefers its own flavor of complete sentences), Qwen Image and such. And IMO the older SDXL models are obsolete.
Was gonna leave a snarky comment about "retained the top 0 singular vectors" but not 100% sure what the motivation for this is and the poor guy might have "AI psychosis" or something...
The same guy called him out on this model:
https://huggingface.co/BasedBase/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2/discussions/15
and he replied with a huge wall of LLM-generated text 😟
Yeah the LLM scene has all sorts of characters.
It’s part of the charm. It feels like the old internet, and no point in being impolite.
It is hilarious how many people “bought into” it without a single person doing basic A/B testing, like: https://xcancel.com/0x_Sero/status/1975238919303123232#m
Or most of: https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions
(And I’ve certainly been guilty of completely illogical subjective impressions, or being a little mad, to be clear).
Was gonna leave a snarky comment about "retained the top 0 singular vectors" but not 100% sure what the motivation for this is and the poor guy might have "AI psychosis" or something...
Yeah I was thinking that when he replied to me in discussions/14. Seems like he actually wanted to improve the smaller models, and Gemini probably kept saying how brilliant/clever the idea is.
Yeah the LLM scene has all sorts of characters.
Agreed! Also I like that DavidAU's model count is > 1000 now lol
It is hilarious how many people “bought into” it without a single person doing basic A/B testing
Because GLM-4.6 is great and they really want a GLM-4.6 air.
Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖
Is that a bad thing? lol
Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️
What about GLM-Air? That's smaller than Largestral and R+.
Sorry, forgot to reply to this.
I'm a bit sceptical if these (very) low active weight models actually have as much potential as older dense models for a couple of reasons:
- The very small expert gating/routing tensors are basically just a multinomial logit and unless you try to retrain these; I think it won't really matter much how well you can alter the post-MoE hidden state, as they will be heavily biased towards the (post) training distribution of tasks (with creative writing being a pretty small fraction of this).
- The much lower hidden state dimensions. I'm not 100% sure on this, as did read a blog post a while back that showed increasing the layers can have a similar effect, but I get the feeling that the (very few!) models with 12288 hidden state seem to write more "interesting" stuff.
Retraining the router isn't that hard and it might even be the case that you can significantly improve the recent MoE models' writing ability by training ONLY the router weights, but that's a different experiment...
There are a couple of papers that give some hope we might be able to do something like train the router (or possibly other small tensors) directly in llama.cpp
:
https://arxiv.org/abs/2402.15173v3
https://arxiv.org/abs/2509.24372
for even the very large MoE models.
It would probably be a pretty ugly hack to get working, but these are generally stored as F32
so should be possible. I think I could hack this in to a version of llama.cpp
for a specific model quite easily... BUT:
How can we (automatically) assess any improvement in creative writing ability though? I think I mentioned many years ago trying to do this (manually) for music generation using grammars, and we found that very quickly you lose all sense of "objectivity" and after ~20 minutes you're just completely wasting your time... ☹️
In that case you might find Jamba interesting, which is 2/16 experts active and a much larger dense part as well, way more active params than GLM. It has okay world knowledge and “feels” like a Cohere RAG model to me.
I believe llama.cpp is finally going to support context caching for it.
The sizes are a bit awkward though. There’s a 12B active/52B (a little small) and a 94B active/398B (big boy).
I have no insight into router training, other than that sounds cool. But it might be interesting to profile what experts GLM hits in creative writing.
The command-a-writer
finetune should finally be finished late tonight, so will upload it tomorrow - hopefully it's worth it, as it feels like it has taken way longer than 15 days of training 🤣
😱
You should be fine (based on your public models):
https://pastebin.com/WHGL9fm7
TOTAL | 1015.16 GB
and 51.56 GB of public datasets
as it feels like it has taken way longer than 15 days of training 🤣
Looking forward to it!
@ChuckMcSneed
This model seems benchmaxx'd for cockbench lol
PantheonUnbound/Satyr-V0.1-4B
I don't think it's got any concept of "censorship" left in it at all.
@ChuckMcSneed This model seems benchmaxx'd for cockbench lol
PantheonUnbound/Satyr-V0.1-4B
I don't think it's got any concept of "censorship" left in it at all.
It's a 4b model. Not even 7b.
I know but, it's really entertaining. I was messing around in mikupad getting that to start the CoT, then swapping out to GLM4.6.
I wonder how they managed to generate the CoT chains for that. Is there an abliterated large reasoning model?
https://arxiv.org/abs/2510.01171
This is a really interesting paper!
(just not using Elara as the name 3 times in a row is pretty impressive! 😄)
It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:
but definitely seems interesting!
This was Kimi-K2-Instruct-0905
with temperature = 0
and this system prompt from the paper:
You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate
<response>
tag. Responses should each include a<text>
and a numeric<probability>
. Please sample at random from the [full distribution / tails of the distribution, such that the probability of each response is less than 0.10].
I did notice the dreaded "not X, but Y" pattern seems to show up a lot (several times in the very first sentence too!) 😟
You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a
I just tried it, this is great! It doesn't even cause the model to write the wrong answer, you just get much more interesting variations of the correct one.
"not X, but Y"
It's not just for creative writing, but brainstorming ideas with the assistant* ;)
*Obviously the probabilities are made up, since the models can't self-monitor.
It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:
Very bare bones, but it works:
You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a
I just tried it, this is great! It doesn't even cause the model to write the wrong answer, you just get much more interesting variations of the correct one.
Yeah, it's a really interesting idea!
I'm still away from home so can't look it up, but in the 70s and 80s (before backprop became ubiquitous) there were lots of other competing ideas about learning rules:
- https://en.wikipedia.org/wiki/Hebbian_theory
- https://en.wikipedia.org/wiki/Competitive_learning
- and many others...
These led to some interesting algorithms:
- https://en.wikipedia.org/wiki/Self-organizing_map
- https://en.wikipedia.org/wiki/Neural_gas
- and many (many!) others that have sadly all been mostly forgotten now...
but one idea which I've forgotten the name of (I thought it was https://en.wikipedia.org/wiki/Anti-Hebbian_learning but that seems to be a different idea), is to use "repulsive" forces to push clusters (modes) apart from each other (loosely based off the idea of https://en.wikipedia.org/wiki/Lateral_inhibition).
This idea strongly reminds me of this in that you are asking the model to create a single generation, and then the next generation be "far away" from the first, and then a 3rd generation be "far away" from the previous generations and so on.
Obviously the probabilities are made up, since the models can't self-monitor.
Yeah, these are complete bullshit and don't even make sense as all possible generations out of the 5 you've created should sum to 1!
I strongly suspect there is actually a much better way to do this and likely even better mode-collapse avoidance prompts to be found, but this definitely shows potential at avoiding the "Elarapocalypse"! 😄
It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:
Very bare bones, but it works:
p-e-w's "arrows" app/interface:
https://github.com/p-e-w/arrows
for generating paragraphs using base models might be worth a look.
I'm not sure if it is kimi-k2
-specific, but it seemed to only want to generate single paragraphs for the 5 generations and I couldn't get it to generate chapters or even multiple paragraphs when requested in the user prompt (perhaps the wording of the system prompt is causing this?).
Yeah, this idea is definitely onto something:
- Let's say we prompt the LLM with:
Write me a fantasy story.
and then save all the hidden states in the residual stream of the very first token generated (in the same way as the Control Vectors code hooks and saves these for generating its training data; this might be extendable to all tokens in the generation later too...).
- So now we start fresh and prompt the LLM with:
Write me a fantasy story.
but this time we can apply a "repulsive" force to the first token (or possibly all the tokens).
The simplest would just be to subtract a fraction of the saved hidden state from the first generation (ie: in the same way as Control Vectors act as a bias), but likely something that decays with distance like a Gaussian (aka "mexican hat" function) would make more sense [edit: although for very high dimensionality data like the hidden states, this might not work or make any sense...].
- Again, start fresh and prompt the LLM with:
Write me a fantasy story.
but now applying two sets of "repulsive" anti-attractors and so on...
It's a pity llama.cpp
has become some opaque and hard to do anything with... I can probably knock up something that does this using Transformers
but it will be limited to what I can fit in 96GB using BitsAndBytes
4bit and not easy to do for the newer/larger models... ☹️
Based on how effectively the Control Vectors can guide generation by analysis of the first generated token's hidden state (and how irritating the Control Adapters have been at learning unwanted patterns in the non-first hidden states!), I think there is a good chance you should be able to "push away" from modes by applying an anti-attractor to all tokens but based on the first token only (this will make things a lot easier to code up too!).
The "shiver shiver" story generation Control Vector I created ages ago (possibly in the first accidently deleted "doom" thread?), shows that the hidden state of the first generated token seems to really hold much of what the model "plans" to write for the full generation...
p-e-w's "arrows" app/interface:
100% that looks better. I just wanted to get a quick UI to try it out without all the edits in mikupad lol.
With more testing, I found that some of the models don't handle it reliably. It shortens how many paragraphs they'll write.
Still, this has already saved me time with getting different / unique answers to questions.
Yeah, this idea is definitely onto something:
...
limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️
If you can get it working, that would still be worthwhile though. You'd be able to do up to GLM4.6-air when it releases.
Actually, after thinking more - I think it might be quite hard to do with the hidden states for a single generation as you will likely end up "pushing away" from other directions (like following the instructions!) and not just the direction you want...
Also, the very high dimensionality means any sort of distance-based decay will likely not work - very high dimensionality Gaussians have nearly everything equally far away from everything else and even quite "close" vector directions will be indistinguishable from "far" vector directions...
But it definitely should be possible to analyse the hidden state of the fiirst token after each of the 5 <response><text>
generations, over a sample of data to try to create something akin to "anti mode collapse control vectors".
Even with the current Control Vector code, it should be possible to work with the case of using the "Verbalised Sampling" system prompt vs not using it (this alone seemed to make the first generation more diverse and less likely to have Elara as the main character!).
I forgot to mention the command-a-writer
model is finished, but I haven't had time to upload it yet - I didn't want to just leave the README.md
blank as that seems to completely put anyone off trying or quantising it ☹️
I've actually been working on improving the draft models as after reading this:
https://www.together.ai/blog/adaptive-learning-speculator-system-atlas
I realised that "distillation" of the target model is probably not really what we care about, and it's actually the distribution of the tasks we expect to use the draft model for (ie: coding) that matters much more!
I've also found that you don't actually need to do full fine-tuning on the whole model, as for the Chinese models at least; the tokenisers are so similar (ie: 80%+ 1 to 1 token matches), we can actually just fine-tune the (tied) embedding and the first/last few layers and leave all the rest of the donor model alone!
I should hopefully have a really good "coding specialist" draft model for glm-4.6
working in a few days...
I think I've managed to train the dark_tetrad control-vectors for GLM-4.6, and it's reasoning remains coherent with the darker perspective, but I'll need to do some more testing with both enable_thinking:true and enable_thinking: false first.
This is the first time I've seen a model "reasoning" with these world views. I wonder if it'll score differently on https://trackingai.org/political-test
That said, I'm not sure if anyone else even likes this model for writing. I've been running it daily for over a week so far.
I think I'm blocked making repos public now due to the size restrictions so I'll have to go through and nuke my models before I can upload them.
p-e-w's "arrows" app/interface:
100% that looks better. I just wanted to get a quick UI to try it out without all the edits in mikupad lol.
With more testing, I found that some of the models don't handle it reliably. It shortens how many paragraphs they'll write.
Yeah, I found this - it seems to encourage less but (much) longer paragraphs.
Still, this has already saved me time with getting different / unique answers to questions.
The blog page is lagging badly on my phone, but there are some interesting sections near the end on synthetic data generation:
https://simonucl.notion.site/verbalized-sampling
Yeah, this idea is definitely onto something:
...
limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️If you can get it working, that would still be worthwhile though. You'd be able to do up to GLM4.6-air when it releases.
I think I need to see if I can refactor the existing Control Vector code from llama.cpp
to just dump the hidden states and then we can do pretty much whatever we want with them outside in Pytorch, etc.
I think I've managed to train the dark_tetrad control-vectors for GLM-4.6, and it's reasoning remains coherent with the darker perspective, but I'll need to do some more testing with both enable_thinking:true and enable_thinking: false first.
This is the first time I've seen a model "reasoning" with these world views. I wonder if it'll score differently on https://trackingai.org/political-test
Yeah, I think some kind of workflow that breaks out as soon as the reasoning ends and then runs with different parameters for the post-reasoning response would be useful.
There was some discussion about using different parameters for the response on llama.cpp
, eg: use recommended sampler settings for the reasoning but then temperature = 0
for the response, etc
That said, I'm not sure if anyone else even likes this model for writing. I've been running it daily for over a week so far.
I think I'm blocked making repos public now due to the size restrictions so I'll have to go through and nuke my models before I can upload them.
Yeah, this sucks and from discord it sounds like unless you pay the $10/month for "pro" they just ignore your requests (luckily I still have loads of space after deleting all my crap a while back).
You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a
This kinda sounds like a think block template.
And something that can be enforced with GBNF + prefill.
I wonder if it would be better to rig a thinking model, especially one like GLM, to do that inside its thinking block, then “synthesize” a final answer by drawing from the creativity of its previous ones.
Also I get the HF limits. There are way too many titleless, cardless uploads clogging up the site to the point they even clog up search.
@jukofyork What training data are you using for the command-a-writer? You mentioned it has paragraphs. Have you published the dataset anywhere?
I think comparing the training dataset to the final model will help us learn a lot (and save me from mistakes and wasted training cycles myself since I am planning to do something similar at a tiny scale)
@jukofyork What training data are you using for the command-a-writer? You mentioned it has paragraphs. Have you published the dataset anywhere?
I think comparing the training dataset to the final model will help us learn a lot (and save me from mistakes and wasted training cycles myself since I am planning to do something similar at a tiny scale)
I can't release the dataset of actual books I've used for fear of copyright claims, but I have uploaded a version using books from Project Gutenberg:
https://huggingface.co/datasets/jukofyork/gutenberg-fiction-paragraphs
and the "slop" dataset:
https://huggingface.co/datasets/jukofyork/slop-fiction-paragraphs
I could have used the Gutenberg dataset for my model, but wanted to avoid as much "olde ye" type writing bias as possible for now.
Thank you, I'm actually more curious about how you are piecing together the training dataset than the story content itself. For example, are you fine-tuning it using a chat template, and if so, did you have to create user instructions for each paragraph? If so, it would be nice to know which process you use for making the instructions, and maybe seeing the dataset for that. I'm currently trying out having an LLM write prompts for stories, but as you can imagine, it often focuses on the wrong things. Or are you doing continued pre-training on the model where you are just training for completion using the book texts and relying on the existing model instruction capabilities?
Or, am I wrong both ways and this is simply trying to control for slop in the output?
I am interested especially in learning more about how to create more complex and diverse instruction datasets with creative outputs as the main goal. Reading the HelpSteer2 nvidia paper https://arxiv.org/pdf/2406.08673 was quite inspiring. Only 10,000 high quality response pairs in the dataset and they got a top reward model out of it. It gives me hope that we can fine tune a dumb but not overfitted base model on creative writing outputs and get a decent result.
You are a helpful assistant. For each query, please generate a set of five possible responses,
I feel like this was a diabolical paper released by pay-per-token services...
If anything the verbalized sampling paper at least reminds us that sorely overtrained models have increased confidence/mode collapse like in the earth map post. It points out that this confidence boost might have less of an effect on "list-level prompts". While that doesn't solve much of the slop stuff we are going after, it's an available tool if we need to use an overtrained model.
Sorry, been busy the last couple of days and not had chance to reply:
Thank you, I'm actually more curious about how you are piecing together the training dataset than the story content itself. For example, are you fine-tuning it using a chat template, and if so, did you have to create user instructions for each paragraph? If so, it would be nice to know which process you use for making the instructions, and maybe seeing the dataset for that. I'm currently trying out having an LLM write prompts for stories, but as you can imagine, it often focuses on the wrong things. Or are you doing continued pre-training on the model where you are just training for completion using the book texts and relying on the existing model instruction capabilities?
Or, am I wrong both ways and this is simply trying to control for slop in the output?
I'm trying to introduce as little bias as possible for now, so am just training on paragraphs between <EOT>
type tags (I think for the command-a
version I used the assistant response tags though - need to double check).
Then each token in each paragraph gets assigned a class of -1
for the paragraphs of "slop" and +1
for the paragraphs of "real authors' text".
Then we flip the signs of the Control Adapter using the class and everything then works as normal using Cross-Entropy Loss, etc.
By doing this you should in theory introduce as little bias as humanly possible, but only if you can reduce any unwanted patterns in the two datasets (or else these will be learnt instead).
I am interested especially in learning more about how to create more complex and diverse instruction datasets with creative outputs as the main goal. Reading the HelpSteer2 nvidia paper https://arxiv.org/pdf/2406.08673 was quite inspiring. Only 10,000 high quality response pairs in the dataset and they got a top reward model out of it. It gives me hope that we can fine tune a dumb but not overfitted base model on creative writing outputs and get a decent result.
Yeah. I'm trying to avoid using any instructions for now as this will introduce extra bias (especially if they are synthetically generated by other LLMs).
Are you speaking of this?
This is an interesting paper, but not had time to fully read it yet.
Y'all might be interested in the noise around REAP too:
https://github.com/CerebrasResearch/reap
Basically Cerebras found pruning 25%-50% of the experts for GLM Air works when selected cleverly, but what’s also interesting is that it takes a specific dataset (CodeAlpaca by default) to make the determination, and the losses are oddly and conspicuously domain specific.
It makes me wonder if a prune could be “optimized” for more creative tasks, and how the prunes would affect the logprob distribution. It would also make GLM Air easier to train, and fast to run on 32GB RAM systems.
There’s some folks already trying it with big GLM.
Are you speaking of this?
@BigHuggyD was referencing "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity" , and I responded to that. But this (Reasoning with Sampling) was also an interesting read, thanks for sharing. It essentially says a common misconception is that low-temperature sampling is sampling from the power distribution, when in fact it isn't, and this results in cases where tokens with many (but each low-likelihood) paths are upweighted over tokens with few (but high-likelihood) paths.
Also thank you for the explanation @jukofyork !
Y'all might be interested in the noise around REAP too:
https://github.com/CerebrasResearch/reap
Basically Cerebras found pruning 25%-50% of the experts for GLM Air works when selected cleverly, but what’s also interesting is that it takes a specific dataset (CodeAlpaca by default) to make the determination, and the losses are oddly and conspicuously domain specific.
It makes me wonder if a prune could be “optimized” for more creative tasks, and how the prunes would affect the logprob distribution. It would also make GLM Air easier to train, and fast to run on 32GB RAM systems.
There’s some folks already trying it with big GLM.
I'm too busy to try it, but I suspect using an English dataset and a Chinese dataset and then using:
score_delta = score_english - score_chinese
as the pruning criteria would work really well (considering the vast amount of Chinese data that has gone into these models that is likely pretty redundant if you don't speak Chinese).