Wur doomed!

#14
by jukofyork - opened

Continuation of THE THREAD OF DOOM.

jukofyork pinned discussion

What do you and the others think of the distilled R1 models for writing?

The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.

R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.

R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.

CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have a lot in its training data.

Yeah, I have tried the same two and felt the same way.

I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...

@gghfez @BigHuggyD that has been my experience as well, which is a shame as I had a go of R1 on Openrouter and I was blown away.

In your experience, what model comes anywhere close that's usable on a machine with 24GB VRAM and 32GB RAM?

There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:

prompt eval time =   14026.61 ms /   918 tokens (   15.28 ms per token,    65.45 tokens per second)
       eval time =  398806.12 ms /  1807 tokens (  220.70 ms per token,     4.53 tokens per second)
      total time =  412832.73 ms /  2725 tokens

I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags, e.g.:

I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat: their redemption is fragile, and the future is uncertain.

Let me outline the next few chapters:

Chapter 5: Nightmares and Trust
...

But it doesn't backtrack like R1 does. It just kind of agrees with itself and ends up writing how it usually would:

"I don't know what I want anymore," she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.

lol

Ahhh, that's a shame :-(

"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."

Oh god!

I'll have to keep an eye on this thread.

I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0

But my tastes are probably not as refined as others on this thread ;-)

Maybe my use case is not what it's intended for - creative writing.

Yeah, it works best for things like code (where you repeat the same multi-token variable names often) or where your prompt asks the model to repeat large sections of what you have already given it.
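
(Assuming "it" here is draft-model speculative decoding, as the later speculative-model posts suggest, a toy sketch of the greedy accept/verify loop shows why repetition helps: the draft only pays off when its guesses match the target's. `draft_next`/`target_next` are stand-ins, not any real API.)

```python
# Toy sketch of greedy speculative decoding (illustrative assumption, not real code from
# the thread). draft_next/target_next stand in for models: token-list -> next greedy token.
def speculative_step(context, draft_next, target_next, k=4):
    # 1. The cheap draft model proposes k tokens.
    ctx = list(context)
    proposed = []
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. The target model verifies them (in practice all k+1 positions in one batched pass)
    #    and keeps the longest prefix where its own greedy choice agrees with the draft.
    out, ctx = [], list(context)
    for t in proposed:
        expected = target_next(ctx)
        if expected != t:
            out.append(expected)   # first mismatch: take the target's token, drop the rest
            return out
        out.append(t)
        ctx.append(t)
    out.append(target_next(ctx))   # all k accepted: the target's extra position is free too
    return out
```

If draft and target agree (repeated variable names, re-quoted prose), each target pass yields up to k+1 tokens; if they diverge immediately, you're back to roughly one token per pass.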

Makes sense then. Shame.

@bartowski I discussed this last year in this thread:

https://github.com/ggml-org/llama.cpp/pull/6844#issuecomment-2194362702

but didn't get any further as it's still not clear what we want to optimise... :/

If you want to experiment with this then I'd just try to hack/parameterise the llama_tensor_get_type() function, pick a really small model like qwen:0.5b, code up CEM or SPSA (they are both just 20 lines of code at most), and see if you can get anywhere with it.

If you can, then scaling up to the bigger models may be worth investing some extra time in implementing other ideas (e.g. training a regression tree over many different models' data, as I mentioned in that thread). But overall, if you can't find a good optimisation criterion or get it working for a tiny model, then it's not worth even considering huge models like deepseek-r1.
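
(A minimal CEM sketch of the kind of ~20-line optimiser described above, searching over per-tensor quant-type choices. Everything in it is a placeholder, especially `evaluate()`, which stands in for whatever criterion gets chosen, e.g. quantise with a hacked `llama_tensor_get_type()` and score perplexity/KL against file size.)

```python
# Minimal cross-entropy-method (CEM) sketch for searching per-tensor quant types.
# QUANT_TYPES, N_TENSORS and evaluate() are placeholders, not llama.cpp internals.
import numpy as np

QUANT_TYPES = ["Q3_K", "Q4_K", "Q5_K", "Q6_K"]   # candidate types per tensor
N_TENSORS   = 24                                 # e.g. something qwen:0.5b-sized

def evaluate(assignment):
    """Placeholder objective (lower is better). A real version would quantise the model
    with these per-tensor types and score it, e.g. KL/perplexity penalised by file size."""
    idx = np.array([QUANT_TYPES.index(q) for q in assignment])
    return float(np.sum((idx - 1) ** 2)) + 0.1 * np.random.randn()  # dummy so it runs

def cem_search(iters=50, pop=32, elite_frac=0.25, smooth=0.9):
    # One categorical distribution over quant types per tensor, initially uniform.
    probs = np.full((N_TENSORS, len(QUANT_TYPES)), 1.0 / len(QUANT_TYPES))
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # Sample candidate assignments and score them.
        samples = [[np.random.choice(QUANT_TYPES, p=probs[t]) for t in range(N_TENSORS)]
                   for _ in range(pop)]
        scores = [evaluate(s) for s in samples]
        elite = [samples[i] for i in np.argsort(scores)[:n_elite]]
        # Re-fit each tensor's distribution to the elite samples (with smoothing).
        for t in range(N_TENSORS):
            counts = np.array([sum(e[t] == q for e in elite) for q in QUANT_TYPES], float)
            probs[t] = smooth * (counts / n_elite) + (1.0 - smooth) * probs[t]
    # Return the most likely assignment under the final distribution.
    return [QUANT_TYPES[int(np.argmax(probs[t]))] for t in range(N_TENSORS)]

if __name__ == "__main__":
    print(cem_search())
```

SPSA would look much the same size, except it perturbs a continuous relaxation (e.g. target bits per tensor) in random +/- directions instead of sampling discrete types.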

Just noticed the new Deepseek came out yesterday. Any good for writing?

I noticed Unsloth are giving the mlp modules more bits in their dynamic quants this time 🤞

Just noticed the new Deepseek came out yesterday. Any good for writing?

I noticed Unsloth are giving the mlp modules more bits in their dynamic quants this time 🤞

It's early, but so far I like it better than R1 for a multi-turn blend of creativity/coherence. It seems to keep it together longer as the context grows. On the other hand, I haven't had any of the... laugh-out-loud or jaw-dropping, "I can't believe an LLM just wrote that" moments. I suppose it's hard to have something that can produce an off-the-wall unique one-liner and not also make whole paragraphs and chapters off-the-wall, too.

I've tidied up the transplant-vocab code now:

https://github.com/jukofyork/transplant-vocab

and am still working on deepseek-r1 and deepseek-v3 speculative models (it turns out larger models were a waste of time, so I'm now trying to trim down qwen-2.5 to be even smaller...).

It's early, but so far I like it better than R1 for a multi-turn blend of creativity/coherence.

I only tried it at a short context (12K). I agree it stays more coherent and less fixated on certain details at long context.
This one doesn't seem to quant as well as R1.

So... who is going to do a ties-merge (DS-V3.5 with R1)? lol
4TB SSD for BF16 upcast + infinite patience testing it on CPU?

I've tidied up the transplant-vocab code now:

That readme with examples looks good. I can see why my attempts to use it for command-a failed now.

Yeah, I tried hard to make it clear how to use it, but overall the transformers model files seem pretty brittle and it's not easy to be sure that what works on one model will work on another, sadly :/

I committed the final thing I can think of today, so it should be pretty stable now, with only bug-fixes to be added (I was going to try shrinking the hidden_dim too, but it got too complicated due to all the layer_norms and the constraint of being a multiple of the number of heads, etc).

I'm currently training up a 0.3B model with half the layers of qwen-2.5-instruct:0.5b removed and the intermediate_size halved too. So far it looks to be recovering nearly all the damage, and it should run way quicker in llama.cpp.
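
(Not the actual script, but a rough sketch of the layer-removal half of that, assuming the standard transformers Qwen2 layout and the stock `Qwen/Qwen2.5-0.5B-Instruct` release. Halving intermediate_size additionally means truncating the MLP weight matrices and updating the config before the healing/training run.)

```python
# Rough sketch (not the code from the thread): prune every other decoder layer from a
# Qwen2.5-0.5B-style model to make a smaller draft model, then heal it with training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(name)

# Keep every other decoder layer (uniform pruning; which layers to drop is a choice).
keep = list(range(0, model.config.num_hidden_layers, 2))
model.model.layers = torch.nn.ModuleList(model.model.layers[i] for i in keep)
model.config.num_hidden_layers = len(keep)

# Keep the KV-cache layer indices contiguous after pruning.
for i, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = i

model.save_pretrained("qwen2.5-0.3b-draft")
tok.save_pretrained("qwen2.5-0.3b-draft")
```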

I only tried it at a short context (12K). I agree it stays more coherent and less fixated on certain details at long context.
This one doesn't seem to quant as well as R1.

So... who is going to do a ties-merge (DS-V3.5 with R1)? lol
4TB SSD for BF16 upcast + infinite patience testing it on CPU?

Yes, precisely! R1 would latch onto one detail of my prompt and not let go.
I have spent much more time with it, and it's definitely my number one on multi-turn now. I've done about seven different scenarios and was able to get to a satisfactory conclusion in six of them, all around 28k to 42k context, whereas with R1 I'd have no shot of making it that far.
I keep saying that this is my last month with that startup, but they keep extending. 😜 I wonder if I could convince them to let me use some hardware to take a stab at a merge. I could pitch it as some publicity. They recently joined the OpenRouter provider family and are looking for subscribers.
Is ties-merge pretty straightforward or nuanced?
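
(The recipe itself is fairly straightforward; the nuance is mostly in picking density/weight per model and surviving the logistics at this scale. Per tensor, TIES does trim / elect sign / disjoint merge. A numpy sketch of the paper's idea, not any particular mergekit implementation:)

```python
# Rough numpy sketch of the core TIES recipe (trim, elect sign, disjoint merge) for a
# single tensor - an illustration of the idea, not mergekit's actual code.
import numpy as np

def ties_merge(base, finetunes, density=0.5, lam=1.0):
    # 1. Task vectors: each fine-tune's delta from the shared base.
    deltas = [m - base for m in finetunes]
    # 2. Trim: keep only the top-`density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(np.ceil(density * d.size)))
        thresh = np.sort(np.abs(d).ravel())[-k]
        trimmed.append(np.where(np.abs(d) >= thresh, d, 0.0))
    # 3. Elect sign: per parameter, the sign with the larger total magnitude wins.
    sign = np.sign(sum(trimmed))
    # 4. Disjoint mean: average only the (non-zero) deltas that agree with the elected sign.
    agree = [np.where(np.sign(t) == sign, t, 0.0) for t in trimmed]
    counts = sum(((np.sign(t) == sign) & (t != 0)).astype(int) for t in trimmed)
    merged_delta = sum(agree) / np.maximum(counts, 1)
    return base + lam * merged_delta
```

In mergekit terms this is `merge_method: ties` with per-model density/weight parameters, which is where most of the tuning effort goes.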
