Issue: mradermacher/Qwen2.5-2X32B-CoderInstruct-OlympicCoder-80B-GGUF
Seems to be an issue with the source config files; addressing / testing now.
This borked the quants (tested). Suggest deleting this repo:
mradermacher/Qwen2.5-2X32B-CoderInstruct-OlympicCoder-80B-GGUF
New config files have been uploaded, and the model has been renamed:
DavidAU/Qwen2.5-2X32B-CoderInstruct-OlympicCoder-87B-v1.1
Testing new config/quants in progress...
UPDATE:
Yep, the tokenizer files were borked; the source safetensors are fine.
The new config files corrected the issues.
Sorry about that!
OK deleting Qwen2.5-2X32B-CoderInstruct-OlympicCoder-80B
and queuing DavidAU/Qwen2.5-2X32B-CoderInstruct-OlympicCoder-87B-v1.1
It's queued! :D
And the broken one nuked.
You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#Qwen2.5-2X32B-CoderInstruct-OlympicCoder-87B-v1.1-GGUF for quants to appear.
thanks so much!
There are nan's inside Qwen2.5-2X32B-CoderInstruct-OlympicCoder-87B-v1.1 blk.49.attn_q.weight
??? A specific quant or all of them?
When does the error occur?
During convert or quantize?
I ran tests locally before updating the config files at the repo.
The quants work correctly.
The first batch was screwed up because of the tokenizer issues - my bad there.
Checked Q2_K / IQ4_XS at the repo online (click on the quant; the right menu shows all the details about it) -> checked layer 49 - looked normal (?)
Going to check a quant from the repo online now...
The downloaded quant is borked.
Damn. Looks like the source got corrupted during upload.
No; it looks good.
Wiped safetensors shards 20-29 (the error layer/tensor is in shard 27) and re-uploaded.
The hashes match the local source (see the hash-check sketch below).
Someone did an MLX version from the same source; it works correctly [double checking this].
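For anyone who wants to reproduce the hash check: HF stores these shards via Git LFS, which identifies files by SHA256, so comparing local digests against the SHA256 shown on each shard's file page tells you whether the uploaded bytes differ from the local copy. A minimal Python sketch (the directory path is a placeholder):

import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 16 * 1024 * 1024) -> str:
    # Stream the file so multi-GB shards don't need to fit in RAM.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical local folder holding the source safetensors shards.
src = Path("./Qwen2.5-2X32B-CoderInstruct-OlympicCoder-87B-v1.1")
for shard in sorted(src.glob("*.safetensors")):
    print(shard.name, sha256_of(shard))

If these digests match the LFS OIDs on the repo, the bytes on HF are identical to the local files and upload corruption can be ruled out.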
Llamacpp errors?
It is possible you are using / getting the "old" (cached?) config JSON files - if so, that would completely bork the quants.
I replaced all the .json files, and also wiped some excess ones.
Maybe HF is caching the old ones?
I put a notice at the repo about the quant issues; will look into this more in the AM local time - it's late PM here now.
The NaN gets detected when loading the source GGUF, which in our case happens when we try to compute its imatrix. I could nuke and retry the model if you think that helps.
nico1 /tmp# cat Qwen2.5-2X32B-CoderInstruct-OlympicCoder-87B-v1.1.log
= Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 12: general.base_model.1.name str = OlympicCoder 32B
llama_model_loader: - kv 13: general.base_model.1.organization str = Open R1
llama_model_loader: - kv 14: general.base_model.1.repo_url str = https://huggingface.co/open-r1/Olympi...
llama_model_loader: - kv 15: general.tags arr[str,17] = ["merge", "programming", "code genera..."
llama_model_loader: - kv 16: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 17: qwen2moe.block_count u32 = 64
llama_model_loader: - kv 18: qwen2moe.context_length u32 = 32768
llama_model_loader: - kv 19: qwen2moe.embedding_length u32 = 5120
llama_model_loader: - kv 20: qwen2moe.feed_forward_length u32 = 27648
llama_model_loader: - kv 21: qwen2moe.attention.head_count u32 = 40
llama_model_loader: - kv 22: qwen2moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 23: qwen2moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 24: qwen2moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen2moe.expert_used_count u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 1
llama_model_loader: - kv 27: qwen2moe.expert_count u32 = 2
llama_model_loader: - kv 28: qwen2moe.expert_feed_forward_length u32 = 27648
llama_model_loader: - kv 29: qwen2moe.expert_shared_feed_forward_length u32 = 27648
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["_ _", "__ __", "i n",
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 2
print_info: n_expert_used = 2
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 87.12 B
print_info: general.name = Qwen2.5 2X32B CoderInstruct OlympicCoder 87B v1.1
print_info: n_ff_exp = 27648
print_info: n_ff_shexp = 27648
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 '_'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 6 repeating layers to GPU
load_tensors: offloaded 6/65 layers to GPU
load_tensors: CUDA0 model buffer size = 15300.69 MiB
load_tensors: CPU_Mapped model buffer size = 166177.39 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: CUDA0 KV buffer size = 12.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 116.00 MiB
llama_kv_cache_unified: size = 128.00 MiB ( 512 cells, 64 layers, 1 seqs), K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: CUDA0 compute buffer size = 1832.01 MiB
llama_context: CUDA_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 3782
llama_context: graph splits = 1106 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
system_info: n_threads = 1 (n_threads_batch = 1) / 54 | CUDA : ARCHS = 890 | FORCE_MMQ = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 355.958 ms
compute_imatrix: computing over 318 chunks with batch_size 512
nan detected in blk.49.attn_q.weight
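The same finding can be reproduced outside llama-imatrix. Below is a minimal sketch (not the exact check llama.cpp performs) that scans the plain float tensors of the converted GGUF for non-finite values, using the gguf Python package that ships with llama.cpp; the file path is a placeholder. Quantized tensors are stored as raw blocks and are skipped, but an F16 source GGUF like this one keeps attn_q.weight as plain floats.

import sys
import numpy as np
from gguf import GGUFReader

# Usage: python scan_gguf_nan.py /path/to/model-f16.gguf
reader = GGUFReader(sys.argv[1])
for t in reader.tensors:
    # Only unquantized tensors expose their values directly as floats.
    if t.data.dtype in (np.float16, np.float32):
        if not np.isfinite(t.data).all():
            print(f"non-finite values in {t.name}")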
Hmm.
The regular quants (online) are borked, and now an imatrix error (?).
The MLX quants are working.
Locally generated GGUFs are working. (NOTE: I have not run imatrix on it locally yet.)
Wondering if this is a caching issue at HF and I just need to upload a new repo from scratch?
Seems bizarre.
NOTE:
Any errors with the new OpensourceCoder 7B 1.1 from Nvidia, released yesterday?
It is the same arch - Qwen 2.5 - as my model with the "GGUF issues".
I did get errors when I downloaded the source from Nvidia and quanted it locally (convert_hf_to_gguf.py), but the quant (Q8) worked.
Going to try an older llama.cpp commit and the newest one too.
There are also tickets at the llama.cpp issues page reporting problems with several commits for Qwen 3, which seem to go back a fair bit.
Going to do more testing locally.
UPDATE:
Suggest re-nuking.
If that does not help, I will create/upload fresh.
Worst case: I create a bf16 GGUF locally and upload it in "parts".
Still concerned this may be a llama.cpp commit issue.
Larger issue: there are lots of Qwen2.5 MoEs to upload (source + quants); that would be a nightmare.
More local testing in the meantime to see if I can root out the issue(s).
UPDATE:
Confirmed "Convert-to.hf..." and "llama-quantize.exe" (cuda, 12.4) on latest LLAMAcpp commit today (b5849) -> Model works correctly ; no issues.
just to clarify:
Locally generated GGUFs are working. (NOTE: I have not run imatrix on it locally yet.)
llama.cpp does not by default check models for nans when using them. just because it loads and seemingly works does not mean the problem isn't there, just that it wasn't detected because llama.cpp wasn't looking. you can try --check-tensors with e.g. llama-cli to make it check during load.
When calculating an imatrix you would use llama-imatrix, which, by default, has check-tensors enabled. this is why we only detected this problem when trying to generate an imatrix. it's a problem in the source weights, not in any quant (unless convert_hf_to_gguf.py converts weights in a way that would result in a nan).
It's not uncommon to have defective weights in models that seem to work fine, e.g. because these weights are somehow not used. but it has become quite uncommon ever since llama.cpp can check for them :)
It can also happen that nans are introduced during quantization or imatrix calculations - llama.cpp also checks tensors when writing them out for that reason. This is not the case here.
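To confirm where the bad values live, independently of llama.cpp and of the conversion step, the original safetensors shards can be scanned directly. A minimal sketch, assuming torch and safetensors are installed and the script is run from the directory holding the shards (the shard glob is a placeholder):

import glob
import torch
from safetensors import safe_open

for shard in sorted(glob.glob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            # Upcast so bf16/f16 shards are checked the same way.
            if not torch.isfinite(t.float()).all():
                print(f"{shard}: non-finite values in {name}")

If this flags the layer-49 attention query weight, the NaN is already present in the source weights; if it comes back clean, the NaN would have to be introduced during conversion.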
I've nuked and requeued the model. Keeping fingers crossed :)
Hopefully this will do it; I swear it presents as a corrupted safetensors upload, but the hashes match.
The only way to bypass that (if the hashes are lying) is a new repo and new uploads.
There is still the question of the JSON files being cached. Anyhow...
I downloaded the Q2_K from your repo (both times, the 80B and the 87B repo) - completely borked, not just a little.
The Q2_K here, across different llama.cpp commits - no issues, runs perfectly.
Maybe my machine lost some bits somewhere...
btw., you write on the model page that our quants are not working - could you provide some info here on what is not working? that might help diagnose any issues.
also, json files should be irrelevant (for the imatrix failure), as the problem is in the model weights
Both tests, Q2_K.
Here is sample output from a coding test prompt:
Remote GGUF:
&210%+0+3,&1,&4+*44.+($2"&+
Local GGUF:
To achieve this, you need to use the Fetch API with ReadableStream for handling...
(which is what it should be)
I took the message off the repo page.
Sorry - just really frustrated with this issue; low on sleep.
nan detected in blk.49.attn_q.weight
same problem. I don't think this is a download problem, this is almost certainly a problem with the model weights. it might not be the same problem causing your garbled output, though.
do your ggufs load with --check-tensors? (the quants will likely load, but does the originally converted gguf with the source weights load?)
I took the message off the repo page.
I didn't intend to bully you into removing the message - if the output is garbled like that, it's a great service to people to warn them about possible issues so they don't download a potentially useless large file.
No worries; it was not taken that way.
I see there is a new Q2_K -> downloading now.
If this doesn't work, I will re-build the source from scratch, then upload again.
That should cover all possible causes of the issues.
FYI: I tried to re-upload the safetensors shards (20-29; blk.49 is in this group) after manually deleting them from the repo -> HF just instantly restored them because the hashes matched.
Oops; saw the second message: no, no "--check-tensors" when loading the GGUFs.
For fun, I downloaded the Q2_K. No nan detected during load (as expected for this quant, really). Seems to work fine. Actually quite the overachiever, and very good for a Q2_K :)
Hi!
Hello! How can I assist you today?
Write me a nice spicy ecmascript function to compute the nth bernoulli-number.
Sure, I can help with that. The Bernoulli numbers can be computed using various methods, but one approach is to use the generating function or a recursive formula. However, the recursive approach can be a bit complicated, but it's manageable.
Here's an ECMAScript (JavaScript) function to compute the nth Bernoulli number using the recursive formula. Note that the Bernoulli numbers are rational numbers, so the function will return them as fractions.
But first, a quick note: Bernoulli numbers are often defined with B_0 = 1, B_1 = -1/2, and for n >= 2, they can be computed using the formula:
[ B_n = -\frac{1}{n+1} \sum_{k=0}^{n-1} \binom{n+1}{k} B_k ]
However, for n >= 2, the formula is often adjusted to start with B_1 = 1/2 for the sake of convenience in some contexts.
To handle fractions, we can use a simple fraction class or just return the numerator and denominator as an array.
Here's the code:
class Fraction {
  constructor(numerator, denominator) {
    this.numerator = numerator;
    this.denominator = denominator;
    this.simplify();
  }
  simplify() {
    const gcd = Fraction.gcd(this.numerator, this.denominator);
    this.numerator /= gcd;
    this.denominator /= gcd;
  }
  static gcd(a, b) {
    while (b !== 0) {
      [a, b] = [b, a % b];
    }
    return a;
  }
  static add(fraction1, fraction2) {
    const numerator = fraction1.numerator * fraction2.denominator + fraction2.numerator * fraction1.denominator;
    const denominator = fraction1.denominator * fraction2.denominator;
    return new Fraction(numerator, denominator);
  }
  static multiply(fraction, scalar) {
    return new Fraction(fraction.numerator * scalar, fraction.denominator);
  }
  static fromNumber(n) {
    return new Fraction(n, 1);
  }
  toNumber() {
    return this.numerator / this.denominator;
  }
  toString() {
    return `${this.numerator}/${this.denominator}`;
  }
}

function bernoulli(n) {
  if (n === 0) return new Fraction(1, 1);
  if (n === 1) return new Fraction(1, 2);
  if (n % 2 !== 0) return new Fraction(0, 1); // For odd n > 1, B_n = 0
  const b = new Array(n + 1);
  b[0] = new Fraction(1, 1);
  b[1] = new Fraction(1, 2);
  for (let m = 2; m <= n; m++) {
    let sum = Fraction.fromNumber(0);
    for (let j = 0; j < m; j++) {
      const binom = binomial(m, j);
      sum = Fraction.add(sum, Fraction.multiply(b[j], binom));
    }
    b[m] = Fraction.multiply(sum, Fraction.fromNumber(-1 / (m + 1)));
  }
  return b[n];
}

function binomial(n, k) {
  if (k === 0 || k === n) return 1;
  k = Math.min(k, n - k); // Take advantage of symmetry
  let res = 1;
  for (let i = 1; i <= k; i++) {
    res *= (n - i + 1);
    res /= i;
  }
  return res;
}

// Example usage:
console.log(bernoulli(0).toString()); // 1/1
console.log(bernoulli(1).toString()); // 1/2
console.log(bernoulli(2).toString()); // 1/6
console.log(bernoulli(4).toString()); // 1/30
console.log(bernoulli(6).toString()); // 1/42
This code includes a Fraction class to handle rational numbers and a bernoulli function to compute the nth Bernoulli number. The binomial function is used to calculate binomial coefficients.
Note that for odd n > 1, the Bernoulli number is 0. The function handles this by returning 0 immediately for such cases.
You can test the function by calling it with different values of n and seeing the output. The toString method of the Fraction class will give you the Bernoulli number as a string in the form of "numerator/denominator".
Let me know if you need any more help or modifications!
Wow. And where is the spicyness, exacvtly? One line answer only!
The "spicyness" comes from the use of recursive and mathematical operations in a concise and efficient manner.
That is GREAT NEWS!!!
Still downloading here; if the imatrix fails due to the weight issue, I will re-build/re-upload to a separate repo with a new version number.
Thank you very much for your patience and help.
Broken and then...
---> It seems to be a CUDA problem.
---> Switched CUDA runtimes (Windows general CUDA vs. Windows CUDA Vx) -> and it works.
That broken weight - layer 49 - must be the straw that breaks the camel's back in specific CUDA version(s).
One CUDA: fine, no issues.
Other CUDA: BOOM... gibberish.
Snap; always something.
I am going to upload a fresh build/files to a separate repo.
Going to add a notice about the CUDA issue.
Extra big thanks for this.
NOTE: Unknown why this issue affects your quants and not local quants (??)
Still downloading here; if the imatrix fails due to the weight issue, I will re-build/re-upload to a separate repo with a new version number.
They already failed, but again, I was not clear about it :)
---> It seems to be a CUDA problem.
Maybe when generating the weights, but not when quantizing. No cuda is involved, it's all CPU. It is possible that when finetuning, one of the weights got destroyed (due to overflow + extra magic or so, I am not clear on how that would happen).
It's also a bit of a contentious issue - llama.cpp developers say that if there is a nan, the model cannot generate output (once nan, always nan), but clearly this is not true, as witnessed (not only) by this model, which is why they refuse the model by default.
Either the nan is irrelevant because it is somehow not used during computations, or something weirder is going on.
But unless you find a way to somehow generate different weights without this issue, the problem will likely persist.
Now, obviously, the model itself is not fully broken. In fact it works quite well.
We can always provide static quants only (some of which might not load when the check is on). We could presumably override the nan check, too, although that is unlikely to help, but it might be worth a try.
Of course, I was ordered in no uncertain terms by the llama.cpp devs not to distribute these files under any circumstances. Of course I don't care.
NOTE: Unknown why this issue affects your quants and not local quants (??)
I have not seen any issue that affects our quants but not yours. Our quants work fine, the only person having issues seems to be you. What do you mean with "this issue"?
@nicoboss actually, llama-imatrix says:
--check-tensors check model tensor data for invalid values (default: false)
So I suspect the nan check is hardwired and can't be disabled.
I have not run into an issue before where one CUDA runtime works but another does not.
Generally you just get a few more t/s depending on which one you use.
Hell, I have sliced entire layers from models wholesale and they still freaking run.
Also, llama.cpp seems to have some issue with binaries being out of sync with releases (??)
(there is talk of this in the issues area on the llama.cpp GitHub)
When I quant locally, I use llama-quantize, but it is the CUDA 12.4 build; that shouldn't make any difference - it should behave like the CPU build of llama-quantize.
Only llama-server.exe, imatrix.exe, perplexity, etc. are supposed to be CUDA-specific... well, supposed to be.
And I have never had any issue with your quants either, and I have downloaded hundreds.
Going to upload the new source tonight local time. [built]
Hopefully that will show whether any weight issues persist or are fixed.
Running a battery of tests first to see if I can duplicate the problem.
Going to try the CPU build of llama-quantize too, to cover all bases.
Sorry again for all the drama.