regex_error(error_badrepeat)

#2
by newsletter - opened

The Q8_0 quant cannot be loaded with llama.cpp b5010 (latest, CPU-only) because of the following error:

build: 5010 (a8a1f335) with MSVC 19.43.34808.0 for x64
system info: n_threads = 6, n_threads_batch = 12, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 12) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8081, http threads: 11
main: loading model
srv    load_model: loading model 'llm\Ling-lite\Ling-lite.Q8_0.gguf'
llama_model_loader: loaded meta data with 44 key-value pairs and 367 tensors from llm\Ling-lite\Ling-lite.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bailingmoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ling Lite
llama_model_loader: - kv   3:                         general.size_label str              = 64x1.5B
llama_model_loader: - kv   4:                            general.license str              = mit
llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   6:                     bailingmoe.block_count u32              = 28
llama_model_loader: - kv   7:                  bailingmoe.context_length u32              = 16384
llama_model_loader: - kv   8:                bailingmoe.embedding_length u32              = 2048
llama_model_loader: - kv   9:             bailingmoe.feed_forward_length u32              = 5632
llama_model_loader: - kv  10:            bailingmoe.attention.head_count u32              = 16
llama_model_loader: - kv  11:         bailingmoe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                  bailingmoe.rope.freq_base f32              = 600000.000000
llama_model_loader: - kv  13: bailingmoe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:               bailingmoe.expert_used_count u32              = 6
llama_model_loader: - kv  15:            bailingmoe.rope.dimension_count u32              = 128
llama_model_loader: - kv  16:               bailingmoe.rope.scaling.type str              = none
llama_model_loader: - kv  17:       bailingmoe.leading_dense_block_count u32              = 0
llama_model_loader: - kv  18:                      bailingmoe.vocab_size u32              = 126464
llama_model_loader: - kv  19:      bailingmoe.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  20:            bailingmoe.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:                    bailingmoe.expert_count u32              = 64
llama_model_loader: - kv  22:             bailingmoe.expert_shared_count u32              = 2
llama_model_loader: - kv  23:             bailingmoe.expert_weights_norm bool             = true
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = bailingmoe
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,126464]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,126464]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,125824]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 126080
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 126081
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 126081
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {% for message in messages %}{% set r...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                          general.file_type u32              = 7
llama_model_loader: - kv  37:                                general.url str              = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv  38:              mradermacher.quantize_version str              = 2
llama_model_loader: - kv  39:                  mradermacher.quantized_by str              = mradermacher
llama_model_loader: - kv  40:                  mradermacher.quantized_at str              = 2025-03-31T05:06:41+02:00
llama_model_loader: - kv  41:                  mradermacher.quantized_on str              = kaos
llama_model_loader: - kv  42:                         general.source.url str              = https://huggingface.co/inclusionAI/Li...
llama_model_loader: - kv  43:                  mradermacher.convert_type str              = hf
llama_model_loader: - type  f32:   85 tensors
llama_model_loader: - type q8_0:  282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 16.64 GiB (8.51 BPW)
Failed to process regex: ''(?:[sSdDmMtT]|[lL][lL]|[vV][eE]|[rR][eE])|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+'
Regex error: regex_error(error_badrepeat): One of *?+{ was not preceded by a valid regular expression.
llama_model_load: error loading model: error loading model vocabulary: Failed to process regex
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'llm\Ling-lite\Ling-lite.Q8_0.gguf'

Is this a bug in llama.cpp or does this quantization have a flaw?

Yeah, as usual, llama.cpp added support for a model but apparently didn't even bother to try it with any of the actual models (it seems all are pretty much broken). I'll investigate tomorrow, probably this repo will just go away.

> Yeah, as usual, llama.cpp added support for a model but apparently didn't even bother to try it with any of the actual models (it seems all are pretty much broken). I'll investigate tomorrow, probably this repo will just go away.

That's harsh. I and several others did test and did not get this error. There is definitely something wrong, though, as imatrix tokenization hangs; investigating...

CISC, who implemented support for BailingMoE in llama.cpp, is aware of this issue. Please follow https://github.com/ggml-org/llama.cpp/pull/12634 for further information. I assume this can be fixed with a future llama.cpp update; if not, we will requantize this model as soon as the issue is fixed.

Edit: Oh wow he was faster at responding than me :D

It is worth mentioning that, using llama-server, I'm unable to reproduce this issue myself so far. By the way, big thanks to CISCai for implementing support for these amazing models; without him, llama.cpp might never have supported them. I really appreciate his work. I regret not having tested the PR better myself; I wrongly assumed everything besides Ling-plus was already well tested.

Hopefully fixed in llama.cpp#12677

@CISCai

> That's harsh

But almost certainly true - if this was really tested I apologize. But it is honestly hard to believe that it was tested when it fails (differently) with practically every single Ling* model (especially since this happens every single time: the recently added PLM support also doesn't work with the PLM model itself, and it was the same with other models in the past). I understand when it fails with finetunes by others (or the patch author can't test it with e.g. the big models), but the models it was specifically meant to support?

You probably tested your patch, and it's fine if there are problems - the problem is that llama.cpp has essentially no quality control. Remember, my comments were not aimed at you, but at llama.cpp as a whole. I'm hard-pressed to find a project in the free software world that is similarly shoddily maintained (with similar, or even much less, exposure).

In any case, looking at the fix, it seems the quants themselves should be fine.
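To illustrate why the quants are unaffected, here is a rough conceptual sketch (not actual llama.cpp source; the table and names are made up for illustration): a GGUF file stores only the pretokenizer name (here tokenizer.ggml.pre = "bailingmoe"), while the split regexes themselves are defined inside llama.cpp and looked up by that name at load time, so a fix on the llama.cpp side needs no requantization.

```cpp
// Conceptual sketch, not llama.cpp code: the lookup table and variable names
// are hypothetical. The "bailingmoe" pattern is the one from the error log
// above; other entries are omitted.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // hypothetical name -> split-regex table living in the library, keyed by
    // the pretokenizer name that the GGUF metadata carries
    const std::map<std::string, std::vector<std::string>> pretok_regexes = {
        { "bailingmoe", {
            R"('(?:[sSdDmMtT]|[lL][lL]|[vV][eE]|[rR][eE])|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+)",
        } },
    };

    const std::string pre = "bailingmoe";  // in practice read from tokenizer.ggml.pre
    for (const auto & pattern : pretok_regexes.at(pre)) {
        std::cout << "split regex selected for '" << pre << "': " << pattern << "\n";
    }
    return 0;
}
```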

> But almost certainly true - if this was really tested I apologize. But it is honestly hard to believe that it was tested when it fails (differently) with practically every single Ling* model (especially since this happens every single time: the recently added PLM support also doesn't work with the PLM model itself, and it was the same with other models in the past). I understand when it fails with finetunes by others (or the patch author can't test it with e.g. the big models), but the models it was specifically meant to support?

Well, in this case the problem was a certain regex feature (from the original tokenizer) that some implementations did not support at all, and others did not support correctly. If you are not aware, regex is a shitshow across implementations; depending on your setup it will behave differently, as was clearly the case here, and it is really hard to test properly!
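For anyone curious about the specific feature: the pattern in the log uses possessive quantifiers (`?+`, `++`), which the ECMAScript grammar behind std::regex does not have, and that is consistent with the regex_error(error_badrepeat) shown in the log from the MSVC build. A minimal standalone sketch (not llama.cpp code; the fragment is trimmed and the exact error code can differ between standard-library implementations):

```cpp
// Illustrative only: a trimmed fragment of the failing pattern that keeps the
// possessive "?+"/"++" quantifiers but drops the \p{...} classes (which
// std::regex does not support either).
#include <iostream>
#include <regex>

int main() {
    try {
        std::regex re("[^\\r\\n]?+[A-Za-z]+| ?[^\\sA-Za-z0-9]++[\\r\\n]*");
        std::cout << "pattern accepted\n";
    } catch (const std::regex_error & e) {
        // MSVC reports error_badrepeat here; other standard libraries also
        // reject the pattern, though the exact error code may vary.
        std::cout << "regex_error: " << e.what() << "\n";
        if (e.code() == std::regex_constants::error_badrepeat) {
            std::cout << "(error_badrepeat, as in the log above)\n";
        }
    }
    return 0;
}
```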

> You probably tested your patch, and it's fine if there are problems - the problem is that llama.cpp has essentially no quality control. Remember, my comments were not aimed at you, but at llama.cpp as a whole. I'm hard-pressed to find a project in the free software world that is similarly shoddily maintained (with similar, or even much less, exposure).

Given the size and multitude of models, you can't expect reviewers to do model testing; it has to be the submitter and the volunteers/community (you) who bear the brunt of this.

As long as you report any issues, I'm sure the submitter and/or others are willing to fix them; at least I know I am.

So, having said that, what was the problem with PLM? :)

It works with the latest llama.cpp release. Thank you all!

newsletter changed discussion status to closed