[BUG] Failed to load the model

#2
by metsavana

Hello! I have downloaded the Noromaid-v0.1-mixtral-8x7b.q4_k_m.gguf model and tried to run it in the latest text-generation-webui (v3.6.1), but I'm getting an error. Is there any way to fix this?

Below is the console output:

```
21:42:23-781941 INFO Loading "Noromaid-v0.1-mixtral-8x7b.q4_k_m.gguf"
21:42:23-786444 INFO Using gpu_layers=5 | ctx_size=8192 | cache_type=fp16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
build: 1 (b548c7c) with MSVC 19.29.30159.0 for x64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 20

system_info: n_threads = 12 (n_threads_batch = 12) / 20 | CUDA : ARCHS = 500,520,530,600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 54239, http threads: 19
main: loading model
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5070) - 10931 MiB free
llama_model_loader: loaded meta data with 23 key-value pairs and 995 tensors from user_data\models\Noromaid-v0.1-mixtral-8x7b.q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Noromaid-v0.1-mixtral-8x7b
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.expert_count u32 = 8
llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: general.file_type u32 = 15
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 32 tensors
llama_model_loader: - type q8_0: 64 tensors
llama_model_loader: - type q4_K: 833 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 24.62 GiB (4.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 8
print_info: n_expert_used = 2
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8x7B
print_info: model params = 46.70 B
print_info: general.name = Noromaid-v0.1-mixtral-8x7b
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: missing tensor 'blk.0.ffn_down_exps.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'user_data\models\Noromaid-v0.1-mixtral-8x7b.q4_k_m.gguf'
main: exiting due to model loading error
21:42:26-828093 ERROR Error loading the model with llama.cpp: Server process terminated unexpectedly with exit code: 1
```

NeverSleep org

You are aware that this model is one year old at this point?

@metsavana Indeed, you'd need an older version of llama.cpp, as llama.cpp broke compatibility with essentially all older Mixtral models. If you are interested in this model, it's no issue for us to redo it so it works with modern tools (and probably at higher quality). Just say the word.
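
(For anyone who wants to check a file before trying to load it: the `gguf` Python package from the llama.cpp repo can dump tensor names. The error in the log shows that current llama.cpp expects fused expert tensors like `blk.0.ffn_down_exps.weight`; that the old layout stored one tensor per expert, e.g. `blk.0.ffn_down.0.weight`, is my assumption here, so treat this as a minimal sketch:)

```python
# Sketch: inspect a GGUF's tensor names to spot the old split-expert Mixtral
# layout. Assumes the `gguf` Python package from the llama.cpp repo
# (pip install gguf); the old per-expert naming is an assumption.
from gguf import GGUFReader

reader = GGUFReader("user_data/models/Noromaid-v0.1-mixtral-8x7b.q4_k_m.gguf")
names = {t.name for t in reader.tensors}

if "blk.0.ffn_down_exps.weight" in names:
    print("fused expert tensors: should load in current llama.cpp")
elif "blk.0.ffn_down.0.weight" in names:  # one tensor per expert (expert 0)
    print("old split-expert layout: needs an older llama.cpp or a requant")
else:
    print("no expert tensors found (dense model, or an unexpected layout)")
```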

I was honestly not aware of this and the fact that the model was over a year old. Sorry about that. I should've paid more attention.

I would be very grateful if you could update the model. I would definitely use the GGUF version over the original GPTQ, because for some reason my setup doesn't like the latter format.

Actually, we already re-did this model 5 months ago, and it does load in current llama.cpp, although I had to enable jinja support for the model template:

https://hf.tst.eu/model#Noromaid-v0.1-mixtral-8x7b-Instruct-v3-i1-GGUF
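
(If you want to verify that a quant ships the Jinja chat template that llama-server's `--jinja` flag applies, you can read the `tokenizer.chat_template` metadata key. A rough sketch with the same `gguf` package; the path is a placeholder, and the string-decoding recipe follows the gguf_dump example shipped with llama.cpp:)

```python
# Sketch: check whether a GGUF embeds a Jinja chat template (the thing
# llama-server's --jinja flag applies). Assumes the `gguf` package; string
# fields are decoded from the last part, per llama.cpp's gguf_dump example.
from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")  # placeholder path
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("no embedded chat template; llama.cpp falls back to its default")
else:
    template = str(bytes(field.parts[-1]), encoding="utf-8")
    print(template[:200])  # first 200 chars of the Jinja template
```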
