LM Studio with GGUF

#1
by qixing - opened

The GGUF model is sourced from mradermacher/Seed-X-Instruct-7B-GGUF (Q8). After loading the model, I tried calling it from the Chat interface with the following prompt:

"Translate the following English sentence into Chinese:\nMay the force be with you"

However, the response I received was garbled text.
Is this issue caused by incorrect configuration settings, or does the GGUF model simply not support chat-based interactions?
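
For anyone who wants to reproduce this outside the GUI, here is a minimal sketch against LM Studio's OpenAI-compatible local server; the port is LM Studio's default, and the model identifier is a placeholder, so substitute whatever your instance reports:

# Minimal repro against LM Studio's OpenAI-compatible local server.
# Assumes the server is running on LM Studio's default port (1234);
# the model identifier below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="seed-x-instruct-7b",  # placeholder: use the id LM Studio displays
    messages=[{
        "role": "user",
        "content": "Translate the following English sentence into Chinese:\nMay the force be with you",
    }],
)
print(response.choices[0].message.content)  # comes back garbled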

I'm getting the same issue on plain llama.cpp (using llama-cli) with the Q6_0 quant.

This quantized model isn't an official release; the issues might stem from the quantization process itself. We would recommend using the officially released model weights and running inference in BF16.
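
For reference, here is a minimal sketch of BF16 inference with transformers; the repo id and the <zh> target-language tag (which appears in the output example later in this thread) are assumptions, so check the official README for the exact prompt format:

# Sketch of BF16 inference with transformers. The repo id and the <zh>
# language tag are assumptions; consult the official README for specifics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-X-Instruct-7B"  # assumed official repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Translate the following English sentence into Chinese:\nMay the force be with you <zh>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))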

Since the GGUF still has problems, I'll hold off on trying it for now. I suspect the translation quality won't necessarily be better than deepseek-v2-lite-chat, and it certainly won't be faster than deepseek-v2-lite-chat either.

Thank you for your interest in our model.

I must address a key factual point in your comparison. As detailed in both our README and the technical report (https://arxiv.org/pdf/2507.13618), our model has been shown to outperform DeepSeek's current largest and most powerful model, DeepSeek-V3/R1 (671B), on both automatic benchmarks and human evaluations. It's worth bearing in mind that our model has only 7B parameters; for a model of this size, its inference speed is highly competitive.

We are actively working on providing official quantized versions of the model. Stay tuned!

ByteDance Seed org

@Doctor-Chad-PhD @qixing Thank you for trying. We will release the official quantized version as soon as possible.

Actually, vLLM can run the Q4 GGUF from mradermacher and it performs great. But llama.cpp and LM Studio don't work at all, for both the PPO and Instruct variants, and with both the /v1/completions and /v1/chat/completions endpoints. vLLM's VRAM overhead is a joke, though, so is there any chance of getting it to run in llama.cpp/LM Studio?
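
For anyone who wants to try the same setup, this is roughly how the Q4 GGUF loads under vLLM's (experimental) GGUF support; the local file path is an assumption, and vLLM needs the original HF tokenizer alongside the GGUF:

# Sketch of loading the Q4 GGUF in vLLM (GGUF support there is experimental).
# The file path and tokenizer repo id are assumptions; vLLM wants the original
# HF tokenizer because tokenizer metadata in GGUF conversions can be incomplete.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/Seed-X-Instruct-7B.Q4_K_M.gguf",  # local path to the mradermacher quant
    tokenizer="ByteDance-Seed/Seed-X-Instruct-7B",   # assumed official repo id
)
params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(
    ["Translate the following English sentence into Chinese:\nMay the force be with you <zh>"],
    params,
)
print(outputs[0].outputs[0].text)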

It works a bit better when you create a special_tokens_map.json file in the same directory before quantizing, but it's not a complete fix:

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

Output:

 Translate the following English sentence into Chinese:
Hello! <zh> 你好! [end of text]
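
If you go this route, a quick sanity check before re-running the conversion is to confirm the tokenizer actually resolves those special tokens to ids; a small sketch, with the model directory path as an assumption:

# Sanity check: load the tokenizer from the local model directory (the one
# containing special_tokens_map.json) and confirm BOS/EOS/UNK resolve to ids.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/models/Seed-X-Instruct-7B")  # assumed local path
for name in ("bos_token", "eos_token", "unk_token"):
    token = getattr(tok, name)
    token_id = tok.convert_tokens_to_ids(token) if token is not None else None
    print(f"{name}: {token!r} -> id {token_id}")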
