LM Studio with GGUF

#1
by qixing - opened

The GGUF model is sourced from mradermacher/Seed-X-Instruct-7B-GGUF (Q8). After loading the model, I tried calling it from the Chat interface with the following prompt:

"Translate the following English sentence into Chinese:\nMay the force be with you"

However, the response I received was garbled text.
Is this issue caused by incorrect configuration settings, or does the GGUF model simply not support chat-based interactions?
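
For anyone who wants to reproduce this outside the GUI, here is a minimal sketch against LM Studio's OpenAI-compatible local server; the port is LM Studio's default, and the model identifier is a placeholder, so substitute whatever your instance reports:

# Minimal repro against LM Studio's OpenAI-compatible local server.
# Assumes the server is running on LM Studio's default port (1234);
# the model identifier below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="seed-x-instruct-7b",  # placeholder: use the id LM Studio displays
    messages=[{
        "role": "user",
        "content": "Translate the following English sentence into Chinese:\nMay the force be with you",
    }],
)
print(response.choices[0].message.content)  # comes back garbled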

I'm getting the same issue on plain llama.cpp (using llama-cli) with the Q6_0 quant.

This quantized model isn't an official release; the issues might stem from the quantization process itself. We would recommend using the officially released model weights and running inference in BF16.
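
For reference, here is a minimal sketch of BF16 inference with transformers; the repo id and the <zh> target-language tag (which appears in the output example later in this thread) are assumptions, so check the official README for the exact prompt format:

# Sketch of BF16 inference with transformers. The repo id and the <zh>
# language tag are assumptions; consult the official README for specifics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-X-Instruct-7B"  # assumed official repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Translate the following English sentence into Chinese:\nMay the force be with you <zh>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))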

Since the GGUF still has problems, I'll hold off on trying it for now. I suspect the translation quality won't necessarily be better than deepseek-v2-lite-chat, and it certainly won't be faster than deepseek-v2-lite-chat either.

Thank you for your interest in our model.

I must address a key factual point in your comparison. As detailed in both our README and the technical report (https://arxiv.org/pdf/2507.13618), our model has been shown to outperform DeepSeek's current largest and most powerful model, DeepSeek-V3/R1 (671B), on both automatic benchmarks and human evaluations. It's worth bearing in mind that our model has only 7B parameters; for a model of this size, its inference speed is highly competitive.

We are actively working on providing official quantized versions of the model. Stay tuned!

ByteDance Seed org

@Doctor-Chad-PhD @qixing Thank you for trying. We will release the official quantized version as soon as possible.

Actually, vLLM can run the Q4 GGUF from mradermacher and it performs great. But llama.cpp and LM Studio don't work at all, for both the PPO and Instruct variants, and with both the /v1/completions and /v1/chat/completions endpoints. vLLM's VRAM overhead is a joke, though, so is there any chance of getting it to run in llama.cpp/LM Studio?
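
For anyone who wants to try the same setup, this is roughly how the Q4 GGUF loads under vLLM's (experimental) GGUF support; the local file path is an assumption, and vLLM needs the original HF tokenizer alongside the GGUF:

# Sketch of loading the Q4 GGUF in vLLM (GGUF support there is experimental).
# The file path and tokenizer repo id are assumptions; vLLM wants the original
# HF tokenizer because tokenizer metadata in GGUF conversions can be incomplete.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/Seed-X-Instruct-7B.Q4_K_M.gguf",  # local path to the mradermacher quant
    tokenizer="ByteDance-Seed/Seed-X-Instruct-7B",   # assumed official repo id
)
params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(
    ["Translate the following English sentence into Chinese:\nMay the force be with you <zh>"],
    params,
)
print(outputs[0].outputs[0].text)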

It works a bit better when you create a special_tokens_map.json file in the same directory before quantizing, but it's not a complete fix:

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

Output:

 Translate the following English sentence into Chinese:
Hello! <zh> 你好! [end of text]
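
If you go this route, a quick sanity check before re-running the conversion is to confirm the tokenizer actually resolves those special tokens to ids; a small sketch, with the model directory path as an assumption:

# Sanity check: load the tokenizer from the local model directory (the one
# containing special_tokens_map.json) and confirm BOS/EOS/UNK resolve to ids.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/models/Seed-X-Instruct-7B")  # assumed local path
for name in ("bos_token", "eos_token", "unk_token"):
    token = getattr(tok, name)
    token_id = tok.convert_tokens_to_ids(token) if token is not None else None
    print(f"{name}: {token!r} -> id {token_id}")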
