Sampling parameters from the GitHub repo
Just in case this is helpful in the absence of official recommended sampling parameters, here's what I found in the repo:
Sampling parameters for running inference with this model are listed in several files.
Default Parameters
The default sampling parameters for the model are specified in `models/generation_config.json`:

```json
{
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8,
  "repetition_penalty": 1.05
}
```
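If you're running the model with plain transformers, here's a rough sketch of passing those defaults explicitly (the repo id, `trust_remote_code`, and the prompt are my assumptions, not something from the repo):

```python
# Hedged sketch: pass the generation_config.json defaults explicitly to generate().
# Omitting them should make transformers fall back to generation_config.json anyway.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-A13B-Instruct"  # assumed HF repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

messages = [{"role": "user", "content": "Explain what repetition_penalty does."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.05,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```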
Inference Examples
You can find these parameters being used in various scripts:
- In `agent/excel_demo/demo.py` and `agent/mcp_demo/demo.py`, the `client.chat.completions.create` method is called with `{ "temperature": 0.5, "top_k": 20, "top_p": 0.7, "repetition_penalty": 1.05 }` (rough sketch of this kind of call below the list).
- In `examples/eval_demo_vllm.py`, the parameters are `{ "temperature": 0.7, "top_k": 20, "top_p": 0.6, "repetition_penalty": 1.05 }`.
- In `inference/openapi.sh`, a `curl` command uses `{ "temperature": 0.7, "top_k": 20, "top_p": 0.6, "repetition_penalty": 1.05 }`.
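For reference, a hedged sketch of the kind of call those demo scripts make, assuming an OpenAI-compatible server (e.g. vLLM) on localhost:8000 and that the served model name matches the repo id (both are my assumptions). Note that `top_k` and `repetition_penalty` aren't standard OpenAI fields, so with the `openai` client they go through `extra_body`:

```python
# Hedged sketch of an OpenAI-compatible chat call with the demo-script parameters.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # assumed served model name
    messages=[{"role": "user", "content": "Summarize this spreadsheet for me."}],
    temperature=0.5,
    top_p=0.7,
    # Non-standard sampling fields are forwarded to the backend via extra_body.
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
```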
EDIT: Right now the only way to stop it from thinking in llama-server is with `/no_think` (I tested someone else's GGUF).
Also, it looks to have hybrid reasoning just like Qwen:
Our model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning:

- Pass `enable_thinking=False` when calling `apply_chat_template`.
- Adding `/no_think` before the prompt will force the model not to use CoT reasoning. Similarly, adding `/think` before the prompt will force the model to perform CoT reasoning.
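Here's a rough sketch of both toggles via transformers' `apply_chat_template`, assuming the chat template accepts `enable_thinking` the way the model card describes (the repo id and prompts below are my assumptions):

```python
# Hedged sketch: two ways to turn CoT reasoning off, per the model card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True  # assumed repo id
)

messages = [{"role": "user", "content": "What is 17 * 23?"}]

# Option 1: disable CoT through the chat-template flag.
prompt_no_cot = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Option 2: prefix the prompt with /no_think (or /think to force reasoning).
messages_prefixed = [{"role": "user", "content": "/no_think What is 17 * 23?"}]
prompt_prefixed = tokenizer.apply_chat_template(
    messages_prefixed, tokenize=False, add_generation_prompt=True
)
```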
Besides using `/no_think` and `/think`, if it works just like Qwen in llama-server, you should be able to disable reasoning like this:

```
--jinja ^
--reasoning-budget 0 ^
--reasoning-format none ^
```
And to enable it:

```
--jinja ^
--reasoning-budget -1 ^
--reasoning-format none ^
```
But I will have to test it to be absolutely sure once we have the quants up :)
P.S. - I'm just providing this in the hope it helps with your "how to run Hunyuan" page, if you plan to make one for this bad boy and choose to use any of this info.
I made quants, sorry for the delay - I was confirming why the model had a huge perplexity score (180 and upwards). I verified the quants we just uploaded should be fine. Please use:
```
./llama.cpp/llama-cli -hf unsloth/Hunyuan-A13B-Instruct-GGUF:Q4_K_XL -ngl 99 --jinja --temp 0.7 --top-k 20 --top-p 0.8 --repeat-penalty 1.05
```