Sampling parameters from the GitHub repo
Just in case this is helpful in the absence of official recommended sampling parameters, here's what I found in the repo:
Sampling parameters for running inference with this model are listed in several files.
Default Parameters
The default sampling parameters for the model are specified in `models/generation_config.json`:

```json
{
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8,
  "repetition_penalty": 1.05
}
```
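If you're running the model with plain transformers, here's a rough sketch of passing those defaults explicitly (the repo id, `trust_remote_code`, and the prompt are my assumptions, not something from the repo):

```python
# Hedged sketch: pass the generation_config.json defaults explicitly to generate().
# Omitting them should make transformers fall back to generation_config.json anyway.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-A13B-Instruct"  # assumed HF repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

messages = [{"role": "user", "content": "Explain what repetition_penalty does."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.05,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```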
Inference Examples
You can find these parameters being used in various scripts:
- In `agent/excel_demo/demo.py` and `agent/mcp_demo/demo.py`, the `client.chat.completions.create` method is called with `{ "temperature": 0.5, "top_k": 20, "top_p": 0.7, "repetition_penalty": 1.05 }` (rough sketch of this kind of call below the list).
- In `examples/eval_demo_vllm.py`, the parameters are `{ "temperature": 0.7, "top_k": 20, "top_p": 0.6, "repetition_penalty": 1.05 }`.
- In `inference/openapi.sh`, a `curl` command uses `{ "temperature": 0.7, "top_k": 20, "top_p": 0.6, "repetition_penalty": 1.05 }`.
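For reference, a hedged sketch of the kind of call those demo scripts make, assuming an OpenAI-compatible server (e.g. vLLM) on localhost:8000 and that the served model name matches the repo id (both are my assumptions). Note that `top_k` and `repetition_penalty` aren't standard OpenAI fields, so with the `openai` client they go through `extra_body`:

```python
# Hedged sketch of an OpenAI-compatible chat call with the demo-script parameters.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # assumed served model name
    messages=[{"role": "user", "content": "Summarize this spreadsheet for me."}],
    temperature=0.5,
    top_p=0.7,
    # Non-standard sampling fields are forwarded to the backend via extra_body.
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
```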
EDIT: Right now the only way to stop it from thinking in llama-server is with `/no_think` (I tested someone else's GGUF).
Also, it looks to have hybrid reasoning just like Qwen:
Our model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning:

- Pass `enable_thinking=False` when calling `apply_chat_template`.
- Adding `/no_think` before the prompt will force the model not to use CoT reasoning. Similarly, adding `/think` before the prompt will force the model to perform CoT reasoning.
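Here's a rough sketch of both toggles via transformers' `apply_chat_template`, assuming the chat template accepts `enable_thinking` the way the model card describes (the repo id and prompts below are my assumptions):

```python
# Hedged sketch: two ways to turn CoT reasoning off, per the model card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True  # assumed repo id
)

messages = [{"role": "user", "content": "What is 17 * 23?"}]

# Option 1: disable CoT through the chat-template flag.
prompt_no_cot = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Option 2: prefix the prompt with /no_think (or /think to force reasoning).
messages_prefixed = [{"role": "user", "content": "/no_think What is 17 * 23?"}]
prompt_prefixed = tokenizer.apply_chat_template(
    messages_prefixed, tokenize=False, add_generation_prompt=True
)
```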
Besides using `/no_think` and `/think`, if it works just like Qwen in llama-server, you should be able to disable reasoning like this:

```
--jinja ^
--reasoning-budget 0 ^
--reasoning-format none ^
```
And to enable it:

```
--jinja ^
--reasoning-budget -1 ^
--reasoning-format none ^
```
But I will have to test it to be absolutely sure once we have the quants up :)
P.S. - I'm just providing this in the hope it helps with your "how to run Hunyuan" page, if you plan to make one for this bad boy and choose to use any of this info.
I made quants, sorry for the delay - I was confirming why the model had a huge perplexity score (180 and upwards). I verified the quants we just uploaded should be fine. Please use:
```
./llama.cpp/llama-cli -hf unsloth/Hunyuan-A13B-Instruct-GGUF:Q4_K_XL -ngl 99 --jinja --temp 0.7 --top-k 20 --top-p 0.8 --repeat-penalty 1.05
```