feihu.hf committed
Commit 7827f3b · 1 Parent(s): 5cc01d4

update README

Files changed (2):
  1. README.md +14 -9
  2. params +13 -0
README.md CHANGED
@@ -1,12 +1,8 @@
  ---
  license: apache-2.0
  license_link: https://huggingface.co/Qwen/Qwen3-32B-GGUF/blob/main/LICENSE
- language:
- - en
  pipeline_tag: text-generation
- base_model: Qwen/Qwen3-32B-GGUF
- tags:
- - chat
+ base_model: Qwen/Qwen3-32B
  ---

  # Qwen3-32B-GGUF
@@ -42,6 +38,8 @@ For more details, including benchmark evaluation, hardware requirements, and inf

  ## Quickstart

+ ### llama.cpp
+
  Check out our [llama.cpp documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html) for more usage guide.

  We advise you to clone [`llama.cpp`](https://github.com/ggerganov/llama.cpp) and install it following the official guide. We follow the latest version of llama.cpp.
@@ -51,6 +49,16 @@ In the following demonstration, we assume that you are running commands under th
  ./llama-cli -hf Qwen/Qwen3-32B:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 40960 -n 32768 --no-context-shift
  ```

+ ### ollama
+
+ Check out our [ollama documentation](https://qwen.readthedocs.io/en/latest/run_locally/ollama.html) for more usage guide.
+
+ You can run Qwen3 with one command:
+
+ ```shell
+ ollama run hf.co/Qwen/Qwen3-32B-GGUF:Q8_0
+ ```
+
  ## Switching Between Thinking and Non-Thinking Mode

  You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
@@ -80,7 +88,7 @@ The word strawberries contains 3 instances of the letter r. [...]

  Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.

- To enable YARN:
+ To enable YARN in ``llama.cpp``:

  ```shell
  ./llama-cli ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
@@ -91,9 +99,6 @@ To enable YARN:
  > We advise adding the `rope_scaling` configuration only when processing long contexts is required.
  > It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.

- > [!NOTE]
- > The default `max_position_embeddings` in `config.json` is set to 40,960. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
-
  > [!TIP]
  > The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
 
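The Quickstart above assumes a working `llama-cli` binary. As a rough sketch of the "clone and install" step, a typical CMake build of llama.cpp looks like the following; the official build guide is authoritative, and GPU backend flags are omitted here:

```shell
# clone and build llama.cpp with CMake (CPU-only defaults shown;
# add backend flags such as -DGGML_CUDA=ON per the official guide)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# the binaries, including llama-cli, end up under build/bin/
```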
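For the per-turn thinking switch, a minimal interactive ollama session might look like this; the questions are placeholders, and only the trailing `/no_think` and `/think` tags matter:

```shell
ollama run hf.co/Qwen/Qwen3-32B-GGUF:Q8_0
# inside the interactive session:
>>> How many r's are in the word "strawberries"? /no_think
>>> Are you sure? Reason through it step by step. /think
```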
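Following the `factor` advice in the note above, a typical total context of about 65,536 tokens would pair with a rope scale of 2.0. A sketch mirroring the YaRN command in the README, where `...` stands for your other llama-cli arguments:

```shell
./llama-cli ... -c 65536 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768
```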
params ADDED
@@ -0,0 +1,13 @@
+ {
+   "stop": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "temperature": 0.6,
+   "min_p" : 0.00,
+   "repeat_penalty" : 1.0,
+   "presence_penalty" : 1.5,
+   "top_k" : 20,
+   "top_p" : 0.95,
+   "num_predict" : 32768
+ }
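The `params` file supplies default generation settings that ollama picks up when the model is pulled from the Hub. Assuming a local ollama server on its default port, the same values can also be passed per request through the REST API; the model tag matches the Quickstart command and the prompt is only a placeholder:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/Qwen/Qwen3-32B-GGUF:Q8_0",
  "prompt": "Give me a short introduction to large language models.",
  "options": {
    "temperature": 0.6,
    "min_p": 0.0,
    "repeat_penalty": 1.0,
    "presence_penalty": 1.5,
    "top_k": 20,
    "top_p": 0.95,
    "num_predict": 32768,
    "stop": ["<|im_start|>", "<|im_end|>"]
  }
}'
```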