feihu.hf committed · Commit 7827f3b · 1 Parent(s): 5cc01d4

update README
README.md CHANGED
@@ -1,12 +1,8 @@
 ---
 license: apache-2.0
 license_link: https://huggingface.co/Qwen/Qwen3-32B-GGUF/blob/main/LICENSE
-language:
-- en
 pipeline_tag: text-generation
-base_model: Qwen/Qwen3-32B
-tags:
-- chat
+base_model: Qwen/Qwen3-32B
 ---

 # Qwen3-32B-GGUF
@@ -42,6 +38,8 @@ For more details, including benchmark evaluation, hardware requirements, and inf

 ## Quickstart

+### llama.cpp
+
 Check out our [llama.cpp documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html) for more usage guide.

 We advise you to clone [`llama.cpp`](https://github.com/ggerganov/llama.cpp) and install it following the official guide. We follow the latest version of llama.cpp.
@@ -51,6 +49,16 @@ In the following demonstration, we assume that you are running commands under th
 ./llama-cli -hf Qwen/Qwen3-32B:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 40960 -n 32768 --no-context-shift
 ```

+### ollama
+
+Check out our [ollama documentation](https://qwen.readthedocs.io/en/latest/run_locally/ollama.html) for more usage guide.
+
+You can run Qwen3 with one command:
+
+```shell
+ollama run hf.co/Qwen/Qwen3-32B-GGUF:Q8_0
+```
+
 ## Switching Between Thinking and Non-Thinking Mode

 You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
@@ -80,7 +88,7 @@ The word strawberries contains 3 instances of the letter r. [...]

 Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.

-To enable YARN:
+To enable YARN in ``llama.cpp``:

 ```shell
 ./llama-cli ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
@@ -91,9 +99,6 @@ To enable YARN:
 > We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 > It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.

-> [!NOTE]
-> The default `max_position_embeddings` in `config.json` is set to 40,960. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
-
 > [!TIP]
 > The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
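A note on the `factor` guidance quoted in the diff above: in the llama.cpp command, `--rope-scale` corresponds to the YaRN scaling factor and `-c` to the total context window. A minimal sketch of the recommended adjustment for workloads around 65,536 tokens (only `-c` and `--rope-scale` differ from the command shown in the README; the `...` placeholder stays as-is) would be:

```shell
# Hypothetical variant of the README command: factor 2.0 instead of 4.0,
# i.e. the native 32,768-token context scaled to 65,536 tokens.
./llama-cli ... -c 65536 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768
```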
params ADDED
@@ -0,0 +1,13 @@
+{
+    "stop": [
+        "<|im_start|>",
+        "<|im_end|>"
+    ],
+    "temperature": 0.6,
+    "min_p" : 0.00,
+    "repeat_penalty" : 1.0,
+    "presence_penalty" : 1.5,
+    "top_k" : 20,
+    "top_p" : 0.95,
+    "num_predict" : 32768
+}
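The keys in the added `params` file are ollama option names (`num_predict`, `repeat_penalty`, and so on), so they presumably provide the default generation settings when the repo is pulled with `ollama run hf.co/Qwen/Qwen3-32B-GGUF:Q8_0` from the Quickstart. As a minimal sketch, the same options can also be overridden per request through ollama's HTTP API; this assumes a local ollama server on its default port 11434, the prompt text is only a placeholder, and `/no_think` is appended merely to illustrate the soft switch described earlier:

```shell
# Hypothetical request: option names mirror the params file; the values are
# examples, not recommendations from the model card.
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/Qwen/Qwen3-32B-GGUF:Q8_0",
  "prompt": "How many r'"'"'s are in the word strawberries? /no_think",
  "options": {
    "temperature": 0.6,
    "top_k": 20,
    "top_p": 0.95,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "num_predict": 4096
  }
}'
```

The stop strings `<|im_start|>` and `<|im_end|>` are the ChatML-style turn delimiters used by Qwen's chat template, which is why they are listed as stop tokens.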