Qwen3-Zero-Coder-Reasoning-0.8B-NEO-EX-GGUF

This is a coder/programming model with full reasoning, built on the Qwen 3 platform, and it is insanely fast - hitting over 150 t/s on moderate hardware, and 50 t/s+ on CPU only...
This is a generalist coding model, good for code blocks, brainstorming coding ideas, and generating draft code fast.
With reasoning, it can handle complex code requests too.
It contains 42 layers (a merge of TWO 0.6B models), and 464 tensors - a very dense model for this size.
The GGUFs have been augmented with the NEO Imatrix dataset - including the Q8s, F16s, and BF16s (in the NEO2 and NEO3 versions).
There are THREE versions of NEO GGUFs in this repo as well, to take advantage of the unique properties of this model.
As odd as this sounds, lower to mid quants work best for some use cases because the Imatrix effect is stronger in these quants (see below).
At these quants the model can code better, seems to make better decisions (rather than hesitating a lot), and sometimes generates SMALLER reasoning blocks [1/4 to 1/2 the size].
Likewise, lower quants often come up with "outside the box" and/or less complex - but nevertheless working - solutions.
Higher quants work well, but can generate longer reasoning blocks; HOWEVER, in some cases they come up with better solutions (relative to smaller quants).
For these reasons I suggest you download at least 2 quants and compare operations for your use case(s).
IQ3_M will work well for many use cases, at over 150 T/S; IQ4s/Q4s give the best of the Imatrix with maximum bits (balanced); Q8 is very strong, and BF16 and F16 are at full power (see special notes on BF16 vs F16 below).
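If you want to grab two quants programmatically to compare them, here is a minimal sketch using the huggingface_hub library (the exact GGUF filenames below are assumptions - check this repo's file list for the real names):

```python
# Minimal sketch: download two quants from this repo to compare side by side.
# NOTE: the filenames below are guesses - check the repo's file list for the
# actual GGUF names before running.
from huggingface_hub import hf_hub_download

REPO = "DavidAU/Qwen3-Zero-Coder-Reasoning-0.8B-NEO-EX-GGUF"

for filename in (
    "Qwen3-Zero-Coder-Reasoning-0.8B-NEO-EX-IQ3_M.gguf",  # hypothetical filename
    "Qwen3-Zero-Coder-Reasoning-0.8B-NEO-EX-Q6_K.gguf",   # hypothetical filename
):
    path = hf_hub_download(repo_id=REPO, filename=filename)
    print("Downloaded:", path)
```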
For NEO ggufs:
- Standard NEO Imatrix GGUFs.
- Q8, F16, BF16 are NOT Imatrixed, nor do they contain Imatrixed tensors/elements.
For NEO2 ggufs ("NEO2" in the filenames):
- GGUFs are Imatrixed AND the output tensor is also Imatrixed.
- For Q8, F16, BF16: the output tensor is set at Q6_K (Imatrixed).
For NEO3 ggufs ("NEO3" in the filenames):
- GGUFs are Imatrixed AND the output tensor is Imatrixed and set at IQ4_XS for all quants (including Q8, F16, BF16).
Operation:
This model will generate "reasoning block(s)" to solve your coding problem.
Good directions, with "dos" and "don'ts", combined with DETAILED PROMPT(s), will yield the best results.
I suggest 2-3 generations for best results AND/OR 2-3 generations on 2 different quants, e.g. IQ3_M and Q5_K_M/Q6/Q8.
I find the 2nd (and later) generations are better than the first, even if you open a new chat for each.
That being said, this model can repeat code blocks from time to time (most commonly in higher quants, where it can also generate multiple versions of the code), and/or may need to be manually stopped.
These issues are present in other Qwen models of this size.
Although there are sampler/parameter settings to address this, they can have a negative effect on code generation.
For lower quants (IQ2s, Q2s, and lower IQ3s):
- Increase the details in your instructions.
- Suggest 2-4 generations for best results.
Quant Advice:
Although usually the advice is to use the biggest quant you can, in this case smaller quants - IQ3_M, Q4s, IQ4s - may yield better results in some use cases.
This is due in part to the NEO Imatrix dataset (the dataset's effect is STRONGER the smaller the quant).
Note that the highest quants operate really well, but tend to get "lost in the woods" more.
To address this:
- Add additional details and conditions in your prompt to FOCUS the model on the core problems.
- If, during generation, it appears the model is "getting lost on details": stop generation and regenerate.
Q8s, F16, BF16
There are three of each of these.
The first set is standard (not Imatrixed), the second set (NEO2) has the output tensor set at Q6_K (which is also Imatrixed), and the third set (NEO3) has the output tensor set at IQ4_XS (which is also Imatrixed).
F16 or BF16?
Interestingly, F16 works better than BF16 in some cases. This is odd because the source model's native weights are in BF16, and converting to F16 introduces a slight rounding effect.
Settings:
This model requires:
- Jinja (embedded) or CHATML template (example below).
- Max context of 40k.
- Suggested minimum context of 8k to 16k.
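For reference, a rough sketch of the CHATML layout the embedded Jinja template produces (most front ends and llama.cpp-based loaders apply this automatically from the GGUF metadata; you only need it for raw/manual prompting):

```python
# Sketch of a CHATML-formatted prompt (the embedded Jinja template builds this
# for you in most front ends; shown here only for raw prompting).
prompt = (
    "<|im_start|>user\n"
    "Write a Python function that parses a CSV file. "
    "Do use only the standard library. Do not add a GUI.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```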
Settings used for testing (suggested):
- Temp .3 to .7
- Rep pen 1.05 to 1.1
- Topp .8, minp .05
- Topk 20
- No system prompt.
Settings used for testing #2 (suggested):
- Temp .55
- Rep pen 1.05
- Topp .95, minp .05
- Topk 100
- No system prompt.
Settings used for testing #3 (suggested - my fav):
- Temp .6
- Rep pen 1.1
- Topp .95, minp .0
- Topk 20
- No system prompt.
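As a concrete example, here is a minimal sketch of running one of the GGUFs with llama-cpp-python using the "testing #3" settings above (the quant filename and the prompt are placeholders - swap in whichever quant you downloaded):

```python
# Minimal sketch: run this model with the "testing #3" settings via llama-cpp-python.
# The GGUF filename is a placeholder - point it at the quant you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-Zero-Coder-Reasoning-0.8B-NEO-EX-IQ4_XS.gguf",  # placeholder
    n_ctx=16384,  # suggested minimum context of 8k-16k (model max is 40k)
)

out = llm.create_chat_completion(
    messages=[
        # No system prompt, per the settings above; "dos" and "don'ts" go in the user turn.
        {
            "role": "user",
            "content": "Write a Python function that parses a CSV file. "
                       "Do use only the standard library. Do not add a GUI.",
        },
    ],
    temperature=0.6,
    repeat_penalty=1.1,
    top_p=0.95,
    min_p=0.0,
    top_k=20,
    max_tokens=4096,
)
print(out["choices"][0]["message"]["content"])
```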
This model will respond well to both detailed instructions and step by step refinement and additions to code.
As this is an instruct model, it will also benefit from a detailed system prompt.
For simpler coding problems, lower quants will work well; but for complex/multi-step problem solving suggest Q6 or Q8.
With this model, you should use statements telling it what you want and what to disallow, to help keep it on track.
For more information / other Qwen/Mistral Coders / additional settings see:
[ https://huggingface.co/DavidAU/Qwen2.5-MOE-2x-4x-6x-8x__7B__Power-CODER__19B-30B-42B-53B-gguf ]
Help, Adjustments, Samplers, Parameters and More
CHANGE THE NUMBER OF ACTIVE EXPERTS:
See this document:
https://huggingface.co/DavidAU/How-To-Set-and-Manage-MOE-Mix-of-Experts-Model-Activation-of-Experts
Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:
In "KoboldCpp" or "oobabooga/text-generation-webui" or "Silly Tavern" ;
Set the "Smoothing_factor" to 1.5
: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"
: in text-generation-webui -> parameters -> lower right.
: In Silly Tavern this is called: "Smoothing"
NOTE: For "text-generation-webui"
-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)
Source versions (and config files) of my models are here:
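If you need those config files, here is a minimal sketch using huggingface_hub (the repo id is the base model listed at the bottom of this card; the allow_patterns filter is an assumption about which files the "llama_HF" loader needs - adjust if it asks for more):

```python
# Minimal sketch: pull the tokenizer/config JSON files from the source
# (non-GGUF) repo for text-generation-webui's "llama_HF" loader.
# The allow_patterns filter is an assumption - add files if the loader complains.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="DavidAU/Qwen3-Zero-Coder-Reasoning-0.8B",
    allow_patterns=["*.json", "tokenizer*"],
    local_dir="Qwen3-Zero-Coder-Reasoning-0.8B-config",
)
```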
OTHER OPTIONS:
Increase rep pen to 1.1 to 1.15 (you don't need to do this if you use "smoothing_factor")
If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing") just make the adjustment as noted.
Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers
This a "Class 1" model:
For all settings used for this model (including specifics for its "class"), including example generation(s) and for advanced settings guide (which many times addresses any model issue(s)), including methods to improve model performance for all use case(s) as well as chat, roleplay and other use case(s) please see:
You can see all parameters used for generation, in addition to advanced parameters and samplers to get the most out of this model here:
Base model: DavidAU/Qwen3-Zero-Coder-Reasoning-0.8B