Why many small files?
I'm confused: why so many little pieces of a file instead of one 30GB file? I don't think my software can work with that.
It can, you just merge them. Ask Claude 3.7 (reasoning) how to do it, since it's the smartest AGI/ASI LLM atm. But yeah, it's really unnecessary to split a Q2 into 4 files.
You can use llama.cpp for this: use the llama-gguf-split binary you find in the root of the repo:
./llama-gguf-split --merge qwq-32b-q4_k_m-00001-of-00005.gguf qwq-32b-q4_k_m.gguf
You can give only the first part and it automatically finds the rest for you.
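Something like this, just a sketch (the exact binary location depends on how you built llama.cpp; with a cmake build it usually ends up in build/bin):
# build once, then merge by pointing --merge at the first part; the tool locates the remaining parts itself
cmake -B build && cmake --build build --config Release
./build/bin/llama-gguf-split --merge qwq-32b-q4_k_m-00001-of-00005.gguf qwq-32b-q4_k_m.gguf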
Just to add: a simple ls -v qwq-32b-q4_k_m-0000* | xargs cat > qwq-32b-q4_k_m.gguf
won't produce a correct output GGUF, since each split is its own GGUF file with its own header rather than a raw slice of bytes.
I just use Backyard, which runs on llama.cpp I think, and it doesn't like split files. I found this: https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF/tree/main Seems to work :)
It's not hard to merge them but it is very inconvenient to download them
But at least now each part can be stored on a single-sided single-layered DVD disc or a USB stick with FAT32
Are you planning to do this?
@owao of course not, im joking
Hahaha that was really funny! I had a good laugh yeah, especially imagining you were actually serious!
Thanks!
If anyone needs a more convenient way to download a certain quant, here is a tiny script feeding filtered URLs to aria2c:
Usage: ./hfdl.sh https://huggingface.co/Qwen/QwQ-32B-GGUF "qwq-32b-q4_k_m*"
#!/bin/bash
[ $# -lt 2 ] && { echo "Usage: $0 <huggingface_repo_url> <filename_with_wildcard>"; exit 1; }
REPO_URL=$1; PATTERN=$2
[[ $REPO_URL =~ https://huggingface.co/([^/]+)/([^/]+) ]] && REPO="${BASH_REMATCH[1]}/${BASH_REMATCH[2]}"
DATA=$(curl -s "https://huggingface.co/api/models/$REPO")
# Convert glob to regex: escape dot; replace * with .*, ? with .
REGEX=$(echo "$PATTERN" | sed 's/\./\\./g; s/\*/.*/g; s/\?/./g')
FOUND=0
FILES=$(echo "$DATA" | grep -oE '"r?filename": *"[^"]*"' | cut -d'"' -f4)
for FILE in $FILES; do
    if [[ $(basename "$FILE") =~ $REGEX ]]; then
        echo "Downloading: $FILE"
        aria2c -x 5 -s 5 -j 5 "$REPO_URL/resolve/main/$FILE" -o "$FILE"
        FOUND=1
    fi
done
[ $FOUND -eq 0 ] && echo "No files found matching: $PATTERN" && exit 1
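If you already have the huggingface_hub package installed, something like this should be a rough equivalent (a sketch; check huggingface-cli download --help for the exact flags in your version):
# download only the files matching a glob pattern into the current directory
huggingface-cli download Qwen/QwQ-32B-GGUF --include "qwq-32b-q4_k_m*" --local-dir .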
I tried these quants in hopes they would work since they are official, but no, they are also broken (yes, I do use temp 0.6)
Dude, you can take them here, on Ollama, or from bartowski, none are broken. I don't know what you did, but something is wrong on your side.
Maybe your layers got messed up. Did you merge the splits into 1 file? How did you do it?
@ceoofcapybaras
I get absolutely zero loops.
I'm sorry, I didn't keep any of the GGUFs to try out directly through a llama.cpp server; for my everyday use I use Ollama, which splits the GGUFs into many small blobs named only by their hashes. I'm on the latest stable (0.5.13) but unfortunately I can't manage to find which build of llama.cpp it's using :/
But what I can say is that I obtained my GGUFs, all in Q4_K_M, from several sources:
- ollama
- from here
- and I also quantized from the source safetensors using nexaquant, still in Q4_K_M
Had no issue with any of them, zero loops or any other weird behavior.
I can try to run it with a small quant size directly through llama.cpp tomorrow, but only once you've answered how you merged it :D
@owao no, I tried both of them, this is not a local issue; it is discussed here. If some of the quants do not loop on hard tasks and work for you, please tell me which ones and what version of llama.cpp you are using.
OK, I found I still had the FP16 from my original conversion using nexa.
So I quantized it down to Q4_K_M (still using nexa, but pretty sure the result would be the same with llama.cpp directly).
- I built llama.cpp b4837 (latest) for CUDA support with
~/D/llama.cpp-b4837 ❯❯❯ cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
(not really sure it was necessary)
then
~/D/llama.cpp-b4837 ❯❯❯ cmake --build build --config Release -j 12
- Then I ran a server using
~/D/llama.cpp-b4837 ❯❯❯ ./build/bin/llama-server -m /home/user/QwQ-32B-q4_k_m.gguf --port 8085 -ngl 65 -c 13000 -fa
(can only load 13k max with Q4_K_M instead of 16k because nexa quants are a bit larger than regular ones)
- I used OpenWebUI as client, using
http://localhost:8085/v1
as endpoint
--> No repetition, no weird behavior. See 3 examples in a row here: https://privatebin.net/?86a5feb70d48eb9a#14rn3Mq1aqD1RfreszVCxwosxxW9aPWBhov8pD5mrxcd
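In case anyone wants to reproduce without OpenWebUI, a quick curl against the server's OpenAI-compatible endpoint should do (just a sketch; as far as I know llama-server ignores the model name, so that field is arbitrary):
# simple chat completion request against the local llama-server instance
curl http://localhost:8085/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq-32b", "messages": [{"role": "user", "content": "hey!"}], "temperature": 0.6}'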
I also tried with a simple 'hey!' or, at the other extreme, the example given in your Reddit thread, which led to an 18k output (nonsense here, as it forgot the initial instructions before ending, so of course the result was trash, but still: no repetition, no weird behavior).
The F16 file I had was already a conversion of the original safetensors using nexa. To be rigorous I'd have to redo the whole process using only llama.cpp... But honestly, I think it's fair here to say something is wrong on your side. These nexa quants are just an experiment, and apart from them, I'm using the original quants (still Q4_K_M) from here, which are fine too. Even if the model doesn't loop, I might end up doing so writing all this lol
But wait a minute, in the initial instructions, the user actually only asked for the quants versions and the llama.cpp build number. Maybe I should stop here and present my conclusion. But wait, maybe the user wants a more complete answer. I guess they can read up and find it if ever they need it. As I'm now confident enough, I will write only a concise answer to the original question. Wait yes, that seems adapted. Now let's write the final answer.
Solution
Quantization level used: Q4_K_M
Llama.cpp build number: b4837
@owao
sorry, it was a bit late for me and my internet went off, thanks for trying different options
I used your settings and your prompt, and different versions of llama.cpp, tried a few different quants (didn't merge these, just provided the path to part 1)
In the end I think I found the issue: it's probably related to llama.cpp's handling of context length capping and parallel processing, since increasing the context to 13k helped enough to answer SOME questions before it started repeating itself.
With small values like 1024 it starts repeating itself very early. It should either answer a question within 1024 tokens or just hit the length limit before the full answer; that works fine with AWQ, but in llama.cpp it just loops. Turns out --parallel 10 is another argument to blame: the context length gets cut by 10, so it was on my side.
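For anyone hitting the same thing, a rough illustration (my own paths and numbers; as far as I understand, llama-server simply divides -c across the parallel slots):
# with --parallel 10, the 13000-token context gets split into 10 slots of ~1300 tokens each,
# so long reasoning overruns its slot and starts looping
./build/bin/llama-server -m qwq-32b-q4_k_m.gguf -c 13000 --parallel 10 -ngl 65 -fa
# dropping --parallel (or scaling -c up by the same factor) gives each request the full context
./build/bin/llama-server -m qwq-32b-q4_k_m.gguf -c 13000 -ngl 65 -fa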
Other than the test prompt "You have to find 10 sentences that end with the word “apple”.", how long are your other prompts? Usually, if you set something small like 1024 with a challenging prompt, it of course won't finish its reasoning within the token window. In that case, the user question is no longer visible and the model only sees the ongoing reasoning it is developing. And while I've mostly observed that in such cases the model ends its reasoning prematurely (it couldn't mature anyway), at let's say ~1200-1400 tokens, I think looping could also be another symptom.
If you feel like sharing, I can try some of your prompts, because apart from a generation overrunning the token window, I don't see what could cause the disparity between my trials and yours.
Did you run llama-cli or the server with any parameters other than -c?
I started writing my msg before you edited yours. Glad you figured it out. And I learnt something ;)
I just found this, just in case it helps you or others with the endless generation issue: https://unsloth.ai/blog/qwq-32b
You don't need to merge them if you're using a normal launcher app. All llama.cpp-based launchers like LM Studio or oobabooga support loading models from part files if you just pick the first file in the folder, EXCEPT OLLAMA, which is behind the whole industry. It's really strange why they are holding back an update already implemented in llama.cpp itself (maybe it's about their own strange distribution of models, which doesn't work with large models for the reason below). Ollama is just the weirdest compared to LM Studio or oobabooga, where you don't need to do that stuff with re-encoding or merging; you just drop the model in a folder and it loads right away.
A model in parts is very normal practice, even an important one, because downloading the original DeepSeek R1 model in Ollama as 1 file of half a terabyte, like they always require, is an adventure for most Internet providers and can take at least 2 days minimum. When I downloaded it in parts there were 2 broken attempts (connection reset somewhere between my provider and Hugging Face); I would have spent a week if I were downloading it as 1 file and starting over again (in the Firefox browser broken downloads can't be resumed at all, ask me how I found this out; in Chrome they can be continued if you're lucky).
Summary: The list of all available launchers is in the llama.cpp README file. Loading parts is implemented in the llama.cpp core; Ollama refuses to update their software for now.
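For reference, llama.cpp itself will also load a split model if you just point it at the first part, something like this (a sketch using the file names from this thread; it detects and loads the remaining parts on its own):
# load a split GGUF directly; no merging needed
./build/bin/llama-cli -m qwq-32b-q4_k_m-00001-of-00005.gguf -p "hey!"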