Over 128k context on 1x 3090 Ti FE 24GB VRAM!

by ubergarm

I've had good luck in early testing using this model to summarize web content scraped and cleaned with html2text. It peaks at ~55 tok/sec on short (<8k token) prompts and slows to ~20 tok/sec on long-context generation.
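For reference, requests go through llama-server's built-in OpenAI-compatible API. A minimal summarization call against the server configured below would look something like this (the prompt and sampling parameters here are placeholders, not my exact workflow):

curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "Summarize the following web page text in a few bullet points."},
            {"role": "user", "content": "<html2text output goes here>"}
        ],
        "temperature": 0.3,
        "max_tokens": 512
    }'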

Next up: wiring this into my agent workflow to process websites concurrently. There's plenty of context to spare for multiple slots, which should push the total aggregate throughput up (see the notes after the command below)!

./llama-server \
    --model "../models/mradermacher/Qwen2.5-14B-Instruct-1M-i1-GGUF/Qwen2.5-14B-Instruct-1M.i1-IQ4_XS.gguf" \
    --n-gpu-layers 49 \
    --ctx-size 147456 \
    --parallel 1 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 16 \
    --flash-attn \
    --mlock \
    --n-predict -1 \
    --host 127.0.0.1 \
    --port 8080
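Back-of-envelope on why 144k of context fits in 24GB (assuming Qwen2.5-14B's config of 48 layers, 8 KV heads, and head dim 128; rough numbers, so check the model card): the KV cache stores 2 x 48 x 8 x 128 = 98304 values per token, and q8_0 costs about 1.06 bytes per value, so that's roughly 100 KiB/token, or around 14-15 GiB at 147456 tokens. Add ~8 GiB for the IQ4_XS weights and it just squeezes under 24GB. The same cache at f16 would be ~27 GiB on its own, so the q8_0 K/V quantization is what makes this possible.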
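One note for the concurrent setup mentioned above: llama-server splits --ctx-size evenly across its slots, so keeping everything else in the command the same and bumping, say,

    --parallel 4

gives four slots of 147456 / 4 = 36864 tokens each, still plenty per page, while letting requests batch together for higher aggregate throughput. (The slot count here is just an illustration; tune it to your workload.)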

Cheers and thanks for all the quants!
