How can I use the GGUF quantization of this model in Ollama?

#2
by makisekurisu-jp - opened

For image generation, image editing, and image understanding, there are better choices available in ComfyUI.

However, due to its multimodal capabilities, this model would be well-suited for deployment on Ollama.

gguf-connector is more powerful than Ollama (cited from connector) 🐷

Currently it loads in ComfyUI, but it hasn't received official support from Comfy yet. Custom nodes can use DFloat11 compression or FP16/FP8 block swapping, but the efficiency is still far from optimal.

How do we use this model with gguf-connector, since it doesn't work in ComfyUI yet? What are you using for BAGEL?

Download the model and VAE files, put them in the same directory, then execute ggc b2; that's it.
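
For reference, here's a minimal sketch of that step in Python, assuming the files are on the Hugging Face Hub; the repo id and filenames below are placeholders, not the real ones:

```python
# Hypothetical example: fetch the model and VAE files into one working
# directory, then run `ggc b2` from that same directory.
from huggingface_hub import hf_hub_download

repo_id = "some-user/bagel-gguf"  # placeholder -- replace with the actual repo id
for filename in ("bagel-q4_0.gguf", "vae.safetensors"):  # placeholder filenames
    hf_hub_download(repo_id=repo_id, filename=filename, local_dir=".")

# afterwards, from this directory:  ggc b2
```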

[screenshot: Screenshot 2025-05-29 121813.png]

Test the 3rd tab, Image Understanding, first and see whether it works on your machine; for Text to Image and Image Edit you might need to wait 10x longer than for Image Understanding.

Actually, "ggc b2" does not support .gguf files, currently it can only recognize .safetensors files.
So is this a trap set up to promote the "gguf-connector"?

It supports it now; upgrade your gguf-connector and execute ggc b2 again. Building the engine takes time. We don't need to promote anything, since you pay nothing for it/us; if you are not prepared to contribute anything, please at least leave a constructive comment.
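
If you're not sure which version you have, a quick generic check (plain Python, not part of ggc) before and after upgrading:

```python
# Print the installed gguf-connector version; upgrade via pip if it's outdated.
from importlib.metadata import PackageNotFoundError, version

try:
    print("gguf-connector", version("gguf-connector"))
except PackageNotFoundError:
    print("gguf-connector is not installed")
```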

I apologize for my previous misunderstanding.
But even the new version seems to merely dequantize the GGUF for inference, without benefiting from GGUF's advantages, and instead adds a lengthy dequantization step.

That's alright; other engines work like that, since torch cannot deal with GGUF tensors directly, but transferring at the layer level with mmap works better.
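
Roughly what that looks like, as a generic sketch (not gguf-connector's actual engine code; the filename is a placeholder):

```python
# GGUFReader memory-maps the file, so each quantized tensor can be
# dequantized on demand, layer by layer, rather than all at once.
import torch
from gguf import GGUFReader
from gguf.quants import dequantize

reader = GGUFReader("model-q4_0.gguf")  # placeholder filename; data stays mmap'd
for t in reader.tensors:
    # t.data is a quantized numpy view into the mmap; dequantize returns float32
    weight = torch.from_numpy(dequantize(t.data, t.tensor_type))
    # ...reshape to the tensor's logical shape and copy into the torch module...
```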

Switch back to safetensors for the model file right away and keep only the VAE as GGUF; then it would be very fast. They stopped working on it, saying the picture output quality doesn't seem very consistent 🐷, and the library they used doesn't seem to assume you'd run it on a retail GPU from day one; anyway.

I don't really understand the loading logic of ggc b2. After execution, only the VAE file supports GGUF; the model file still points to safetensors. And the FP8 block-swap strategy is no different from the original version: it still can't fully load into the GPU, and the speed is the same as FP8.

You could adjust how much you want to load into cuda:0, system RAM, and/or the CPU; the default setting from gguf-connector is already auto-detection. Please modify the engine here for custom allocation instead of using ggc b2 if that's your case.
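
For anyone editing the engine, the usual pattern for that kind of custom split looks something like this (a generic accelerate sketch, not gguf-connector's own code; the memory caps are made-up numbers to tune for your machine):

```python
# Cap how much of the model lands on cuda:0 versus system RAM; anything that
# doesn't fit under the caps gets offloaded automatically.
import torch
from accelerate import dispatch_model, infer_auto_device_map

model = torch.nn.Linear(4096, 4096)  # stand-in for the real model object
device_map = infer_auto_device_map(
    model,
    max_memory={0: "10GiB", "cpu": "48GiB"},  # cuda:0 cap vs system RAM cap
)
model = dispatch_model(model, device_map=device_map)
print(device_map)
```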

Even if it is quantized to GGUF, do we still need Flash Attention 2 support to use it? I mean, if I use a graphics card with Volta architecture or older, do I need to upgrade it?
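
For what it's worth, FlashAttention 2 targets Ampere-or-newer GPUs (compute capability 8.0+), so a Volta card won't run it. A quick generic check (not part of ggc):

```python
import torch

# FlashAttention 2 needs compute capability >= 8.0 (Ampere/Ada/Hopper);
# Volta is 7.0, so it would need a fallback such as PyTorch's built-in SDPA.
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (8, 0):
    print("This GPU can use FlashAttention 2")
else:
    print("No FlashAttention 2 here; fall back to e.g. attn_implementation='sdpa'")
```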

How do you guys get this working?? When I run 'ggc b2' I get an error that flash-attn isn't installed, then when I try to install flash-attn I get an error that CUDA_HOME environment variable isn't set, so I set that to where CUDA is installed and it just ignores the variable and says it isn't set. I've tried different Python versions because I was getting errors on 3.13. Tried 3.12, 3.11, 3.10. Every time a different error. Maddening. Pretty much giving up on Bagel and ggc.

Edit: Found possible fix here: https://huggingface.co/microsoft/Florence-2-base/discussions/4#6673387ae95291ddd44923b4 - I was using torch without CUDA support because I installed from requirements.txt; I got the correct pip install command from the PyTorch site. Once I uninstalled/reinstalled the correct version and installed the VC++ build tools, flash-attn started to build.
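
A quick way to confirm the torch wheel actually has CUDA support before retrying the flash-attn build (generic check, not from the linked thread):

```python
import torch

print(torch.__version__)          # a '+cpu' suffix means a CPU-only wheel
print(torch.version.cuda)         # None on a CPU-only build
print(torch.cuda.is_available())  # should be True before building flash-attn
```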

Edit2: Now triton won't install because it apparently only supports Linux and the build fails with some obscure error. Officially done with ggc. I'll find another way to run Bagel or wait for a tool that actually works to support it. Hours wasted trying to get this to run.

Actually, you could opt to install it (flash-attn) with a pre-built wheel; might be easier, anyway.
