Why I didn't include quants, and why I used this implementation
A lot of people asked me this on Discord and Reddit, so here it is in a nutshell:
Vision support leaves a lot to be desired. There are numerous issues when using various popular frontends. While some work—or partially work—the provided code delivers the best results in terms of accuracy, compatibility, and bulk inference (you can run inference on 999 images with a single command, though it will obviously take some time).
The code also offers ease of use and flexibility: just put your prompt in 'prompts.txt' and you're set. Want to run several prompts? Simply add another line; each line is treated as a separate prompt.
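The prompts.txt convention can be sketched roughly like this; the file name comes from the text above, but the function name and loop are purely illustrative, not the actual code:

```python
# Illustrative sketch only: each non-empty line of prompts.txt is
# treated as its own prompt, run once per image.
from pathlib import Path


def load_prompts(path="prompts.txt"):
    """Return one prompt per non-empty line of the file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical example file with two prompts.
    Path("prompts.txt").write_text("Describe the image.\nList any visible text.\n")
    for prompt in load_prompts():
        print(prompt)  # in the real script, inference would run here
```

Adding another prompt is just appending another line to the file; blank lines are skipped.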
Do I think this implementation is awesome? Hell no, I really dislike it. But it's what works best. Even though vision has existed for years, support was consistently poor across the board. Now, in 2025, we're likely to see drastic improvements.