
# Qwen3-VL Cookbook Alignment

This project mirrors the official Qwen3-VL cookbook patterns (see `../Qwen3-VL/cookbooks`) when running the CoRGI pipeline with the real model.

## Key Parallels

- **Model + Processor loading:** We rely on `AutoModelForImageTextToText` and `AutoProcessor` exactly as described in the main README and cookbook notebooks such as `think_with_images.ipynb`.
- **Chat template:** `Qwen3VLClient` uses `processor.apply_chat_template(..., add_generation_prompt=True)` before calling `generate`, which matches the recommended multi-turn messaging flow.
- **Image transport:** Both the pipeline and demo scripts accept PIL images and convert them to RGB before inference, mirroring the cookbook utilities that normalize channels.
- **Max tokens & decoding:** The defaults `max_new_tokens=512` and `temperature=0.2` align with the cookbook demos, which favour near-deterministic outputs for evaluation.
- **Single-model pipeline:** All stages (reasoning, ROI extraction, answer synthesis) run on the same Qwen3-VL instance, following the cookbook philosophy of leveraging the model's intrinsic grounding capability without external detectors.
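The loading and generation flow described in the bullets above can be sketched as below. This is a minimal illustration, not the actual `Qwen3VLClient` code: the helper names (`to_rgb`, `build_messages`, `run_inference`) are invented for this sketch, and the decoding parameters simply echo the defaults listed above.

```python
from PIL import Image


def to_rgb(image: Image.Image) -> Image.Image:
    """Normalize channels before inference, as the cookbook utilities do."""
    return image if image.mode == "RGB" else image.convert("RGB")


def build_messages(image: Image.Image, prompt: str) -> list:
    """Multi-turn chat payload in the shape apply_chat_template expects."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def run_inference(model_id: str, image: Image.Image, prompt: str) -> str:
    """One generate call following the cookbook pattern."""
    # Heavy imports kept local so the helpers above stay importable offline.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    messages = build_messages(to_rgb(image), prompt)
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=512, do_sample=True, temperature=0.2
        )
    # Strip the prompt tokens before decoding, cookbook-style.
    trimmed = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

The chat-template call handles image token insertion itself, which is why the sketch never touches the tokenizer directly.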

## Practical Tips for Local Inference

- Use the `pytorch` Conda env with a recent `transformers` (>= 4.45) so that `AutoModelForImageTextToText` is available, as advised in the cookbook README.
- When VRAM is limited, switch to `Qwen/Qwen3-VL-4B-Instruct` via `--model-id` or the `CORGI_QWEN_MODEL` environment variable; no other code changes are needed.
- The integration test (`corgi_tests/test_integration_qwen.py`) and demo (`examples/demo_qwen_corgi.py`) download the official demo image when `CORGI_DEMO_IMAGE` is not supplied, matching the cookbook notebooks that reference the same asset URL.
- For reproducibility, set `HF_HOME` (or use the cookbook's `snapshot_download`) to manage local caches and avoid repeated downloads.
- The `Qwen/Qwen3-VL-8B-Thinking` checkpoint often emits free-form "thinking" text instead of JSON; the pipeline now falls back to parsing those narratives for step and ROI extraction, and strips `<think>…</think>` scaffolding from final answers.
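The scaffold-stripping fallback for the Thinking checkpoint can be sketched as below; the function name is illustrative, and the real parsing in the pipeline also recovers steps and ROIs from the narrative, which this sketch does not attempt.

```python
import re

# Match a <think>…</think> block (including its trailing whitespace),
# spanning newlines, so only the final answer text remains.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>…</think> scaffolding from a Thinking-checkpoint reply."""
    return THINK_RE.sub("", text).strip()
```

Answers from Instruct checkpoints pass through unchanged, so the same post-processing path can serve both model variants.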

These notes ensure our CoRGI adaptation stays consistent with the official Qwen workflow while keeping the codebase modular for experimentation.