Spaces:
Runtime error
Runtime error
Qwen3-VL Cookbook Alignment
This project mirrors the official Qwen3-VL cookbook patterns (see ../Qwen3-VL/cookbooks) when running the CoRGI pipeline with the real model.
Key Parallels
- Model + Processor loading: We rely on
AutoModelForImageTextToTextandAutoProcessorexactly as described in the main README and cookbook notebooks such asthink_with_images.ipynb. - Chat template:
Qwen3VLClientusesprocessor.apply_chat_template(..., add_generation_prompt=True)before callinggenerate, which matches the recommended multi-turn messaging flow. - Image transport: Both the pipeline and demo scripts accept PIL images and ensure conversion to RGB prior to inference, mirroring cookbook utilities that normalize channels.
- Max tokens & decoding: Default
max_new_tokens=512andtemperature=0.2align with cookbook demos favouring deterministic outputs for evaluation. - Single-model pipeline: All stages (reasoning, ROI extraction, answer synthesis) are executed by the same Qwen3-VL instance, following the cookbook philosophy of leveraging the model’s intrinsic grounding capability without external detectors.
Practical Tips for Local Inference
- Use the
pytorchConda env with the latesttransformers(>=4.45) to accessAutoModelForImageTextToTextsupport, as advised in the cookbook README. - When VRAM is limited, switch to
Qwen/Qwen3-VL-4B-Instructvia--model-idorCORGI_QWEN_MODELenvironment variable—no other code changes needed. - The integration test (
corgi_tests/test_integration_qwen.py) and demo (examples/demo_qwen_corgi.py) download the official demo image ifCORGI_DEMO_IMAGEis not supplied, matching cookbook notebooks that reference the same asset URL. - For reproducibility, set
HF_HOME(or use the cookbook’ssnapshot_download) to manage local caches and avoid repeated downloads. - The
Qwen/Qwen3-VL-8B-Thinkingcheckpoint often emits free-form “thinking” text instead of JSON; the pipeline now falls back to parsing those narratives for step and ROI extraction, and strips<think>…</think>scaffolding from final answers.
These notes ensure our CoRGI adaptation stays consistent with the official Qwen workflow while keeping the codebase modular for experimentation.