dung-vpt-uney committed
Commit b6a01d6 · 1 Parent(s): b9f8b29

Deploy latest CoRGI Gradio demo
PROGRESS_LOG.md ADDED
@@ -0,0 +1,33 @@
+ # CoRGI Custom Demo — Progress Log
+
+ > Keep this log short and chronological. Newest updates at the top.
+
+ ## 2024-10-22
+ - Reproduced the CoRGI pipeline failure with the real `Qwen/Qwen3-VL-8B-Thinking` checkpoint and traced it to reasoning outputs that only use ordinal step words.
+ - Taught the text parser to normalize “First/Second step” style markers into numeric indices, refreshed the unit tests to cover the new heuristic, and reran the demo/end-to-end pipeline successfully.
+ - Tidied Qwen generation settings to avoid unused temperature flags when running deterministically.
+ - Validated ROI extraction on a vision-heavy prompt against the real model and hardened prompts so responses stay in structured JSON without verbose preambles.
+ - Added meta-comment pruning so thinking-mode rambles (e.g., redundant “Step 3” reflections) are dropped while preserving genuine reasoning; confirmed with the official demo image that only meaningful steps remain.
+
+ ## 2024-10-21
+ - Updated default checkpoints to `Qwen/Qwen3-VL-8B-Thinking` and verified CLI/Gradio/test coverage.
+ - Exercised the real model to capture thinking-style outputs; added parser fallbacks for textual reasoning/ROI responses and stripped `<think>` tags from answer synthesis.
+ - Extended unit test suite (reasoning, ROI, client helpers) to cover the new parsing paths and ran `pytest` successfully.
+
+ ## 2024-10-20
+ - Added optional integration test (`corgi_tests/test_integration_qwen.py`) gated by `CORGI_RUN_QWEN_INTEGRATION` for running the real Qwen3-VL model on the official demo asset.
+ - Created runnable example script (`examples/demo_qwen_corgi.py`) to reproduce the Hugging Face demo prompt locally with structured pipeline logging.
+ - Published Hugging Face Space harness (`app.py`) and deployment helper (`scripts/push_space.sh`) including requirements for ZeroGPU tier.
+ - Documented cookbook alignment and inference tips (`QWEN_INFERENCE_NOTES.md`).
+ - Added CLI runner (`corgi.cli`) with formatting helpers plus JSON export; authored matching unittest coverage.
+ - Implemented Gradio demo harness (`corgi.gradio_app`) with markdown reporting and helper utilities for dependency injection.
+ - Expanded unit test suite (CLI + Gradio) and ran `pytest corgi_tests` successfully (1 skip when gradio missing).
+ - Initialized structured project plan and progress log scaffolding.
+ - Assessed existing modules (`corgi.pipeline`, `corgi.qwen_client`, parsers, tests) to identify pending demo features (CLI + Gradio).
+ - Confirmed Qwen3-VL will be the single backbone for reasoning, ROI verification, and answer synthesis.
+
+ <!-- Template for future updates:
+ ## YYYY-MM-DD
+ - Summary of change / milestone.
+ - Follow-up actions.
+ -->
PROJECT_PLAN.md ADDED
@@ -0,0 +1,51 @@
+ # CoRGI Custom Demo — Project Plan
+
+ ## Context
+ - **Objective**: ship a runnable CoRGI demo (CLI + Gradio) powered entirely by Qwen3-VL for structured reasoning, ROI evidence extraction, and answer synthesis.
+ - **Scope**: stay within the `corgi_custom` package, reuse Qwen3-VL cookbooks where possible, keep dependency footprint minimal (no extra detectors/rerankers).
+ - **Environment**: Conda env `pytorch`, default VLM `Qwen/Qwen3-VL-8B-Thinking`.
+
+ ## Milestones
+ | Status | Milestone | Notes |
+ | --- | --- | --- |
+ | ✅ | Core pipeline skeleton (dataclasses, parsers, Qwen client wrappers) | Already merged in repo. |
+ | ✅ | Project documentation & progress tracking scaffolding | Plan + progress log committed. |
+ | ✅ | CLI runner that prints step-by-step pipeline output | Supports overrides + JSON export. |
+ | ✅ | Gradio demo mirroring CLI functionality | Blocks UI with markdown report messaging. |
+ | ✅ | Automated tests for new modules | CLI + Gradio helpers covered with unit tests. |
+ | ✅ | HF Space deployment automation | Bash script + app harness for ZeroGPU Spaces. |
+ | 🟡 | Final verification (unit tests, smoke instructions) | Document how to run `pytest` and the demos. |
+
+ ## Work Breakdown Structure
+ 1. **Docs & Tracking**
+    - [x] Finalize plan and progress log templates.
+    - [x] Document environment setup expectations.
+ 2. **Pipeline UX**
+    - [x] Implement CLI entrypoint (`corgi.cli:main`).
+    - [x] Provide structured stdout for steps/evidence/answer.
+    - [x] Allow optional JSON dump for downstream tooling.
+ 3. **Interactive Demo**
+    - [x] Build Gradio app harness (image upload + question textbox).
+    - [ ] Stream progress (optional) and display textual reasoning/evidence.
+    - [x] Handle model loading errors gracefully.
+ 4. **Testing & Tooling**
+    - [x] Add fixture-friendly helpers to avoid heavy model loads in tests.
+    - [x] Write unit tests for CLI argument parsing + formatting.
+    - [ ] Add regression test for pipeline serialization (see the serialization sketch just after this plan).
+ 5. **Docs & Hand-off**
+    - [ ] Update README/demo instructions.
+    - [ ] Provide sample command sequences for CLI/Gradio.
+    - [ ] Capture open risks & future enhancements.
+ 6. **Deployment & Ops**
+    - [x] Add Hugging Face Space entrypoint (`app.py`).
+    - [x] Write deployment helper script (`scripts/push_space.sh`).
+    - [ ] Add automated checklists/logs for Space updates.
+
+ ## Risks & Mitigations
+ - **Model loading latency / VRAM** → expose config knobs and mention 4B fallback.
+ - **Parsing drift from Qwen outputs** → keep parser tolerant; add debug flag to dump raw responses.
+ - **Test runtime** → mock Qwen client via fixtures; avoid loading real model in unit tests.
+
+ ## Progress Tracking
+ - Refer to `PROGRESS_LOG.md` for dated status updates.
+ - Update milestone table whenever a deliverable completes.
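The pending serialization regression test would pin down the shape produced by `PipelineResult.to_json()`; here is a minimal sketch built from the committed dataclasses, with all values hand-made for illustration.

```python
from corgi.pipeline import PipelineResult
from corgi.types import GroundedEvidence, ReasoningStep

# Hand-built result standing in for a real pipeline run.
result = PipelineResult(
    question="Is anyone wearing a white watch?",
    steps=[ReasoningStep(index=1, statement="Check each wrist", needs_vision=True, reason="visual detail")],
    evidence=[GroundedEvidence(step_index=1, bbox=(0.12, 0.25, 0.18, 0.32), description="white watch", confidence=0.9)],
    answer="Yes, one person wears a white watch.",
)

payload = result.to_json()
assert payload["steps"][0]["needs_vision"] is True
assert payload["evidence"][0]["bbox"] == [0.12, 0.25, 0.18, 0.32]
assert payload["answer"].startswith("Yes")
```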
QWEN_INFERENCE_NOTES.md ADDED
@@ -0,0 +1,19 @@
+ # Qwen3-VL Cookbook Alignment
+
+ This project mirrors the official Qwen3-VL cookbook patterns (see `../Qwen3-VL/cookbooks`) when running the CoRGI pipeline with the real model.
+
+ ## Key Parallels
+ - **Model + Processor loading**: We rely on `AutoModelForImageTextToText` and `AutoProcessor` exactly as described in the main README and cookbook notebooks such as `think_with_images.ipynb`.
+ - **Chat template**: `Qwen3VLClient` uses `processor.apply_chat_template(..., add_generation_prompt=True)` before calling `generate`, which matches the recommended multi-turn messaging flow.
+ - **Image transport**: Both the pipeline and demo scripts accept PIL images and ensure conversion to RGB prior to inference, mirroring cookbook utilities that normalize channels.
+ - **Max tokens & decoding**: The default `max_new_tokens=512` with greedy decoding (temperature is only forwarded when sampling is enabled) aligns with cookbook demos favouring deterministic outputs for evaluation.
+ - **Single-model pipeline**: All stages (reasoning, ROI extraction, answer synthesis) are executed by the same Qwen3-VL instance, following the cookbook philosophy of leveraging the model’s intrinsic grounding capability without external detectors.
+
+ ## Practical Tips for Local Inference
+ - Use the `pytorch` Conda env with the latest `transformers` (>=4.45) to access `AutoModelForImageTextToText` support, as advised in the cookbook README.
+ - When VRAM is limited, switch to `Qwen/Qwen3-VL-4B-Instruct` via `--model-id` or the `CORGI_QWEN_MODEL` environment variable; no other code changes are needed.
+ - The integration test (`corgi_tests/test_integration_qwen.py`) and demo (`examples/demo_qwen_corgi.py`) download the official demo image if `CORGI_DEMO_IMAGE` is not supplied, matching cookbook notebooks that reference the same asset URL.
+ - For reproducibility, set `HF_HOME` (or use the cookbook’s `snapshot_download`) to manage local caches and avoid repeated downloads.
+ - The `Qwen/Qwen3-VL-8B-Thinking` checkpoint often emits free-form “thinking” text instead of JSON; the pipeline now falls back to parsing those narratives for step and ROI extraction, and strips `<think>…</think>` scaffolding from final answers.
+
+ These notes ensure our CoRGI adaptation stays consistent with the official Qwen workflow while keeping the codebase modular for experimentation.
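As a concrete illustration of the VRAM tip above, the lighter checkpoint can also be selected explicitly when constructing the client. This is a sketch using the classes committed below; it assumes a GPU-capable environment and downloads the checkpoint on first use.

```python
from corgi.pipeline import CoRGIPipeline
from corgi.qwen_client import Qwen3VLClient, QwenGenerationConfig

# Lighter checkpoint for constrained VRAM; the rest of the pipeline is unchanged.
config = QwenGenerationConfig(
    model_id="Qwen/Qwen3-VL-4B-Instruct",
    max_new_tokens=512,
    do_sample=False,  # deterministic decoding; temperature is only forwarded when sampling
)
pipeline = CoRGIPipeline(vlm_client=Qwen3VLClient(config=config))
```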
README.md CHANGED
@@ -1,12 +1,13 @@
- ---
- title: Corgi Qwen3 Vl Demo
- emoji: 😻
- colorFrom: green
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # CoRGI Qwen3-VL Demo
+
+ This Space hosts the CoRGI reasoning pipeline backed by the Qwen/Qwen3-VL-8B-Thinking model.
+
+ ## Run Locally
+ ```
+ pip install -r requirements.txt
+ python examples/demo_qwen_corgi.py
+ ```
+
+ ## Notes
+ - The demo queues requests sequentially (ZeroGPU/cpu-basic hardware).
+ - Configure `CORGI_QWEN_MODEL` to switch to a different checkpoint.
app.py ADDED
@@ -0,0 +1,10 @@
+ """Hugging Face Spaces entrypoint for the CoRGI Qwen3-VL demo."""
+
+ from corgi.gradio_app import build_demo
+
+
+ demo = build_demo()
+ demo.queue(default_concurrency_limit=1)  # Gradio 4+ replaced queue(concurrency_count=...)
+
+ if __name__ == "__main__":
+     demo.launch()
corgi/__init__.py ADDED
@@ -0,0 +1,11 @@
+ """CoRGI pipeline package using Qwen3-VL."""
+
+ from .pipeline import CoRGIPipeline, PipelineResult
+ from .types import GroundedEvidence, ReasoningStep
+
+ __all__ = [
+     "CoRGIPipeline",
+     "PipelineResult",
+     "GroundedEvidence",
+     "ReasoningStep",
+ ]
corgi/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (413 Bytes).
corgi/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (419 Bytes).
corgi/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (419 Bytes).
corgi/__pycache__/cli.cpython-312.pyc ADDED
Binary file (6.41 kB).
corgi/__pycache__/cli.cpython-313.pyc ADDED
Binary file (6.39 kB).
corgi/__pycache__/gradio_app.cpython-312.pyc ADDED
Binary file (8.03 kB).
corgi/__pycache__/gradio_app.cpython-313.pyc ADDED
Binary file (8.24 kB).
corgi/__pycache__/parsers.cpython-310.pyc ADDED
Binary file (4.61 kB).
corgi/__pycache__/parsers.cpython-312.pyc ADDED
Binary file (18.1 kB).
corgi/__pycache__/parsers.cpython-313.pyc ADDED
Binary file (18.8 kB).
corgi/__pycache__/pipeline.cpython-310.pyc ADDED
Binary file (3.13 kB).
corgi/__pycache__/pipeline.cpython-312.pyc ADDED
Binary file (3.86 kB).
corgi/__pycache__/pipeline.cpython-313.pyc ADDED
Binary file (3.97 kB).
corgi/__pycache__/qwen_client.cpython-312.pyc ADDED
Binary file (9.01 kB).
corgi/__pycache__/qwen_client.cpython-313.pyc ADDED
Binary file (9.13 kB).
corgi/__pycache__/types.cpython-310.pyc ADDED
Binary file (2.16 kB).
corgi/__pycache__/types.cpython-312.pyc ADDED
Binary file (2.67 kB).
corgi/__pycache__/types.cpython-313.pyc ADDED
Binary file (2.77 kB).
corgi/cli.py ADDED
@@ -0,0 +1,131 @@
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import sys
+ from pathlib import Path
+ from typing import Callable, Optional, TextIO
+
+ from PIL import Image
+
+ from .pipeline import CoRGIPipeline
+ from .qwen_client import Qwen3VLClient, QwenGenerationConfig
+ from .types import GroundedEvidence, ReasoningStep
+
+ DEFAULT_MODEL_ID = "Qwen/Qwen3-VL-8B-Thinking"
+
+
+ def build_parser() -> argparse.ArgumentParser:
+     parser = argparse.ArgumentParser(
+         prog="corgi-cli",
+         description="Run the CoRGI reasoning pipeline over an image/question pair.",
+     )
+     parser.add_argument("--image", type=Path, required=True, help="Path to the input image (jpg/png/etc.)")
+     parser.add_argument("--question", type=str, required=True, help="Visual question for the image")
+     parser.add_argument("--max-steps", type=int, default=4, help="Maximum number of reasoning steps to request")
+     parser.add_argument(
+         "--max-regions",
+         type=int,
+         default=4,
+         help="Maximum number of grounded regions per visual step",
+     )
+     parser.add_argument(
+         "--model-id",
+         type=str,
+         default=None,
+         help="Optional override for the Qwen3-VL model identifier",
+     )
+     parser.add_argument(
+         "--json-out",
+         type=Path,
+         default=None,
+         help="Optional path to write the pipeline result as JSON",
+     )
+     return parser
+
+
+ def _format_step(step: ReasoningStep) -> str:
+     needs = "yes" if step.needs_vision else "no"
+     suffix = f"; reason: {step.reason}" if step.reason else ""
+     return f"[{step.index}] {step.statement} (needs vision: {needs}{suffix})"
+
+
+ def _format_evidence_item(evidence: GroundedEvidence) -> str:
+     bbox = ", ".join(f"{coord:.2f}" for coord in evidence.bbox)
+     parts = [f"Step {evidence.step_index} | bbox=({bbox})"]
+     if evidence.description:
+         parts.append(f"desc: {evidence.description}")
+     if evidence.confidence is not None:
+         parts.append(f"conf: {evidence.confidence:.2f}")
+     return " | ".join(parts)
+
+
+ def _default_pipeline_factory(model_id: Optional[str]) -> CoRGIPipeline:
+     config = QwenGenerationConfig(model_id=model_id or DEFAULT_MODEL_ID)
+     client = Qwen3VLClient(config=config)
+     return CoRGIPipeline(vlm_client=client)
+
+
+ def execute_cli(
+     *,
+     image_path: Path,
+     question: str,
+     max_steps: int,
+     max_regions: int,
+     model_id: Optional[str],
+     json_out: Optional[Path],
+     pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+     output_stream: TextIO | None = None,
+ ) -> None:
+     if output_stream is None:
+         output_stream = sys.stdout
+     factory = pipeline_factory or _default_pipeline_factory
+
+     with Image.open(image_path) as img:
+         image = img.convert("RGB")
+         pipeline = factory(model_id)
+         result = pipeline.run(
+             image=image,
+             question=question,
+             max_steps=max_steps,
+             max_regions=max_regions,
+         )
+
+     print(f"Question: {question}", file=output_stream)
+     print("-- Steps --", file=output_stream)
+     for step in result.steps:
+         print(_format_step(step), file=output_stream)
+     if not result.steps:
+         print("(no reasoning steps returned)", file=output_stream)
+
+     print("-- Evidence --", file=output_stream)
+     if result.evidence:
+         for evidence in result.evidence:
+             print(_format_evidence_item(evidence), file=output_stream)
+     else:
+         print("(no visual evidence)", file=output_stream)
+
+     print("-- Answer --", file=output_stream)
+     print(f"Answer: {result.answer}", file=output_stream)
+
+     if json_out is not None:
+         json_out.parent.mkdir(parents=True, exist_ok=True)
+         with json_out.open("w", encoding="utf-8") as handle:
+             json.dump(result.to_json(), handle, ensure_ascii=False, indent=2)
+
+
+ def main(argv: Optional[list[str]] = None) -> int:
+     parser = build_parser()
+     args = parser.parse_args(argv)
+     execute_cli(
+         image_path=args.image,
+         question=args.question,
+         max_steps=args.max_steps,
+         max_regions=args.max_regions,
+         model_id=args.model_id,
+         json_out=args.json_out,
+     )
+     return 0
+
+
+ __all__ = ["build_parser", "execute_cli", "main"]
corgi/gradio_app.py ADDED
@@ -0,0 +1,166 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Callable, Optional
+
+ from PIL import Image
+
+ from .cli import DEFAULT_MODEL_ID
+ from .pipeline import CoRGIPipeline, PipelineResult
+ from .qwen_client import Qwen3VLClient, QwenGenerationConfig
+
+
+ @dataclass
+ class PipelineState:
+     model_id: str
+     pipeline: Optional[CoRGIPipeline]
+
+
+ def _default_factory(model_id: Optional[str]) -> CoRGIPipeline:
+     config = QwenGenerationConfig(model_id=model_id or DEFAULT_MODEL_ID)
+     return CoRGIPipeline(vlm_client=Qwen3VLClient(config=config))
+
+
+ def ensure_pipeline_state(
+     previous: Optional[PipelineState],
+     model_id: Optional[str],
+     factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+ ) -> PipelineState:
+     target_model = model_id or DEFAULT_MODEL_ID
+     factory = factory or _default_factory
+     if previous is not None and previous.model_id == target_model:
+         return previous
+     pipeline = factory(target_model)
+     return PipelineState(model_id=target_model, pipeline=pipeline)
+
+
+ def format_result_markdown(result: PipelineResult) -> str:
+     lines: list[str] = []
+     lines.append("### Answer")
+     lines.append(result.answer or "(no answer returned)")
+     lines.append("")
+     lines.append("### Reasoning Steps")
+     if result.steps:
+         for step in result.steps:
+             needs = "yes" if step.needs_vision else "no"
+             reason = f" — {step.reason}" if step.reason else ""
+             lines.append(f"- **Step {step.index}**: {step.statement} _(needs vision: {needs})_{reason}")
+     else:
+         lines.append("- No reasoning steps returned.")
+     lines.append("")
+     lines.append("### Visual Evidence")
+     if result.evidence:
+         for ev in result.evidence:
+             bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
+             desc = ev.description or "(no description)"
+             conf = f" — confidence {ev.confidence:.2f}" if ev.confidence is not None else ""
+             lines.append(f"- Step {ev.step_index}: bbox=({bbox}) — {desc}{conf}")
+     else:
+         lines.append("- No visual evidence collected.")
+     return "\n".join(lines)
+
+
+ def _run_pipeline(
+     state: Optional[PipelineState],
+     image: Image.Image | None,
+     question: str,
+     max_steps: int,
+     max_regions: int,
+     model_id: Optional[str],
+     factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+ ) -> tuple[PipelineState, str]:
+     if image is None:
+         return state or PipelineState(model_id=model_id or DEFAULT_MODEL_ID, pipeline=None), "Please provide an image before running the demo."
+     if not question.strip():
+         return state or PipelineState(model_id=model_id or DEFAULT_MODEL_ID, pipeline=None), "Please enter a question before running the demo."
+     new_state = ensure_pipeline_state(state if state and state.pipeline else None, model_id, factory)
+     result = new_state.pipeline.run(
+         image=image.convert("RGB"),
+         question=question.strip(),
+         max_steps=int(max_steps),
+         max_regions=int(max_regions),
+     )
+     markdown = format_result_markdown(result)
+     return new_state, markdown
+
+
+ def build_demo(
+     pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+ ) -> "gradio.Blocks":
+     try:
+         import gradio as gr
+     except ImportError as exc:  # pragma: no cover - exercised when gradio missing
+         raise RuntimeError("Gradio is required to build the demo. Install gradio>=4.0.") from exc
+
+     factory = pipeline_factory or _default_factory
+
+     with gr.Blocks(title="CoRGI Qwen3-VL Demo") as demo:
+         state = gr.State()  # stores PipelineState
+
+         with gr.Row():
+             with gr.Column(scale=1, min_width=320):
+                 image_input = gr.Image(label="Input image", type="pil")
+                 question_input = gr.Textbox(label="Question", placeholder="What is happening in the image?", lines=2)
+                 model_id_input = gr.Textbox(
+                     label="Model ID",
+                     value=DEFAULT_MODEL_ID,
+                     placeholder="Leave blank to use default",
+                 )
+                 max_steps_slider = gr.Slider(
+                     label="Max reasoning steps",
+                     minimum=1,
+                     maximum=6,
+                     step=1,
+                     value=4,
+                 )
+                 max_regions_slider = gr.Slider(
+                     label="Max regions per step",
+                     minimum=1,
+                     maximum=6,
+                     step=1,
+                     value=4,
+                 )
+                 run_button = gr.Button("Run CoRGI")
+
+             with gr.Column(scale=1, min_width=320):
+                 result_markdown = gr.Markdown(value="Upload an image and ask a question to begin.")
+
+         def _on_submit(state_data, image, question, model_id, max_steps, max_regions):
+             pipeline_state = state_data if isinstance(state_data, PipelineState) else None
+             new_state, markdown = _run_pipeline(
+                 pipeline_state,
+                 image,
+                 question,
+                 int(max_steps),
+                 int(max_regions),
+                 model_id if model_id else None,
+                 factory,
+             )
+             return new_state, markdown
+
+         run_button.click(
+             fn=_on_submit,
+             inputs=[state, image_input, question_input, model_id_input, max_steps_slider, max_regions_slider],
+             outputs=[state, result_markdown],
+         )
+
+     return demo
+
+
+ def launch_demo(
+     *,
+     pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+     **launch_kwargs,
+ ) -> None:
+     demo = build_demo(pipeline_factory=pipeline_factory)
+     demo.launch(**launch_kwargs)
+
+
+ __all__ = [
+     "PipelineState",
+     "ensure_pipeline_state",
+     "format_result_markdown",
+     "build_demo",
+     "launch_demo",
+     "DEFAULT_MODEL_ID",
+ ]
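`format_result_markdown` can be exercised without Gradio or a loaded model, and `build_demo(pipeline_factory=...)` accepts the same kind of injected factory when a full UI test is needed. A sketch of the markdown report with a hand-built result (all values illustrative):

```python
from corgi.gradio_app import format_result_markdown
from corgi.pipeline import PipelineResult
from corgi.types import GroundedEvidence, ReasoningStep

result = PipelineResult(
    question="Is anyone wearing a white watch?",
    steps=[ReasoningStep(index=1, statement="Inspect each wrist", needs_vision=True, reason="fine-grained detail")],
    evidence=[GroundedEvidence(step_index=1, bbox=(0.12, 0.25, 0.18, 0.32), description="white watch", confidence=0.88)],
    answer="Yes, the person on the left wears a white watch.",
)

# Renders the "### Answer", "### Reasoning Steps", and "### Visual Evidence" sections
# exactly as the Gradio Markdown panel would display them.
print(format_result_markdown(result))
```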
corgi/parsers.py ADDED
@@ -0,0 +1,390 @@
+ from __future__ import annotations
+
+ import json
+ import re
+ from typing import Any, Iterable, List
+
+ from .types import GroundedEvidence, ReasoningStep
+
+
+ _JSON_FENCE_RE = re.compile(r"```(?:json)?(.*?)```", re.DOTALL | re.IGNORECASE)
+ _STEP_MARKER_RE = re.compile(r"(?im)(?:^|\n)\s*(?:step\s*(\d+)|(\d+)[\.\)])\s*[:\-]?\s*")
+ _NEEDS_VISION_RE = re.compile(
+     r"needs[\s_]*vision\s*[:\-]?\s*(?P<value>true|false|yes|no|required|not required|necessary|unnecessary)",
+     re.IGNORECASE,
+ )
+ _REASON_RE = re.compile(r"reason\s*[:\-]\s*(?P<value>.+)", re.IGNORECASE)
+ _BOX_RE = re.compile(
+     r"\[\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*\]"
+ )
+
+ _ORDINAL_WORD_MAP = {
+     "first": 1,
+     "second": 2,
+     "third": 3,
+     "fourth": 4,
+     "fifth": 5,
+     "sixth": 6,
+     "seventh": 7,
+     "eighth": 8,
+     "ninth": 9,
+     "tenth": 10,
+ }
+
+ _NUMBER_WORD_MAP = {
+     "one": 1,
+     "two": 2,
+     "three": 3,
+     "four": 4,
+     "five": 5,
+     "six": 6,
+     "seven": 7,
+     "eight": 8,
+     "nine": 9,
+     "ten": 10,
+ }
+
+ _ORDINAL_STEP_RE = re.compile(
+     r"(?im)\b(?P<word>first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth)\s+step\b"
+ )
+ _WORD_STEP_RE = re.compile(
+     r"(?im)\bstep\s+(?P<word>one|two|three|four|five|six|seven|eight|nine|ten)\b"
+ )
+
+ _META_TOKENS = {"maybe", "wait", "let's", "lets", "question", "protocol"}
+
+
+ def _to_bool(value: Any) -> bool:
+     if isinstance(value, bool):
+         return value
+     if value is None:
+         return False
+     if isinstance(value, (int, float)):
+         return value != 0
+     if isinstance(value, str):
+         lowered = value.strip().lower()
+         if lowered in {"true", "t", "yes", "y", "1"}:
+             return True
+         if lowered in {"false", "f", "no", "n", "0"}:
+             return False
+     return False
+
+
+ def _extract_json_strings(text: str) -> Iterable[str]:
+     """Return candidate JSON payloads from the response text."""
+
+     fenced = _JSON_FENCE_RE.findall(text)
+     if fenced:
+         for body in fenced:
+             yield body.strip()
+     stripped = text.strip()
+     if stripped:
+         yield stripped
+
+
+ def _load_first_json(text: str) -> Any:
+     last_error = None
+     for candidate in _extract_json_strings(text):
+         try:
+             return json.loads(candidate)
+         except json.JSONDecodeError as err:
+             last_error = err
+             continue
+     if last_error:
+         raise ValueError(f"Unable to parse JSON from response: {last_error}") from last_error
+     raise ValueError("Empty response, cannot parse JSON.")
+
+
+ def _trim_reasoning_text(text: str) -> str:
+     lowered = text.lower()
+     for anchor in ("let's draft", "draft:", "structured steps", "final reasoning"):
+         pos = lowered.rfind(anchor)
+         if pos != -1:
+             return text[pos:]
+     return text
+
+
+ def _clean_sentence(text: str) -> str:
+     return " ".join(text.strip().split())
+
+
+ def _normalize_step_markers(text: str) -> str:
+     """Convert ordinal step markers into numeric form (e.g., 'First step' -> 'Step 1')."""
+
+     def replace_ordinal(match: re.Match[str]) -> str:
+         word = match.group("word").lower()
+         num = _ORDINAL_WORD_MAP.get(word)
+         return f"Step {num}" if num is not None else match.group(0)
+
+     def replace_word_number(match: re.Match[str]) -> str:
+         word = match.group("word").lower()
+         num = _NUMBER_WORD_MAP.get(word)
+         return f"Step {num}" if num is not None else match.group(0)
+
+     normalized = _ORDINAL_STEP_RE.sub(replace_ordinal, text)
+     normalized = _WORD_STEP_RE.sub(replace_word_number, normalized)
+     return normalized
+
+
+ def _extract_statement(body: str) -> str | None:
+     statement_match = re.search(r"statement\s*[:\-]\s*(.+)", body, re.IGNORECASE)
+     candidate = statement_match.group(1) if statement_match else body
+     # Remove trailing sections that describe vision or reason metadata.
+     candidate = re.split(r"(?i)needs\s*vision|reason\s*[:\-]", candidate)[0]
+     candidate = candidate.strip().strip(".")
+     if not candidate:
+         return None
+     return _clean_sentence(candidate)
+
+
+ def _extract_needs_vision(body: str) -> bool:
+     match = _NEEDS_VISION_RE.search(body)
+     if not match:
+         return True
+     token = match.group("value").strip().lower()
+     if token in {"not required", "unnecessary"}:
+         return False
+     if token in {"required", "necessary"}:
+         return True
+     return _to_bool(token)
+
+
+ def _extract_reason(body: str) -> str | None:
+     match = _REASON_RE.search(body)
+     if match:
+         reason = match.group("value").strip()
+         reason = re.split(r"(?i)needs\s*vision", reason)[0].strip()
+         reason = reason.rstrip(".")
+         return reason or None
+     because_match = re.search(r"because\s+(.+?)(?:\.|$)", body, re.IGNORECASE)
+     if because_match:
+         reason = because_match.group(1).strip().rstrip(".")
+         return reason or None
+     return None
+
+
+ def _parse_step_block(index_guess: int, body: str) -> ReasoningStep | None:
+     statement = _extract_statement(body)
+     if not statement:
+         return None
+     needs_vision = _extract_needs_vision(body)
+     reason = _extract_reason(body)
+     index = index_guess if index_guess > 0 else 1
+     return ReasoningStep(index=index, statement=statement, needs_vision=needs_vision, reason=reason)
+
+
+ def _parse_reasoning_from_text(response_text: str, max_steps: int) -> List[ReasoningStep]:
+     text = _trim_reasoning_text(response_text)
+     text = _normalize_step_markers(text)
+     matches = list(_STEP_MARKER_RE.finditer(text))
+     if not matches:
+         return []
+     steps_map: dict[int, ReasoningStep] = {}
+     ordering: List[int] = []
+     fallback_index = 1
+     for idx, marker in enumerate(matches):
+         start = marker.end()
+         end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
+         body = text[start:end].strip()
+         if not body:
+             continue
+         raw_index = marker.group(1) or marker.group(2)
+         try:
+             index_guess = int(raw_index) if raw_index else fallback_index
+         except (TypeError, ValueError):
+             index_guess = fallback_index
+         if raw_index is None:
+             fallback_index += 1
+         step = _parse_step_block(index_guess, body)
+         if step is None:
+             continue
+         if step.index not in steps_map:
+             ordering.append(step.index)
+             steps_map[step.index] = step
+         if len(ordering) >= max_steps:
+             break
+     return [steps_map[idx] for idx in ordering[:max_steps]]
+
+
+ def _looks_like_meta_statement(statement: str) -> bool:
+     lowered = statement.lower()
+     if any(token in lowered for token in _META_TOKENS) and "step" in lowered:
+         return True
+     if lowered.startswith(("maybe", "wait", "let's", "lets")):
+         return True
+     if len(statement) > 260 and "step" in lowered:
+         return True
+     return False
+
+
+ def _prune_steps(steps: List[ReasoningStep]) -> List[ReasoningStep]:
+     filtered: List[ReasoningStep] = []
+     seen_statements: set[str] = set()
+     for step in steps:
+         normalized = step.statement.strip().lower()
+         if _looks_like_meta_statement(step.statement):
+             continue
+         if normalized in seen_statements:
+             continue
+         seen_statements.add(normalized)
+         filtered.append(step)
+     return filtered or steps
+
+
+ def _extract_description(text: str, start_index: int) -> str | None:
+     boundary = max(text.rfind("\n", 0, start_index), text.rfind(".", 0, start_index))
+     if boundary == -1:
+         boundary = 0
+     snippet = text[boundary:start_index].strip(" \n.:–-")
+     if not snippet:
+         return None
+     return _clean_sentence(snippet)
+
+
+ def _parse_roi_from_text(response_text: str, default_step_index: int) -> List[GroundedEvidence]:
+     evidences: List[GroundedEvidence] = []
+     seen: set[tuple[float, float, float, float]] = set()
+     for match in _BOX_RE.finditer(response_text):
+         coords_str = match.group(0).strip("[]")
+         try:
+             coords = [float(part.strip()) for part in coords_str.split(",")]
+         except ValueError:
+             continue
+         if len(coords) != 4:
+             continue
+         try:
+             bbox = _normalize_bbox(coords)
+         except ValueError:
+             continue
+         key = tuple(round(c, 4) for c in bbox)
+         if key in seen:
+             continue
+         description = _extract_description(response_text, match.start())
+         evidences.append(
+             GroundedEvidence(
+                 step_index=default_step_index,
+                 bbox=bbox,
+                 description=description,
+                 confidence=None,
+                 raw_source={"bbox": coords, "description": description},
+             )
+         )
+         seen.add(key)
+     return evidences
+
+
+ def parse_structured_reasoning(response_text: str, max_steps: int) -> List[ReasoningStep]:
+     """Parse Qwen3-VL structured reasoning output into dataclasses."""
+
+     try:
+         payload = _load_first_json(response_text)
+     except ValueError as json_error:
+         steps = _parse_reasoning_from_text(response_text, max_steps=max_steps)
+         if steps:
+             return _prune_steps(steps)[:max_steps]
+         raise json_error
+     if not isinstance(payload, list):
+         raise ValueError("Structured reasoning response must be a JSON list.")
+
+     steps: List[ReasoningStep] = []
+     for idx, item in enumerate(payload, start=1):
+         if not isinstance(item, dict):
+             continue
+         statement = item.get("statement") or item.get("step") or item.get("text")
+         if not isinstance(statement, str):
+             continue
+         statement = statement.strip()
+         if not statement:
+             continue
+         step_index = item.get("index")
+         if not isinstance(step_index, int):
+             step_index = idx
+         needs_vision = _to_bool(item.get("needs_vision") or item.get("requires_vision"))
+         reason = item.get("reason") or item.get("justification")
+         if isinstance(reason, str):
+             reason = reason.strip() or None
+         else:
+             reason = None
+         steps.append(ReasoningStep(index=step_index, statement=statement, needs_vision=needs_vision, reason=reason))
+         if len(steps) >= max_steps:
+             break
+     steps = _prune_steps(steps)[:max_steps]
+     if not steps:
+         raise ValueError("No reasoning steps parsed from response.")
+     return steps
+
+
+ def _normalize_bbox(bbox: Any) -> tuple[float, float, float, float]:
+     if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
+         raise ValueError(f"Bounding box must be a list of 4 numbers, got {bbox!r}")
+     coords = []
+     for raw in bbox:
+         if isinstance(raw, str):
+             raw = raw.strip()
+             if not raw:
+                 raw = 0
+             else:
+                 raw = float(raw)
+         elif isinstance(raw, (int, float)):
+             raw = float(raw)
+         else:
+             raw = 0.0
+         coords.append(raw)
+     scale = max(abs(v) for v in coords) if coords else 1.0
+     if scale > 1.5:  # assume 0..1000 or pixel coordinates
+         coords = [max(0.0, min(v / 1000.0, 1.0)) for v in coords]
+     else:
+         coords = [max(0.0, min(v, 1.0)) for v in coords]
+     x1, y1, x2, y2 = coords
+     x_min, x_max = sorted((x1, x2))
+     y_min, y_max = sorted((y1, y2))
+     return (x_min, y_min, x_max, y_max)
+
+
+ def parse_roi_evidence(response_text: str, default_step_index: int) -> List[GroundedEvidence]:
+     """Parse ROI grounding output into evidence structures."""
+
+     try:
+         payload = _load_first_json(response_text)
+     except ValueError:
+         return _parse_roi_from_text(response_text, default_step_index=default_step_index)
+     if not isinstance(payload, list):
+         raise ValueError("ROI extraction response must be a JSON list.")
+
+     evidences: List[GroundedEvidence] = []
+     for item in payload:
+         if not isinstance(item, dict):
+             continue
+         raw_bbox = item.get("bbox") or item.get("bbox_2d") or item.get("box")
+         if raw_bbox is None:
+             continue
+         try:
+             bbox = _normalize_bbox(raw_bbox)
+         except ValueError:
+             continue
+         step_index = item.get("step") or item.get("step_index") or default_step_index
+         if not isinstance(step_index, int):
+             step_index = default_step_index
+         description = item.get("description") or item.get("caption") or item.get("detail")
+         if isinstance(description, str):
+             description = description.strip() or None
+         else:
+             description = None
+         confidence = item.get("confidence") or item.get("score") or item.get("probability")
+         if isinstance(confidence, str):
+             confidence = confidence.strip()
+             confidence = float(confidence) if confidence else None
+         elif isinstance(confidence, (int, float)):
+             confidence = float(confidence)
+         else:
+             confidence = None
+         evidences.append(
+             GroundedEvidence(
+                 step_index=step_index,
+                 bbox=bbox,
+                 description=description,
+                 confidence=confidence,
+                 raw_source=item,
+             )
+         )
+     return evidences
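A short sketch of the two ROI parsing paths above, one JSON reply and one free-text fallback; both reply strings are illustrative stand-ins for model output.

```python
from corgi.parsers import parse_roi_evidence

# JSON-style reply: boxes on the 0-1000 grid are rescaled to normalized 0-1 coordinates.
json_reply = '[{"step": 2, "bbox": [120, 250, 180, 320], "description": "white watch", "confidence": 0.9}]'
evidences = parse_roi_evidence(json_reply, default_step_index=2)
print(evidences[0].bbox)  # (0.12, 0.25, 0.18, 0.32)

# Free-text fallback: a bare [x1, y1, x2, y2] span is still picked up, and the
# preceding snippet becomes the description.
text_reply = "The watch is visible at [120, 250, 180, 320] on the left wrist"
evidences = parse_roi_evidence(text_reply, default_step_index=2)
print(evidences[0].bbox, evidences[0].description)
```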
corgi/pipeline.py ADDED
@@ -0,0 +1,92 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import List, Protocol
+
+ from PIL import Image
+
+ from .types import (
+     GroundedEvidence,
+     ReasoningStep,
+     evidences_to_serializable,
+     steps_to_serializable,
+ )
+
+
+ class SupportsQwenClient(Protocol):
+     """Protocol describing the methods required from a Qwen3-VL client."""
+
+     def structured_reasoning(self, image: Image.Image, question: str, max_steps: int) -> List[ReasoningStep]:
+         ...
+
+     def extract_step_evidence(
+         self,
+         image: Image.Image,
+         question: str,
+         step: ReasoningStep,
+         max_regions: int,
+     ) -> List[GroundedEvidence]:
+         ...
+
+     def synthesize_answer(
+         self,
+         image: Image.Image,
+         question: str,
+         steps: List[ReasoningStep],
+         evidences: List[GroundedEvidence],
+     ) -> str:
+         ...
+
+
+ @dataclass(frozen=True)
+ class PipelineResult:
+     """Aggregated output of the CoRGI pipeline."""
+
+     question: str
+     steps: List[ReasoningStep]
+     evidence: List[GroundedEvidence]
+     answer: str
+
+     def to_json(self) -> dict:
+         return {
+             "question": self.question,
+             "steps": steps_to_serializable(self.steps),
+             "evidence": evidences_to_serializable(self.evidence),
+             "answer": self.answer,
+         }
+
+
+ class CoRGIPipeline:
+     """Orchestrates the CoRGI reasoning pipeline using a Qwen3-VL client."""
+
+     def __init__(self, vlm_client: SupportsQwenClient):
+         if vlm_client is None:
+             raise ValueError("A Qwen3-VL client instance must be provided.")
+         self._vlm = vlm_client
+
+     def run(
+         self,
+         image: Image.Image,
+         question: str,
+         max_steps: int = 4,
+         max_regions: int = 4,
+     ) -> PipelineResult:
+         steps = self._vlm.structured_reasoning(image=image, question=question, max_steps=max_steps)
+         evidences: List[GroundedEvidence] = []
+         for step in steps:
+             if not step.needs_vision:
+                 continue
+             step_evs = self._vlm.extract_step_evidence(
+                 image=image,
+                 question=question,
+                 step=step,
+                 max_regions=max_regions,
+             )
+             if not step_evs:
+                 continue
+             evidences.extend(step_evs[:max_regions])
+         answer = self._vlm.synthesize_answer(image=image, question=question, steps=steps, evidences=evidences)
+         return PipelineResult(question=question, steps=steps, evidence=evidences, answer=answer)
+
+
+ __all__ = ["CoRGIPipeline", "PipelineResult"]
corgi/qwen_client.py ADDED
@@ -0,0 +1,176 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import List, Optional
+
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ from .parsers import parse_roi_evidence, parse_structured_reasoning
+ from .types import GroundedEvidence, ReasoningStep
+
+
+ DEFAULT_REASONING_PROMPT = (
+     "You are a careful multimodal reasoner following the CoRGI protocol. "
+     "Given the question and the image, produce a JSON array of reasoning steps. "
+     "Each item must contain the keys: index (1-based integer), statement (concise sentence), "
+     "needs_vision (boolean true if the statement requires visual verification), and reason "
+     "(short phrase explaining why visual verification is or is not required). "
+     "Limit the number of steps to {max_steps}. Respond with JSON only; start the reply with '[' and end with ']'. "
+     "Do not add any commentary or prose outside of the JSON."
+ )
+
+ DEFAULT_GROUNDING_PROMPT = (
+     "You are validating the following reasoning step:\n"
+     "{step_statement}\n"
+     "Return a JSON array with up to {max_regions} region candidates that help verify the step. "
+     "Each object must include: step (integer), bbox (list of four numbers x1,y1,x2,y2, "
+     "either normalized 0-1 or scaled 0-1000), description (short textual evidence), "
+     "and confidence (0-1). Use [] if no relevant region exists. "
+     "Respond with JSON only; do not include explanations outside the JSON array."
+ )
+
+ DEFAULT_ANSWER_PROMPT = (
+     "You are finalizing the answer using verified evidence. "
+     "Question: {question}\n"
+     "Structured reasoning steps:\n"
+     "{steps}\n"
+     "Verified evidence items:\n"
+     "{evidence}\n"
+     "Respond with a concise final answer sentence grounded in the evidence. "
+     "If unsure, say you are uncertain. Do not include <think> tags or internal monologue."
+ )
+
+
+ def _format_steps_for_prompt(steps: List[ReasoningStep]) -> str:
+     return "\n".join(
+         f"{step.index}. {step.statement} (needs vision: {step.needs_vision})"
+         for step in steps
+     )
+
+
+ def _format_evidence_for_prompt(evidences: List[GroundedEvidence]) -> str:
+     if not evidences:
+         return "No evidence collected."
+     lines = []
+     for ev in evidences:
+         desc = ev.description or "No description"
+         bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
+         conf = f"{ev.confidence:.2f}" if ev.confidence is not None else "n/a"
+         lines.append(f"Step {ev.step_index}: bbox=({bbox}), conf={conf}, desc={desc}")
+     return "\n".join(lines)
+
+
+ def _strip_think_content(text: str) -> str:
+     if not text:
+         return ""
+     cleaned = text
+     if "</think>" in cleaned:
+         cleaned = cleaned.split("</think>", 1)[-1]
+     cleaned = cleaned.replace("<think>", "")
+     return cleaned.strip()
+
+
+ @dataclass
+ class QwenGenerationConfig:
+     model_id: str = "Qwen/Qwen3-VL-8B-Thinking"
+     max_new_tokens: int = 512
+     temperature: float | None = None
+     do_sample: bool = False
+
+
+ class Qwen3VLClient:
+     """Wrapper around transformers Qwen3-VL chat API for CoRGI pipeline."""
+
+     def __init__(
+         self,
+         config: Optional[QwenGenerationConfig] = None,
+     ) -> None:
+         self.config = config or QwenGenerationConfig()
+         torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+         self._model = AutoModelForImageTextToText.from_pretrained(
+             self.config.model_id,
+             torch_dtype=torch_dtype,
+             device_map="auto",
+         )
+         self._processor = AutoProcessor.from_pretrained(self.config.model_id)
+
+     def _chat(
+         self,
+         image: Image.Image,
+         prompt: str,
+         max_new_tokens: Optional[int] = None,
+     ) -> str:
+         messages = [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "image", "image": image},
+                     {"type": "text", "text": prompt},
+                 ],
+             }
+         ]
+         chat_prompt = self._processor.apply_chat_template(
+             messages,
+             add_generation_prompt=True,
+             tokenize=False,
+         )
+         inputs = self._processor(
+             text=[chat_prompt],
+             images=[image],
+             return_tensors="pt",
+         ).to(self._model.device)
+         gen_kwargs = {
+             "max_new_tokens": max_new_tokens or self.config.max_new_tokens,
+             "do_sample": self.config.do_sample,
+         }
+         if self.config.do_sample and self.config.temperature is not None:
+             gen_kwargs["temperature"] = self.config.temperature
+         output_ids = self._model.generate(**inputs, **gen_kwargs)
+         prompt_length = inputs.input_ids.shape[1]
+         generated_tokens = output_ids[:, prompt_length:]
+         response = self._processor.batch_decode(
+             generated_tokens,
+             skip_special_tokens=True,
+             clean_up_tokenization_spaces=False,
+         )[0]
+         return response.strip()
+
+     def structured_reasoning(self, image: Image.Image, question: str, max_steps: int) -> List[ReasoningStep]:
+         prompt = DEFAULT_REASONING_PROMPT.format(max_steps=max_steps) + f"\nQuestion: {question}"
+         response = self._chat(image=image, prompt=prompt)
+         return parse_structured_reasoning(response, max_steps=max_steps)
+
+     def extract_step_evidence(
+         self,
+         image: Image.Image,
+         question: str,
+         step: ReasoningStep,
+         max_regions: int,
+     ) -> List[GroundedEvidence]:
+         prompt = DEFAULT_GROUNDING_PROMPT.format(
+             step_statement=step.statement,
+             max_regions=max_regions,
+         )
+         response = self._chat(image=image, prompt=prompt, max_new_tokens=256)
+         evidences = parse_roi_evidence(response, default_step_index=step.index)
+         return evidences[:max_regions]
+
+     def synthesize_answer(
+         self,
+         image: Image.Image,
+         question: str,
+         steps: List[ReasoningStep],
+         evidences: List[GroundedEvidence],
+     ) -> str:
+         prompt = DEFAULT_ANSWER_PROMPT.format(
+             question=question,
+             steps=_format_steps_for_prompt(steps),
+             evidence=_format_evidence_for_prompt(evidences),
+         )
+         response = self._chat(image=image, prompt=prompt, max_new_tokens=256)
+         return _strip_think_content(response)
+
+
+ __all__ = ["Qwen3VLClient", "QwenGenerationConfig"]
corgi/types.py ADDED
@@ -0,0 +1,61 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Dict, List, Optional, Tuple
+
+
+ BBox = Tuple[float, float, float, float]
+
+
+ @dataclass(frozen=True)
+ class ReasoningStep:
+     """Represents a single structured reasoning step."""
+
+     index: int
+     statement: str
+     needs_vision: bool
+     reason: Optional[str] = None
+
+
+ @dataclass(frozen=True)
+ class GroundedEvidence:
+     """Evidence item grounded to a region of interest in the image."""
+
+     step_index: int
+     bbox: BBox
+     description: Optional[str] = None
+     confidence: Optional[float] = None
+     raw_source: Optional[Dict[str, object]] = None
+
+
+ def steps_to_serializable(steps: List[ReasoningStep]) -> List[Dict[str, object]]:
+     """Helper to convert steps into JSON-friendly dictionaries."""
+
+     return [
+         {
+             "index": step.index,
+             "statement": step.statement,
+             "needs_vision": step.needs_vision,
+             **({"reason": step.reason} if step.reason is not None else {}),
+         }
+         for step in steps
+     ]
+
+
+ def evidences_to_serializable(evidences: List[GroundedEvidence]) -> List[Dict[str, object]]:
+     """Helper to convert evidences into JSON-friendly dictionaries."""
+
+     serializable: List[Dict[str, object]] = []
+     for ev in evidences:
+         item: Dict[str, object] = {
+             "step_index": ev.step_index,
+             "bbox": list(ev.bbox),
+         }
+         if ev.description is not None:
+             item["description"] = ev.description
+         if ev.confidence is not None:
+             item["confidence"] = ev.confidence
+         if ev.raw_source is not None:
+             item["raw_source"] = ev.raw_source
+         serializable.append(item)
+     return serializable
examples/demo_qwen_corgi.py ADDED
@@ -0,0 +1,85 @@
+ #!/usr/bin/env python
+ """Run CoRGI pipeline on the Qwen3-VL demo image and question.
+
+ Usage:
+     python examples/demo_qwen_corgi.py [--model-id Qwen/Qwen3-VL-8B-Thinking]
+
+ If the demo image cannot be downloaded automatically, set the environment
+ variable `CORGI_DEMO_IMAGE` to a local file path.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ from io import BytesIO
+ from pathlib import Path
+ from urllib.request import urlopen
+
+ from PIL import Image
+
+ from corgi.pipeline import CoRGIPipeline
+ from corgi.qwen_client import Qwen3VLClient, QwenGenerationConfig
+
+ DEMO_IMAGE_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
+ DEMO_QUESTION = "How many people are there in the image? Is there any one who is wearing a white watch?"
+
+
+ def fetch_demo_image() -> Image.Image:
+     if path := os.getenv("CORGI_DEMO_IMAGE"):
+         return Image.open(path).convert("RGB")
+     with urlopen(DEMO_IMAGE_URL) as resp:  # nosec B310 - trusted URL from official demo asset
+         data = resp.read()
+     return Image.open(BytesIO(data)).convert("RGB")
+
+
+ def format_steps(pipeline_result) -> str:
+     lines = ["Reasoning steps:"]
+     for step in pipeline_result.steps:
+         needs = "yes" if step.needs_vision else "no"
+         reason = f" (reason: {step.reason})" if step.reason else ""
+         lines.append(f" [{step.index}] {step.statement} — needs vision: {needs}{reason}")
+     return "\n".join(lines)
+
+
+ def format_evidence(pipeline_result) -> str:
+     lines = ["Visual evidence:"]
+     if not pipeline_result.evidence:
+         lines.append(" (no evidence returned)")
+         return "\n".join(lines)
+     for ev in pipeline_result.evidence:
+         bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
+         desc = ev.description or "(no description)"
+         conf = f", conf={ev.confidence:.2f}" if ev.confidence is not None else ""
+         lines.append(f" Step {ev.step_index}: bbox=({bbox}), desc={desc}{conf}")
+     return "\n".join(lines)
+
+
+ def main() -> int:
+     parser = argparse.ArgumentParser(description="Run CoRGI pipeline with the real Qwen3-VL model.")
+     parser.add_argument("--model-id", default="Qwen/Qwen3-VL-8B-Thinking", help="Hugging Face model id for Qwen3-VL")
+     parser.add_argument("--max-steps", type=int, default=4)
+     parser.add_argument("--max-regions", type=int, default=4)
+     args = parser.parse_args()
+
+     image = fetch_demo_image()
+     client = Qwen3VLClient(QwenGenerationConfig(model_id=args.model_id))
+     pipeline = CoRGIPipeline(client)
+
+     result = pipeline.run(
+         image=image,
+         question=DEMO_QUESTION,
+         max_steps=args.max_steps,
+         max_regions=args.max_regions,
+     )
+
+     print(f"Question: {DEMO_QUESTION}")
+     print(format_steps(result))
+     print(format_evidence(result))
+     print("Answer:", result.answer)
+
+     return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ accelerate>=0.34
+ transformers>=4.45
+ pillow
+ torch
+ gradio>=4.44
+ hydra-core
+ antlr4-python3-runtime