dung-vpt-uney committed
Commit b6a01d6 · 1 Parent(s): b9f8b29

Deploy latest CoRGI Gradio demo
PROGRESS_LOG.md ADDED
@@ -0,0 +1,33 @@
+ # CoRGI Custom Demo — Progress Log
+
+ > Keep this log short and chronological. Newest updates at the top.
+
+ ## 2024-10-22
+ - Reproduced the CoRGI pipeline failure with the real `Qwen/Qwen3-VL-8B-Thinking` checkpoint and traced it to reasoning outputs that only use ordinal step words.
+ - Taught the text parser to normalize “First/Second step” style markers into numeric indices, refreshed the unit tests to cover the new heuristic, and reran the demo/end-to-end pipeline successfully.
+ - Tidied Qwen generation settings to avoid unused temperature flags when running deterministically.
+ - Validated ROI extraction on a vision-heavy prompt against the real model and hardened prompts so responses stay in structured JSON without verbose preambles.
+ - Added meta-comment pruning so thinking-mode rambles (e.g., redundant “Step 3” reflections) are dropped while preserving genuine reasoning; confirmed with the official demo image that only meaningful steps remain.
+
+ ## 2024-10-21
+ - Updated default checkpoints to `Qwen/Qwen3-VL-8B-Thinking` and verified CLI/Gradio/test coverage.
+ - Exercised the real model to capture thinking-style outputs; added parser fallbacks for textual reasoning/ROI responses and stripped `<think>` tags from answer synthesis.
+ - Extended unit test suite (reasoning, ROI, client helpers) to cover the new parsing paths and ran `pytest` successfully.
+
+ ## 2024-10-20
+ - Added optional integration test (`corgi_tests/test_integration_qwen.py`) gated by `CORGI_RUN_QWEN_INTEGRATION` for running the real Qwen3-VL model on the official demo asset.
+ - Created runnable example script (`examples/demo_qwen_corgi.py`) to reproduce the Hugging Face demo prompt locally with structured pipeline logging.
+ - Published Hugging Face Space harness (`app.py`) and deployment helper (`scripts/push_space.sh`) including requirements for ZeroGPU tier.
+ - Documented cookbook alignment and inference tips (`QWEN_INFERENCE_NOTES.md`).
+ - Added CLI runner (`corgi.cli`) with formatting helpers plus JSON export; authored matching unittest coverage.
+ - Implemented Gradio demo harness (`corgi.gradio_app`) with markdown reporting and helper utilities for dependency injection.
+ - Expanded unit test suite (CLI + Gradio) and ran `pytest corgi_tests` successfully (1 skip when gradio missing).
+ - Initialized structured project plan and progress log scaffolding.
+ - Assessed existing modules (`corgi.pipeline`, `corgi.qwen_client`, parsers, tests) to identify pending demo features (CLI + Gradio).
+ - Confirmed Qwen3-VL will be the single backbone for reasoning, ROI verification, and answer synthesis.
+
+ <!-- Template for future updates:
+ ## YYYY-MM-DD
+ - Summary of change / milestone.
+ - Follow-up actions.
+ -->
PROJECT_PLAN.md ADDED
@@ -0,0 +1,51 @@
+ # CoRGI Custom Demo — Project Plan
+
+ ## Context
+ - **Objective**: ship a runnable CoRGI demo (CLI + Gradio) powered entirely by Qwen3-VL for structured reasoning, ROI evidence extraction, and answer synthesis.
+ - **Scope**: stay within the `corgi_custom` package, reuse Qwen3-VL cookbooks where possible, keep dependency footprint minimal (no extra detectors/rerankers).
+ - **Environment**: Conda env `pytorch`, default VLM `Qwen/Qwen3-VL-8B-Thinking`.
+
+ ## Milestones
+ | Status | Milestone | Notes |
+ | --- | --- | --- |
+ | ✅ | Core pipeline skeleton (dataclasses, parsers, Qwen client wrappers) | Already merged in repo. |
+ | ✅ | Project documentation & progress tracking scaffolding | Plan + progress log committed. |
+ | ✅ | CLI runner that prints step-by-step pipeline output | Supports overrides + JSON export. |
+ | ✅ | Gradio demo mirroring CLI functionality | Blocks UI with markdown report messaging. |
+ | ✅ | Automated tests for new modules | CLI + Gradio helpers covered with unit tests. |
+ | ✅ | HF Space deployment automation | Bash script + app harness for ZeroGPU Spaces. |
+ | 🟡 | Final verification (unit tests, smoke instructions) | Document how to run `pytest` and the demos. |
+
+ ## Work Breakdown Structure
+ 1. **Docs & Tracking**
+    - [x] Finalize plan and progress log templates.
+    - [x] Document environment setup expectations.
+ 2. **Pipeline UX**
+    - [x] Implement CLI entrypoint (`corgi.cli:main`).
+    - [x] Provide structured stdout for steps/evidence/answer.
+    - [x] Allow optional JSON dump for downstream tooling.
+ 3. **Interactive Demo**
+    - [x] Build Gradio app harness (image upload + question textbox).
+    - [ ] Stream progress (optional) and display textual reasoning/evidence.
+    - [x] Handle model loading errors gracefully.
+ 4. **Testing & Tooling**
+    - [x] Add fixture-friendly helpers to avoid heavy model loads in tests.
+    - [x] Write unit tests for CLI argument parsing + formatting.
+    - [ ] Add regression test for pipeline serialization (see the serialization sketch just after this plan).
+ 5. **Docs & Hand-off**
+    - [ ] Update README/demo instructions.
+    - [ ] Provide sample command sequences for CLI/Gradio.
+    - [ ] Capture open risks & future enhancements.
+ 6. **Deployment & Ops**
+    - [x] Add Hugging Face Space entrypoint (`app.py`).
+    - [x] Write deployment helper script (`scripts/push_space.sh`).
+    - [ ] Add automated checklists/logs for Space updates.
+
+ ## Risks & Mitigations
+ - **Model loading latency / VRAM** → expose config knobs and mention 4B fallback.
+ - **Parsing drift from Qwen outputs** → keep parser tolerant; add debug flag to dump raw responses.
+ - **Test runtime** → mock Qwen client via fixtures; avoid loading real model in unit tests.
+
+ ## Progress Tracking
+ - Refer to `PROGRESS_LOG.md` for dated status updates.
+ - Update milestone table whenever a deliverable completes.
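The pending serialization regression test would pin down the shape produced by `PipelineResult.to_json()`; here is a minimal sketch built from the committed dataclasses, with all values hand-made for illustration.

```python
from corgi.pipeline import PipelineResult
from corgi.types import GroundedEvidence, ReasoningStep

# Hand-built result standing in for a real pipeline run.
result = PipelineResult(
    question="Is anyone wearing a white watch?",
    steps=[ReasoningStep(index=1, statement="Check each wrist", needs_vision=True, reason="visual detail")],
    evidence=[GroundedEvidence(step_index=1, bbox=(0.12, 0.25, 0.18, 0.32), description="white watch", confidence=0.9)],
    answer="Yes, one person wears a white watch.",
)

payload = result.to_json()
assert payload["steps"][0]["needs_vision"] is True
assert payload["evidence"][0]["bbox"] == [0.12, 0.25, 0.18, 0.32]
assert payload["answer"].startswith("Yes")
```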
QWEN_INFERENCE_NOTES.md ADDED
@@ -0,0 +1,19 @@
+ # Qwen3-VL Cookbook Alignment
+
+ This project mirrors the official Qwen3-VL cookbook patterns (see `../Qwen3-VL/cookbooks`) when running the CoRGI pipeline with the real model.
+
+ ## Key Parallels
+ - **Model + Processor loading**: We rely on `AutoModelForImageTextToText` and `AutoProcessor` exactly as described in the main README and cookbook notebooks such as `think_with_images.ipynb`.
+ - **Chat template**: `Qwen3VLClient` uses `processor.apply_chat_template(..., add_generation_prompt=True)` before calling `generate`, which matches the recommended multi-turn messaging flow.
+ - **Image transport**: Both the pipeline and demo scripts accept PIL images and ensure conversion to RGB prior to inference, mirroring cookbook utilities that normalize channels.
+ - **Max tokens & decoding**: The default `max_new_tokens=512` with greedy decoding (temperature is only forwarded when sampling is enabled) aligns with cookbook demos favouring deterministic outputs for evaluation.
+ - **Single-model pipeline**: All stages (reasoning, ROI extraction, answer synthesis) are executed by the same Qwen3-VL instance, following the cookbook philosophy of leveraging the model’s intrinsic grounding capability without external detectors.
+
+ ## Practical Tips for Local Inference
+ - Use the `pytorch` Conda env with the latest `transformers` (>=4.45) to access `AutoModelForImageTextToText` support, as advised in the cookbook README.
+ - When VRAM is limited, switch to `Qwen/Qwen3-VL-4B-Instruct` via `--model-id` or the `CORGI_QWEN_MODEL` environment variable; no other code changes are needed.
+ - The integration test (`corgi_tests/test_integration_qwen.py`) and demo (`examples/demo_qwen_corgi.py`) download the official demo image if `CORGI_DEMO_IMAGE` is not supplied, matching cookbook notebooks that reference the same asset URL.
+ - For reproducibility, set `HF_HOME` (or use the cookbook’s `snapshot_download`) to manage local caches and avoid repeated downloads.
+ - The `Qwen/Qwen3-VL-8B-Thinking` checkpoint often emits free-form “thinking” text instead of JSON; the pipeline now falls back to parsing those narratives for step and ROI extraction, and strips `<think>…</think>` scaffolding from final answers.
+
+ These notes ensure our CoRGI adaptation stays consistent with the official Qwen workflow while keeping the codebase modular for experimentation.
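As a concrete illustration of the VRAM tip above, the lighter checkpoint can also be selected explicitly when constructing the client. This is a sketch using the classes committed below; it assumes a GPU-capable environment and downloads the checkpoint on first use.

```python
from corgi.pipeline import CoRGIPipeline
from corgi.qwen_client import Qwen3VLClient, QwenGenerationConfig

# Lighter checkpoint for constrained VRAM; the rest of the pipeline is unchanged.
config = QwenGenerationConfig(
    model_id="Qwen/Qwen3-VL-4B-Instruct",
    max_new_tokens=512,
    do_sample=False,  # deterministic decoding; temperature is only forwarded when sampling
)
pipeline = CoRGIPipeline(vlm_client=Qwen3VLClient(config=config))
```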
README.md CHANGED
@@ -1,12 +1,13 @@
- ---
- title: Corgi Qwen3 Vl Demo
- emoji: 😻
- colorFrom: green
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # CoRGI Qwen3-VL Demo
+
+ This Space hosts the CoRGI reasoning pipeline backed by the Qwen/Qwen3-VL-8B-Thinking model.
+
+ ## Run Locally
+ ```
+ pip install -r requirements.txt
+ python examples/demo_qwen_corgi.py
+ ```
+
+ ## Notes
+ - The demo queues requests sequentially (ZeroGPU/cpu-basic hardware).
+ - Configure `CORGI_QWEN_MODEL` to switch to a different checkpoint.
app.py ADDED
@@ -0,0 +1,10 @@
+ """Hugging Face Spaces entrypoint for the CoRGI Qwen3-VL demo."""
+
+ from corgi.gradio_app import build_demo
+
+
+ demo = build_demo()
+ demo.queue(default_concurrency_limit=1)  # Gradio 4+ replaced queue(concurrency_count=...)
+
+ if __name__ == "__main__":
+     demo.launch()
corgi/__init__.py ADDED
@@ -0,0 +1,11 @@
+ """CoRGI pipeline package using Qwen3-VL."""
+
+ from .pipeline import CoRGIPipeline, PipelineResult
+ from .types import GroundedEvidence, ReasoningStep
+
+ __all__ = [
+     "CoRGIPipeline",
+     "PipelineResult",
+     "GroundedEvidence",
+     "ReasoningStep",
+ ]
corgi/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (413 Bytes).
corgi/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (419 Bytes).
corgi/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (419 Bytes).
corgi/__pycache__/cli.cpython-312.pyc ADDED
Binary file (6.41 kB).
corgi/__pycache__/cli.cpython-313.pyc ADDED
Binary file (6.39 kB).
corgi/__pycache__/gradio_app.cpython-312.pyc ADDED
Binary file (8.03 kB).
corgi/__pycache__/gradio_app.cpython-313.pyc ADDED
Binary file (8.24 kB).
corgi/__pycache__/parsers.cpython-310.pyc ADDED
Binary file (4.61 kB).
corgi/__pycache__/parsers.cpython-312.pyc ADDED
Binary file (18.1 kB).
corgi/__pycache__/parsers.cpython-313.pyc ADDED
Binary file (18.8 kB).
corgi/__pycache__/pipeline.cpython-310.pyc ADDED
Binary file (3.13 kB).
corgi/__pycache__/pipeline.cpython-312.pyc ADDED
Binary file (3.86 kB).
corgi/__pycache__/pipeline.cpython-313.pyc ADDED
Binary file (3.97 kB).
corgi/__pycache__/qwen_client.cpython-312.pyc ADDED
Binary file (9.01 kB).
corgi/__pycache__/qwen_client.cpython-313.pyc ADDED
Binary file (9.13 kB).
corgi/__pycache__/types.cpython-310.pyc ADDED
Binary file (2.16 kB).
corgi/__pycache__/types.cpython-312.pyc ADDED
Binary file (2.67 kB).
corgi/__pycache__/types.cpython-313.pyc ADDED
Binary file (2.77 kB).
corgi/cli.py ADDED
@@ -0,0 +1,131 @@
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import sys
+ from pathlib import Path
+ from typing import Callable, Optional, TextIO
+
+ from PIL import Image
+
+ from .pipeline import CoRGIPipeline
+ from .qwen_client import Qwen3VLClient, QwenGenerationConfig
+ from .types import GroundedEvidence, ReasoningStep
+
+ DEFAULT_MODEL_ID = "Qwen/Qwen3-VL-8B-Thinking"
+
+
+ def build_parser() -> argparse.ArgumentParser:
+     parser = argparse.ArgumentParser(
+         prog="corgi-cli",
+         description="Run the CoRGI reasoning pipeline over an image/question pair.",
+     )
+     parser.add_argument("--image", type=Path, required=True, help="Path to the input image (jpg/png/etc.)")
+     parser.add_argument("--question", type=str, required=True, help="Visual question for the image")
+     parser.add_argument("--max-steps", type=int, default=4, help="Maximum number of reasoning steps to request")
+     parser.add_argument(
+         "--max-regions",
+         type=int,
+         default=4,
+         help="Maximum number of grounded regions per visual step",
+     )
+     parser.add_argument(
+         "--model-id",
+         type=str,
+         default=None,
+         help="Optional override for the Qwen3-VL model identifier",
+     )
+     parser.add_argument(
+         "--json-out",
+         type=Path,
+         default=None,
+         help="Optional path to write the pipeline result as JSON",
+     )
+     return parser
+
+
+ def _format_step(step: ReasoningStep) -> str:
+     needs = "yes" if step.needs_vision else "no"
+     suffix = f"; reason: {step.reason}" if step.reason else ""
+     return f"[{step.index}] {step.statement} (needs vision: {needs}{suffix})"
+
+
+ def _format_evidence_item(evidence: GroundedEvidence) -> str:
+     bbox = ", ".join(f"{coord:.2f}" for coord in evidence.bbox)
+     parts = [f"Step {evidence.step_index} | bbox=({bbox})"]
+     if evidence.description:
+         parts.append(f"desc: {evidence.description}")
+     if evidence.confidence is not None:
+         parts.append(f"conf: {evidence.confidence:.2f}")
+     return " | ".join(parts)
+
+
+ def _default_pipeline_factory(model_id: Optional[str]) -> CoRGIPipeline:
+     config = QwenGenerationConfig(model_id=model_id or DEFAULT_MODEL_ID)
+     client = Qwen3VLClient(config=config)
+     return CoRGIPipeline(vlm_client=client)
+
+
+ def execute_cli(
+     *,
+     image_path: Path,
+     question: str,
+     max_steps: int,
+     max_regions: int,
+     model_id: Optional[str],
+     json_out: Optional[Path],
+     pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+     output_stream: TextIO | None = None,
+ ) -> None:
+     if output_stream is None:
+         output_stream = sys.stdout
+     factory = pipeline_factory or _default_pipeline_factory
+
+     with Image.open(image_path) as img:
+         image = img.convert("RGB")
+         pipeline = factory(model_id)
+         result = pipeline.run(
+             image=image,
+             question=question,
+             max_steps=max_steps,
+             max_regions=max_regions,
+         )
+
+     print(f"Question: {question}", file=output_stream)
+     print("-- Steps --", file=output_stream)
+     for step in result.steps:
+         print(_format_step(step), file=output_stream)
+     if not result.steps:
+         print("(no reasoning steps returned)", file=output_stream)
+
+     print("-- Evidence --", file=output_stream)
+     if result.evidence:
+         for evidence in result.evidence:
+             print(_format_evidence_item(evidence), file=output_stream)
+     else:
+         print("(no visual evidence)", file=output_stream)
+
+     print("-- Answer --", file=output_stream)
+     print(f"Answer: {result.answer}", file=output_stream)
+
+     if json_out is not None:
+         json_out.parent.mkdir(parents=True, exist_ok=True)
+         with json_out.open("w", encoding="utf-8") as handle:
+             json.dump(result.to_json(), handle, ensure_ascii=False, indent=2)
+
+
+ def main(argv: Optional[list[str]] = None) -> int:
+     parser = build_parser()
+     args = parser.parse_args(argv)
+     execute_cli(
+         image_path=args.image,
+         question=args.question,
+         max_steps=args.max_steps,
+         max_regions=args.max_regions,
+         model_id=args.model_id,
+         json_out=args.json_out,
+     )
+     return 0
+
+
+ __all__ = ["build_parser", "execute_cli", "main"]
corgi/gradio_app.py ADDED
@@ -0,0 +1,166 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Callable, Optional
+
+ from PIL import Image
+
+ from .cli import DEFAULT_MODEL_ID
+ from .pipeline import CoRGIPipeline, PipelineResult
+ from .qwen_client import Qwen3VLClient, QwenGenerationConfig
+
+
+ @dataclass
+ class PipelineState:
+     model_id: str
+     pipeline: Optional[CoRGIPipeline]
+
+
+ def _default_factory(model_id: Optional[str]) -> CoRGIPipeline:
+     config = QwenGenerationConfig(model_id=model_id or DEFAULT_MODEL_ID)
+     return CoRGIPipeline(vlm_client=Qwen3VLClient(config=config))
+
+
+ def ensure_pipeline_state(
+     previous: Optional[PipelineState],
+     model_id: Optional[str],
+     factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+ ) -> PipelineState:
+     target_model = model_id or DEFAULT_MODEL_ID
+     factory = factory or _default_factory
+     if previous is not None and previous.model_id == target_model:
+         return previous
+     pipeline = factory(target_model)
+     return PipelineState(model_id=target_model, pipeline=pipeline)
+
+
+ def format_result_markdown(result: PipelineResult) -> str:
+     lines: list[str] = []
+     lines.append("### Answer")
+     lines.append(result.answer or "(no answer returned)")
+     lines.append("")
+     lines.append("### Reasoning Steps")
+     if result.steps:
+         for step in result.steps:
+             needs = "yes" if step.needs_vision else "no"
+             reason = f" — {step.reason}" if step.reason else ""
+             lines.append(f"- **Step {step.index}**: {step.statement} _(needs vision: {needs})_{reason}")
+     else:
+         lines.append("- No reasoning steps returned.")
+     lines.append("")
+     lines.append("### Visual Evidence")
+     if result.evidence:
+         for ev in result.evidence:
+             bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
+             desc = ev.description or "(no description)"
+             conf = f" — confidence {ev.confidence:.2f}" if ev.confidence is not None else ""
+             lines.append(f"- Step {ev.step_index}: bbox=({bbox}) — {desc}{conf}")
+     else:
+         lines.append("- No visual evidence collected.")
+     return "\n".join(lines)
+
+
+ def _run_pipeline(
+     state: Optional[PipelineState],
+     image: Image.Image | None,
+     question: str,
+     max_steps: int,
+     max_regions: int,
+     model_id: Optional[str],
+     factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+ ) -> tuple[PipelineState, str]:
+     if image is None:
+         return state or PipelineState(model_id=model_id or DEFAULT_MODEL_ID, pipeline=None), "Please provide an image before running the demo."
+     if not question.strip():
+         return state or PipelineState(model_id=model_id or DEFAULT_MODEL_ID, pipeline=None), "Please enter a question before running the demo."
+     new_state = ensure_pipeline_state(state if state and state.pipeline else None, model_id, factory)
+     result = new_state.pipeline.run(
+         image=image.convert("RGB"),
+         question=question.strip(),
+         max_steps=int(max_steps),
+         max_regions=int(max_regions),
+     )
+     markdown = format_result_markdown(result)
+     return new_state, markdown
+
+
+ def build_demo(
+     pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+ ) -> "gradio.Blocks":
+     try:
+         import gradio as gr
+     except ImportError as exc:  # pragma: no cover - exercised when gradio missing
+         raise RuntimeError("Gradio is required to build the demo. Install gradio>=4.0.") from exc
+
+     factory = pipeline_factory or _default_factory
+
+     with gr.Blocks(title="CoRGI Qwen3-VL Demo") as demo:
+         state = gr.State()  # stores PipelineState
+
+         with gr.Row():
+             with gr.Column(scale=1, min_width=320):
+                 image_input = gr.Image(label="Input image", type="pil")
+                 question_input = gr.Textbox(label="Question", placeholder="What is happening in the image?", lines=2)
+                 model_id_input = gr.Textbox(
+                     label="Model ID",
+                     value=DEFAULT_MODEL_ID,
+                     placeholder="Leave blank to use default",
+                 )
+                 max_steps_slider = gr.Slider(
+                     label="Max reasoning steps",
+                     minimum=1,
+                     maximum=6,
+                     step=1,
+                     value=4,
+                 )
+                 max_regions_slider = gr.Slider(
+                     label="Max regions per step",
+                     minimum=1,
+                     maximum=6,
+                     step=1,
+                     value=4,
+                 )
+                 run_button = gr.Button("Run CoRGI")
+
+             with gr.Column(scale=1, min_width=320):
+                 result_markdown = gr.Markdown(value="Upload an image and ask a question to begin.")
+
+         def _on_submit(state_data, image, question, model_id, max_steps, max_regions):
+             pipeline_state = state_data if isinstance(state_data, PipelineState) else None
+             new_state, markdown = _run_pipeline(
+                 pipeline_state,
+                 image,
+                 question,
+                 int(max_steps),
+                 int(max_regions),
+                 model_id if model_id else None,
+                 factory,
+             )
+             return new_state, markdown
+
+         run_button.click(
+             fn=_on_submit,
+             inputs=[state, image_input, question_input, model_id_input, max_steps_slider, max_regions_slider],
+             outputs=[state, result_markdown],
+         )
+
+     return demo
+
+
+ def launch_demo(
+     *,
+     pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
+     **launch_kwargs,
+ ) -> None:
+     demo = build_demo(pipeline_factory=pipeline_factory)
+     demo.launch(**launch_kwargs)
+
+
+ __all__ = [
+     "PipelineState",
+     "ensure_pipeline_state",
+     "format_result_markdown",
+     "build_demo",
+     "launch_demo",
+     "DEFAULT_MODEL_ID",
+ ]
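`format_result_markdown` can be exercised without Gradio or a loaded model, and `build_demo(pipeline_factory=...)` accepts the same kind of injected factory when a full UI test is needed. A sketch of the markdown report with a hand-built result (all values illustrative):

```python
from corgi.gradio_app import format_result_markdown
from corgi.pipeline import PipelineResult
from corgi.types import GroundedEvidence, ReasoningStep

result = PipelineResult(
    question="Is anyone wearing a white watch?",
    steps=[ReasoningStep(index=1, statement="Inspect each wrist", needs_vision=True, reason="fine-grained detail")],
    evidence=[GroundedEvidence(step_index=1, bbox=(0.12, 0.25, 0.18, 0.32), description="white watch", confidence=0.88)],
    answer="Yes, the person on the left wears a white watch.",
)

# Renders the "### Answer", "### Reasoning Steps", and "### Visual Evidence" sections
# exactly as the Gradio Markdown panel would display them.
print(format_result_markdown(result))
```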
corgi/parsers.py ADDED
@@ -0,0 +1,390 @@
+ from __future__ import annotations
+
+ import json
+ import re
+ from typing import Any, Iterable, List
+
+ from .types import GroundedEvidence, ReasoningStep
+
+
+ _JSON_FENCE_RE = re.compile(r"```(?:json)?(.*?)```", re.DOTALL | re.IGNORECASE)
+ _STEP_MARKER_RE = re.compile(r"(?im)(?:^|\n)\s*(?:step\s*(\d+)|(\d+)[\.\)])\s*[:\-]?\s*")
+ _NEEDS_VISION_RE = re.compile(
+     r"needs[\s_]*vision\s*[:\-]?\s*(?P<value>true|false|yes|no|required|not required|necessary|unnecessary)",
+     re.IGNORECASE,
+ )
+ _REASON_RE = re.compile(r"reason\s*[:\-]\s*(?P<value>.+)", re.IGNORECASE)
+ _BOX_RE = re.compile(
+     r"\[\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*\]"
+ )
+
+ _ORDINAL_WORD_MAP = {
+     "first": 1,
+     "second": 2,
+     "third": 3,
+     "fourth": 4,
+     "fifth": 5,
+     "sixth": 6,
+     "seventh": 7,
+     "eighth": 8,
+     "ninth": 9,
+     "tenth": 10,
+ }
+
+ _NUMBER_WORD_MAP = {
+     "one": 1,
+     "two": 2,
+     "three": 3,
+     "four": 4,
+     "five": 5,
+     "six": 6,
+     "seven": 7,
+     "eight": 8,
+     "nine": 9,
+     "ten": 10,
+ }
+
+ _ORDINAL_STEP_RE = re.compile(
+     r"(?im)\b(?P<word>first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth)\s+step\b"
+ )
+ _WORD_STEP_RE = re.compile(
+     r"(?im)\bstep\s+(?P<word>one|two|three|four|five|six|seven|eight|nine|ten)\b"
+ )
+
+ _META_TOKENS = {"maybe", "wait", "let's", "lets", "question", "protocol"}
+
+
+ def _to_bool(value: Any) -> bool:
+     if isinstance(value, bool):
+         return value
+     if value is None:
+         return False
+     if isinstance(value, (int, float)):
+         return value != 0
+     if isinstance(value, str):
+         lowered = value.strip().lower()
+         if lowered in {"true", "t", "yes", "y", "1"}:
+             return True
+         if lowered in {"false", "f", "no", "n", "0"}:
+             return False
+     return False
+
+
+ def _extract_json_strings(text: str) -> Iterable[str]:
+     """Return candidate JSON payloads from the response text."""
+
+     fenced = _JSON_FENCE_RE.findall(text)
+     if fenced:
+         for body in fenced:
+             yield body.strip()
+     stripped = text.strip()
+     if stripped:
+         yield stripped
+
+
+ def _load_first_json(text: str) -> Any:
+     last_error = None
+     for candidate in _extract_json_strings(text):
+         try:
+             return json.loads(candidate)
+         except json.JSONDecodeError as err:
+             last_error = err
+             continue
+     if last_error:
+         raise ValueError(f"Unable to parse JSON from response: {last_error}") from last_error
+     raise ValueError("Empty response, cannot parse JSON.")
+
+
+ def _trim_reasoning_text(text: str) -> str:
+     lowered = text.lower()
+     for anchor in ("let's draft", "draft:", "structured steps", "final reasoning"):
+         pos = lowered.rfind(anchor)
+         if pos != -1:
+             return text[pos:]
+     return text
+
+
+ def _clean_sentence(text: str) -> str:
+     return " ".join(text.strip().split())
+
+
+ def _normalize_step_markers(text: str) -> str:
+     """Convert ordinal step markers into numeric form (e.g., 'First step' -> 'Step 1')."""
+
+     def replace_ordinal(match: re.Match[str]) -> str:
+         word = match.group("word").lower()
+         num = _ORDINAL_WORD_MAP.get(word)
+         return f"Step {num}" if num is not None else match.group(0)
+
+     def replace_word_number(match: re.Match[str]) -> str:
+         word = match.group("word").lower()
+         num = _NUMBER_WORD_MAP.get(word)
+         return f"Step {num}" if num is not None else match.group(0)
+
+     normalized = _ORDINAL_STEP_RE.sub(replace_ordinal, text)
+     normalized = _WORD_STEP_RE.sub(replace_word_number, normalized)
+     return normalized
+
+
+ def _extract_statement(body: str) -> str | None:
+     statement_match = re.search(r"statement\s*[:\-]\s*(.+)", body, re.IGNORECASE)
+     candidate = statement_match.group(1) if statement_match else body
+     # Remove trailing sections that describe vision or reason metadata.
+     candidate = re.split(r"(?i)needs\s*vision|reason\s*[:\-]", candidate)[0]
+     candidate = candidate.strip().strip(".")
+     if not candidate:
+         return None
+     return _clean_sentence(candidate)
+
+
+ def _extract_needs_vision(body: str) -> bool:
+     match = _NEEDS_VISION_RE.search(body)
+     if not match:
+         return True
+     token = match.group("value").strip().lower()
+     if token in {"not required", "unnecessary"}:
+         return False
+     if token in {"required", "necessary"}:
+         return True
+     return _to_bool(token)
+
+
+ def _extract_reason(body: str) -> str | None:
+     match = _REASON_RE.search(body)
+     if match:
+         reason = match.group("value").strip()
+         reason = re.split(r"(?i)needs\s*vision", reason)[0].strip()
+         reason = reason.rstrip(".")
+         return reason or None
+     because_match = re.search(r"because\s+(.+?)(?:\.|$)", body, re.IGNORECASE)
+     if because_match:
+         reason = because_match.group(1).strip().rstrip(".")
+         return reason or None
+     return None
+
+
+ def _parse_step_block(index_guess: int, body: str) -> ReasoningStep | None:
+     statement = _extract_statement(body)
+     if not statement:
+         return None
+     needs_vision = _extract_needs_vision(body)
+     reason = _extract_reason(body)
+     index = index_guess if index_guess > 0 else 1
+     return ReasoningStep(index=index, statement=statement, needs_vision=needs_vision, reason=reason)
+
+
+ def _parse_reasoning_from_text(response_text: str, max_steps: int) -> List[ReasoningStep]:
+     text = _trim_reasoning_text(response_text)
+     text = _normalize_step_markers(text)
+     matches = list(_STEP_MARKER_RE.finditer(text))
+     if not matches:
+         return []
+     steps_map: dict[int, ReasoningStep] = {}
+     ordering: List[int] = []
+     fallback_index = 1
+     for idx, marker in enumerate(matches):
+         start = marker.end()
+         end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
+         body = text[start:end].strip()
+         if not body:
+             continue
+         raw_index = marker.group(1) or marker.group(2)
+         try:
+             index_guess = int(raw_index) if raw_index else fallback_index
+         except (TypeError, ValueError):
+             index_guess = fallback_index
+         if raw_index is None:
+             fallback_index += 1
+         step = _parse_step_block(index_guess, body)
+         if step is None:
+             continue
+         if step.index not in steps_map:
+             ordering.append(step.index)
+             steps_map[step.index] = step
+         if len(ordering) >= max_steps:
+             break
+     return [steps_map[idx] for idx in ordering[:max_steps]]
+
+
+ def _looks_like_meta_statement(statement: str) -> bool:
+     lowered = statement.lower()
+     if any(token in lowered for token in _META_TOKENS) and "step" in lowered:
+         return True
+     if lowered.startswith(("maybe", "wait", "let's", "lets")):
+         return True
+     if len(statement) > 260 and "step" in lowered:
+         return True
+     return False
+
+
+ def _prune_steps(steps: List[ReasoningStep]) -> List[ReasoningStep]:
+     filtered: List[ReasoningStep] = []
+     seen_statements: set[str] = set()
+     for step in steps:
+         normalized = step.statement.strip().lower()
+         if _looks_like_meta_statement(step.statement):
+             continue
+         if normalized in seen_statements:
+             continue
+         seen_statements.add(normalized)
+         filtered.append(step)
+     return filtered or steps
+
+
+ def _extract_description(text: str, start_index: int) -> str | None:
+     boundary = max(text.rfind("\n", 0, start_index), text.rfind(".", 0, start_index))
+     if boundary == -1:
+         boundary = 0
+     snippet = text[boundary:start_index].strip(" \n.:–-")
+     if not snippet:
+         return None
+     return _clean_sentence(snippet)
+
+
+ def _parse_roi_from_text(response_text: str, default_step_index: int) -> List[GroundedEvidence]:
+     evidences: List[GroundedEvidence] = []
+     seen: set[tuple[float, float, float, float]] = set()
+     for match in _BOX_RE.finditer(response_text):
+         coords_str = match.group(0).strip("[]")
+         try:
+             coords = [float(part.strip()) for part in coords_str.split(",")]
+         except ValueError:
+             continue
+         if len(coords) != 4:
+             continue
+         try:
+             bbox = _normalize_bbox(coords)
+         except ValueError:
+             continue
+         key = tuple(round(c, 4) for c in bbox)
+         if key in seen:
+             continue
+         description = _extract_description(response_text, match.start())
+         evidences.append(
+             GroundedEvidence(
+                 step_index=default_step_index,
+                 bbox=bbox,
+                 description=description,
+                 confidence=None,
+                 raw_source={"bbox": coords, "description": description},
+             )
+         )
+         seen.add(key)
+     return evidences
+
+
+ def parse_structured_reasoning(response_text: str, max_steps: int) -> List[ReasoningStep]:
+     """Parse Qwen3-VL structured reasoning output into dataclasses."""
+
+     try:
+         payload = _load_first_json(response_text)
+     except ValueError as json_error:
+         steps = _parse_reasoning_from_text(response_text, max_steps=max_steps)
+         if steps:
+             return _prune_steps(steps)[:max_steps]
+         raise json_error
+     if not isinstance(payload, list):
+         raise ValueError("Structured reasoning response must be a JSON list.")
+
+     steps: List[ReasoningStep] = []
+     for idx, item in enumerate(payload, start=1):
+         if not isinstance(item, dict):
+             continue
+         statement = item.get("statement") or item.get("step") or item.get("text")
+         if not isinstance(statement, str):
+             continue
+         statement = statement.strip()
+         if not statement:
+             continue
+         step_index = item.get("index")
+         if not isinstance(step_index, int):
+             step_index = idx
+         needs_vision = _to_bool(item.get("needs_vision") or item.get("requires_vision"))
+         reason = item.get("reason") or item.get("justification")
+         if isinstance(reason, str):
+             reason = reason.strip() or None
+         else:
+             reason = None
+         steps.append(ReasoningStep(index=step_index, statement=statement, needs_vision=needs_vision, reason=reason))
+         if len(steps) >= max_steps:
+             break
+     steps = _prune_steps(steps)[:max_steps]
+     if not steps:
+         raise ValueError("No reasoning steps parsed from response.")
+     return steps
+
+
+ def _normalize_bbox(bbox: Any) -> tuple[float, float, float, float]:
+     if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
+         raise ValueError(f"Bounding box must be a list of 4 numbers, got {bbox!r}")
+     coords = []
+     for raw in bbox:
+         if isinstance(raw, str):
+             raw = raw.strip()
+             if not raw:
+                 raw = 0
+             else:
+                 raw = float(raw)
+         elif isinstance(raw, (int, float)):
+             raw = float(raw)
+         else:
+             raw = 0.0
+         coords.append(raw)
+     scale = max(abs(v) for v in coords) if coords else 1.0
+     if scale > 1.5:  # assume 0..1000 or pixel coordinates
+         coords = [max(0.0, min(v / 1000.0, 1.0)) for v in coords]
+     else:
+         coords = [max(0.0, min(v, 1.0)) for v in coords]
+     x1, y1, x2, y2 = coords
+     x_min, x_max = sorted((x1, x2))
+     y_min, y_max = sorted((y1, y2))
+     return (x_min, y_min, x_max, y_max)
+
+
+ def parse_roi_evidence(response_text: str, default_step_index: int) -> List[GroundedEvidence]:
+     """Parse ROI grounding output into evidence structures."""
+
+     try:
+         payload = _load_first_json(response_text)
+     except ValueError:
+         return _parse_roi_from_text(response_text, default_step_index=default_step_index)
+     if not isinstance(payload, list):
+         raise ValueError("ROI extraction response must be a JSON list.")
+
+     evidences: List[GroundedEvidence] = []
+     for item in payload:
+         if not isinstance(item, dict):
+             continue
+         raw_bbox = item.get("bbox") or item.get("bbox_2d") or item.get("box")
+         if raw_bbox is None:
+             continue
+         try:
+             bbox = _normalize_bbox(raw_bbox)
+         except ValueError:
+             continue
+         step_index = item.get("step") or item.get("step_index") or default_step_index
+         if not isinstance(step_index, int):
+             step_index = default_step_index
+         description = item.get("description") or item.get("caption") or item.get("detail")
+         if isinstance(description, str):
+             description = description.strip() or None
+         else:
+             description = None
+         confidence = item.get("confidence") or item.get("score") or item.get("probability")
+         if isinstance(confidence, str):
+             confidence = confidence.strip()
+             confidence = float(confidence) if confidence else None
+         elif isinstance(confidence, (int, float)):
+             confidence = float(confidence)
+         else:
+             confidence = None
+         evidences.append(
+             GroundedEvidence(
+                 step_index=step_index,
+                 bbox=bbox,
+                 description=description,
+                 confidence=confidence,
+                 raw_source=item,
+             )
+         )
+     return evidences
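A short sketch of the two ROI parsing paths above, one JSON reply and one free-text fallback; both reply strings are illustrative stand-ins for model output.

```python
from corgi.parsers import parse_roi_evidence

# JSON-style reply: boxes on the 0-1000 grid are rescaled to normalized 0-1 coordinates.
json_reply = '[{"step": 2, "bbox": [120, 250, 180, 320], "description": "white watch", "confidence": 0.9}]'
evidences = parse_roi_evidence(json_reply, default_step_index=2)
print(evidences[0].bbox)  # (0.12, 0.25, 0.18, 0.32)

# Free-text fallback: a bare [x1, y1, x2, y2] span is still picked up, and the
# preceding snippet becomes the description.
text_reply = "The watch is visible at [120, 250, 180, 320] on the left wrist"
evidences = parse_roi_evidence(text_reply, default_step_index=2)
print(evidences[0].bbox, evidences[0].description)
```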
corgi/pipeline.py ADDED
@@ -0,0 +1,92 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import List, Protocol
+
+ from PIL import Image
+
+ from .types import (
+     GroundedEvidence,
+     ReasoningStep,
+     evidences_to_serializable,
+     steps_to_serializable,
+ )
+
+
+ class SupportsQwenClient(Protocol):
+     """Protocol describing the methods required from a Qwen3-VL client."""
+
+     def structured_reasoning(self, image: Image.Image, question: str, max_steps: int) -> List[ReasoningStep]:
+         ...
+
+     def extract_step_evidence(
+         self,
+         image: Image.Image,
+         question: str,
+         step: ReasoningStep,
+         max_regions: int,
+     ) -> List[GroundedEvidence]:
+         ...
+
+     def synthesize_answer(
+         self,
+         image: Image.Image,
+         question: str,
+         steps: List[ReasoningStep],
+         evidences: List[GroundedEvidence],
+     ) -> str:
+         ...
+
+
+ @dataclass(frozen=True)
+ class PipelineResult:
+     """Aggregated output of the CoRGI pipeline."""
+
+     question: str
+     steps: List[ReasoningStep]
+     evidence: List[GroundedEvidence]
+     answer: str
+
+     def to_json(self) -> dict:
+         return {
+             "question": self.question,
+             "steps": steps_to_serializable(self.steps),
+             "evidence": evidences_to_serializable(self.evidence),
+             "answer": self.answer,
+         }
+
+
+ class CoRGIPipeline:
+     """Orchestrates the CoRGI reasoning pipeline using a Qwen3-VL client."""
+
+     def __init__(self, vlm_client: SupportsQwenClient):
+         if vlm_client is None:
+             raise ValueError("A Qwen3-VL client instance must be provided.")
+         self._vlm = vlm_client
+
+     def run(
+         self,
+         image: Image.Image,
+         question: str,
+         max_steps: int = 4,
+         max_regions: int = 4,
+     ) -> PipelineResult:
+         steps = self._vlm.structured_reasoning(image=image, question=question, max_steps=max_steps)
+         evidences: List[GroundedEvidence] = []
+         for step in steps:
+             if not step.needs_vision:
+                 continue
+             step_evs = self._vlm.extract_step_evidence(
+                 image=image,
+                 question=question,
+                 step=step,
+                 max_regions=max_regions,
+             )
+             if not step_evs:
+                 continue
+             evidences.extend(step_evs[:max_regions])
+         answer = self._vlm.synthesize_answer(image=image, question=question, steps=steps, evidences=evidences)
+         return PipelineResult(question=question, steps=steps, evidence=evidences, answer=answer)
+
+
+ __all__ = ["CoRGIPipeline", "PipelineResult"]
corgi/qwen_client.py ADDED
@@ -0,0 +1,176 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import List, Optional
+
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ from .parsers import parse_roi_evidence, parse_structured_reasoning
+ from .types import GroundedEvidence, ReasoningStep
+
+
+ DEFAULT_REASONING_PROMPT = (
+     "You are a careful multimodal reasoner following the CoRGI protocol. "
+     "Given the question and the image, produce a JSON array of reasoning steps. "
+     "Each item must contain the keys: index (1-based integer), statement (concise sentence), "
+     "needs_vision (boolean true if the statement requires visual verification), and reason "
+     "(short phrase explaining why visual verification is or is not required). "
+     "Limit the number of steps to {max_steps}. Respond with JSON only; start the reply with '[' and end with ']'. "
+     "Do not add any commentary or prose outside of the JSON."
+ )
+
+ DEFAULT_GROUNDING_PROMPT = (
+     "You are validating the following reasoning step:\n"
+     "{step_statement}\n"
+     "Return a JSON array with up to {max_regions} region candidates that help verify the step. "
+     "Each object must include: step (integer), bbox (list of four numbers x1,y1,x2,y2, "
+     "either normalized 0-1 or scaled 0-1000), description (short textual evidence), "
+     "and confidence (0-1). Use [] if no relevant region exists. "
+     "Respond with JSON only; do not include explanations outside the JSON array."
+ )
+
+ DEFAULT_ANSWER_PROMPT = (
+     "You are finalizing the answer using verified evidence. "
+     "Question: {question}\n"
+     "Structured reasoning steps:\n"
+     "{steps}\n"
+     "Verified evidence items:\n"
+     "{evidence}\n"
+     "Respond with a concise final answer sentence grounded in the evidence. "
+     "If unsure, say you are uncertain. Do not include <think> tags or internal monologue."
+ )
+
+
+ def _format_steps_for_prompt(steps: List[ReasoningStep]) -> str:
+     return "\n".join(
+         f"{step.index}. {step.statement} (needs vision: {step.needs_vision})"
+         for step in steps
+     )
+
+
+ def _format_evidence_for_prompt(evidences: List[GroundedEvidence]) -> str:
+     if not evidences:
+         return "No evidence collected."
+     lines = []
+     for ev in evidences:
+         desc = ev.description or "No description"
+         bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
+         conf = f"{ev.confidence:.2f}" if ev.confidence is not None else "n/a"
+         lines.append(f"Step {ev.step_index}: bbox=({bbox}), conf={conf}, desc={desc}")
+     return "\n".join(lines)
+
+
+ def _strip_think_content(text: str) -> str:
+     if not text:
+         return ""
+     cleaned = text
+     if "</think>" in cleaned:
+         cleaned = cleaned.split("</think>", 1)[-1]
+     cleaned = cleaned.replace("<think>", "")
+     return cleaned.strip()
+
+
+ @dataclass
+ class QwenGenerationConfig:
+     model_id: str = "Qwen/Qwen3-VL-8B-Thinking"
+     max_new_tokens: int = 512
+     temperature: float | None = None
+     do_sample: bool = False
+
+
+ class Qwen3VLClient:
+     """Wrapper around transformers Qwen3-VL chat API for CoRGI pipeline."""
+
+     def __init__(
+         self,
+         config: Optional[QwenGenerationConfig] = None,
+     ) -> None:
+         self.config = config or QwenGenerationConfig()
+         torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+         self._model = AutoModelForImageTextToText.from_pretrained(
+             self.config.model_id,
+             torch_dtype=torch_dtype,
+             device_map="auto",
+         )
+         self._processor = AutoProcessor.from_pretrained(self.config.model_id)
+
+     def _chat(
+         self,
+         image: Image.Image,
+         prompt: str,
+         max_new_tokens: Optional[int] = None,
+     ) -> str:
+         messages = [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "image", "image": image},
+                     {"type": "text", "text": prompt},
+                 ],
+             }
+         ]
+         chat_prompt = self._processor.apply_chat_template(
+             messages,
+             add_generation_prompt=True,
+             tokenize=False,
+         )
+         inputs = self._processor(
+             text=[chat_prompt],
+             images=[image],
+             return_tensors="pt",
+         ).to(self._model.device)
+         gen_kwargs = {
+             "max_new_tokens": max_new_tokens or self.config.max_new_tokens,
+             "do_sample": self.config.do_sample,
+         }
+         if self.config.do_sample and self.config.temperature is not None:
+             gen_kwargs["temperature"] = self.config.temperature
+         output_ids = self._model.generate(**inputs, **gen_kwargs)
+         prompt_length = inputs.input_ids.shape[1]
+         generated_tokens = output_ids[:, prompt_length:]
+         response = self._processor.batch_decode(
+             generated_tokens,
+             skip_special_tokens=True,
+             clean_up_tokenization_spaces=False,
+         )[0]
+         return response.strip()
+
+     def structured_reasoning(self, image: Image.Image, question: str, max_steps: int) -> List[ReasoningStep]:
+         prompt = DEFAULT_REASONING_PROMPT.format(max_steps=max_steps) + f"\nQuestion: {question}"
+         response = self._chat(image=image, prompt=prompt)
+         return parse_structured_reasoning(response, max_steps=max_steps)
+
+     def extract_step_evidence(
+         self,
+         image: Image.Image,
+         question: str,
+         step: ReasoningStep,
+         max_regions: int,
+     ) -> List[GroundedEvidence]:
+         prompt = DEFAULT_GROUNDING_PROMPT.format(
+             step_statement=step.statement,
+             max_regions=max_regions,
+         )
+         response = self._chat(image=image, prompt=prompt, max_new_tokens=256)
+         evidences = parse_roi_evidence(response, default_step_index=step.index)
+         return evidences[:max_regions]
+
+     def synthesize_answer(
+         self,
+         image: Image.Image,
+         question: str,
+         steps: List[ReasoningStep],
+         evidences: List[GroundedEvidence],
+     ) -> str:
+         prompt = DEFAULT_ANSWER_PROMPT.format(
+             question=question,
+             steps=_format_steps_for_prompt(steps),
+             evidence=_format_evidence_for_prompt(evidences),
+         )
+         response = self._chat(image=image, prompt=prompt, max_new_tokens=256)
+         return _strip_think_content(response)
+
+
+ __all__ = ["Qwen3VLClient", "QwenGenerationConfig"]
corgi/types.py ADDED
@@ -0,0 +1,61 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Dict, List, Optional, Tuple
+
+
+ BBox = Tuple[float, float, float, float]
+
+
+ @dataclass(frozen=True)
+ class ReasoningStep:
+     """Represents a single structured reasoning step."""
+
+     index: int
+     statement: str
+     needs_vision: bool
+     reason: Optional[str] = None
+
+
+ @dataclass(frozen=True)
+ class GroundedEvidence:
+     """Evidence item grounded to a region of interest in the image."""
+
+     step_index: int
+     bbox: BBox
+     description: Optional[str] = None
+     confidence: Optional[float] = None
+     raw_source: Optional[Dict[str, object]] = None
+
+
+ def steps_to_serializable(steps: List[ReasoningStep]) -> List[Dict[str, object]]:
+     """Helper to convert steps into JSON-friendly dictionaries."""
+
+     return [
+         {
+             "index": step.index,
+             "statement": step.statement,
+             "needs_vision": step.needs_vision,
+             **({"reason": step.reason} if step.reason is not None else {}),
+         }
+         for step in steps
+     ]
+
+
+ def evidences_to_serializable(evidences: List[GroundedEvidence]) -> List[Dict[str, object]]:
+     """Helper to convert evidences into JSON-friendly dictionaries."""
+
+     serializable: List[Dict[str, object]] = []
+     for ev in evidences:
+         item: Dict[str, object] = {
+             "step_index": ev.step_index,
+             "bbox": list(ev.bbox),
+         }
+         if ev.description is not None:
+             item["description"] = ev.description
+         if ev.confidence is not None:
+             item["confidence"] = ev.confidence
+         if ev.raw_source is not None:
+             item["raw_source"] = ev.raw_source
+         serializable.append(item)
+     return serializable
examples/demo_qwen_corgi.py ADDED
@@ -0,0 +1,85 @@
+ #!/usr/bin/env python
+ """Run CoRGI pipeline on the Qwen3-VL demo image and question.
+
+ Usage:
+     python examples/demo_qwen_corgi.py [--model-id Qwen/Qwen3-VL-8B-Thinking]
+
+ If the demo image cannot be downloaded automatically, set the environment
+ variable `CORGI_DEMO_IMAGE` to a local file path.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ from io import BytesIO
+ from pathlib import Path
+ from urllib.request import urlopen
+
+ from PIL import Image
+
+ from corgi.pipeline import CoRGIPipeline
+ from corgi.qwen_client import Qwen3VLClient, QwenGenerationConfig
+
+ DEMO_IMAGE_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
+ DEMO_QUESTION = "How many people are there in the image? Is there any one who is wearing a white watch?"
+
+
+ def fetch_demo_image() -> Image.Image:
+     if path := os.getenv("CORGI_DEMO_IMAGE"):
+         return Image.open(path).convert("RGB")
+     with urlopen(DEMO_IMAGE_URL) as resp:  # nosec B310 - trusted URL from official demo asset
+         data = resp.read()
+     return Image.open(BytesIO(data)).convert("RGB")
+
+
+ def format_steps(pipeline_result) -> str:
+     lines = ["Reasoning steps:"]
+     for step in pipeline_result.steps:
+         needs = "yes" if step.needs_vision else "no"
+         reason = f" (reason: {step.reason})" if step.reason else ""
+         lines.append(f" [{step.index}] {step.statement} — needs vision: {needs}{reason}")
+     return "\n".join(lines)
+
+
+ def format_evidence(pipeline_result) -> str:
+     lines = ["Visual evidence:"]
+     if not pipeline_result.evidence:
+         lines.append(" (no evidence returned)")
+         return "\n".join(lines)
+     for ev in pipeline_result.evidence:
+         bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
+         desc = ev.description or "(no description)"
+         conf = f", conf={ev.confidence:.2f}" if ev.confidence is not None else ""
+         lines.append(f" Step {ev.step_index}: bbox=({bbox}), desc={desc}{conf}")
+     return "\n".join(lines)
+
+
+ def main() -> int:
+     parser = argparse.ArgumentParser(description="Run CoRGI pipeline with the real Qwen3-VL model.")
+     parser.add_argument("--model-id", default="Qwen/Qwen3-VL-8B-Thinking", help="Hugging Face model id for Qwen3-VL")
+     parser.add_argument("--max-steps", type=int, default=4)
+     parser.add_argument("--max-regions", type=int, default=4)
+     args = parser.parse_args()
+
+     image = fetch_demo_image()
+     client = Qwen3VLClient(QwenGenerationConfig(model_id=args.model_id))
+     pipeline = CoRGIPipeline(client)
+
+     result = pipeline.run(
+         image=image,
+         question=DEMO_QUESTION,
+         max_steps=args.max_steps,
+         max_regions=args.max_regions,
+     )
+
+     print(f"Question: {DEMO_QUESTION}")
+     print(format_steps(result))
+     print(format_evidence(result))
+     print("Answer:", result.answer)
+
+     return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ accelerate>=0.34
+ transformers>=4.45
+ pillow
+ torch
+ gradio>=4.44
+ hydra-core
+ antlr4-python3-runtime