dung-vpt-uney committed · Commit b6a01d6 · Parent: b9f8b29
Deploy latest CoRGI Gradio demo
- PROGRESS_LOG.md +33 -0
- PROJECT_PLAN.md +51 -0
- QWEN_INFERENCE_NOTES.md +19 -0
- README.md +12 -11
- app.py +10 -0
- corgi/__init__.py +11 -0
- corgi/__pycache__/__init__.cpython-310.pyc +0 -0
- corgi/__pycache__/__init__.cpython-312.pyc +0 -0
- corgi/__pycache__/__init__.cpython-313.pyc +0 -0
- corgi/__pycache__/cli.cpython-312.pyc +0 -0
- corgi/__pycache__/cli.cpython-313.pyc +0 -0
- corgi/__pycache__/gradio_app.cpython-312.pyc +0 -0
- corgi/__pycache__/gradio_app.cpython-313.pyc +0 -0
- corgi/__pycache__/parsers.cpython-310.pyc +0 -0
- corgi/__pycache__/parsers.cpython-312.pyc +0 -0
- corgi/__pycache__/parsers.cpython-313.pyc +0 -0
- corgi/__pycache__/pipeline.cpython-310.pyc +0 -0
- corgi/__pycache__/pipeline.cpython-312.pyc +0 -0
- corgi/__pycache__/pipeline.cpython-313.pyc +0 -0
- corgi/__pycache__/qwen_client.cpython-312.pyc +0 -0
- corgi/__pycache__/qwen_client.cpython-313.pyc +0 -0
- corgi/__pycache__/types.cpython-310.pyc +0 -0
- corgi/__pycache__/types.cpython-312.pyc +0 -0
- corgi/__pycache__/types.cpython-313.pyc +0 -0
- corgi/cli.py +131 -0
- corgi/gradio_app.py +166 -0
- corgi/parsers.py +390 -0
- corgi/pipeline.py +92 -0
- corgi/qwen_client.py +176 -0
- corgi/types.py +61 -0
- examples/demo_qwen_corgi.py +85 -0
- requirements.txt +7 -0
PROGRESS_LOG.md
ADDED
@@ -0,0 +1,33 @@
# CoRGI Custom Demo — Progress Log

> Keep this log short and chronological. Newest updates at the top.

## 2024-10-22
- Reproduced the CoRGI pipeline failure with the real `Qwen/Qwen3-VL-8B-Thinking` checkpoint and traced it to reasoning outputs that only use ordinal step words.
- Taught the text parser to normalize “First/Second step” style markers into numeric indices, refreshed the unit tests to cover the new heuristic, and reran the demo/end-to-end pipeline successfully.
- Tidied Qwen generation settings to avoid unused temperature flags when running deterministically.
- Validated ROI extraction on a vision-heavy prompt against the real model and hardened prompts so responses stay in structured JSON without verbose preambles.
- Added meta-comment pruning so thinking-mode rambles (e.g., redundant “Step 3” reflections) are dropped while preserving genuine reasoning; confirmed with the official demo image that only meaningful steps remain.

## 2024-10-21
- Updated default checkpoints to `Qwen/Qwen3-VL-8B-Thinking` and verified CLI/Gradio/test coverage.
- Exercised the real model to capture thinking-style outputs; added parser fallbacks for textual reasoning/ROI responses and stripped `<think>` tags from answer synthesis.
- Extended unit test suite (reasoning, ROI, client helpers) to cover the new parsing paths and ran `pytest` successfully.

## 2024-10-20
- Added optional integration test (`corgi_tests/test_integration_qwen.py`) gated by `CORGI_RUN_QWEN_INTEGRATION` for running the real Qwen3-VL model on the official demo asset.
- Created runnable example script (`examples/demo_qwen_corgi.py`) to reproduce the Hugging Face demo prompt locally with structured pipeline logging.
- Published Hugging Face Space harness (`app.py`) and deployment helper (`scripts/push_space.sh`) including requirements for ZeroGPU tier.
- Documented cookbook alignment and inference tips (`QWEN_INFERENCE_NOTES.md`).
- Added CLI runner (`corgi.cli`) with formatting helpers plus JSON export; authored matching unittest coverage.
- Implemented Gradio demo harness (`corgi.gradio_app`) with markdown reporting and helper utilities for dependency injection.
- Expanded unit test suite (CLI + Gradio) and ran `pytest corgi_tests` successfully (1 skip when gradio missing).
- Initialized structured project plan and progress log scaffolding.
- Assessed existing modules (`corgi.pipeline`, `corgi.qwen_client`, parsers, tests) to identify pending demo features (CLI + Gradio).
- Confirmed Qwen3-VL will be the single backbone for reasoning, ROI verification, and answer synthesis.

<!-- Template for future updates:
## YYYY-MM-DD
- Summary of change / milestone.
- Follow-up actions.
-->
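The 2024-10-22 entries above describe normalizing ordinal step words and falling back to plain-text parsing when the Thinking checkpoint skips JSON. A minimal sketch of what that fallback accepts, using `parse_structured_reasoning` from `corgi/parsers.py` (added later in this commit); the sample response text is invented for illustration, not a recorded model reply:

```python
from corgi.parsers import parse_structured_reasoning

# Hypothetical thinking-mode output that uses ordinal words instead of JSON.
response = (
    "First step: identify the vehicles near the crosswalk. Needs vision: yes. "
    "Reason: requires looking at the image.\n"
    "Second step: count how many are taxis. Needs vision: yes. Reason: visual count."
)

steps = parse_structured_reasoning(response, max_steps=4)
for step in steps:
    print(step.index, step.statement, step.needs_vision)
# Expected: two ReasoningStep entries (indices 1 and 2), both needs_vision=True.
```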
PROJECT_PLAN.md
ADDED
@@ -0,0 +1,51 @@
# CoRGI Custom Demo — Project Plan

## Context
- **Objective**: ship a runnable CoRGI demo (CLI + Gradio) powered entirely by Qwen3-VL for structured reasoning, ROI evidence extraction, and answer synthesis.
- **Scope**: stay within the `corgi_custom` package, reuse Qwen3-VL cookbooks where possible, keep dependency footprint minimal (no extra detectors/rerankers).
- **Environment**: Conda env `pytorch`, default VLM `Qwen/Qwen3-VL-8B-Thinking`.

## Milestones
| Status | Milestone | Notes |
| --- | --- | --- |
| ✅ | Core pipeline skeleton (dataclasses, parsers, Qwen client wrappers) | Already merged in repo. |
| ✅ | Project documentation & progress tracking scaffolding | Plan + progress log committed. |
| ✅ | CLI runner that prints step-by-step pipeline output | Supports overrides + JSON export. |
| ✅ | Gradio demo mirroring CLI functionality | Blocks UI with markdown report messaging. |
| ✅ | Automated tests for new modules | CLI + Gradio helpers covered with unit tests. |
| ✅ | HF Space deployment automation | Bash script + app harness for ZeroGPU Spaces. |
| 🟡 | Final verification (unit tests, smoke instructions) | Document how to run `pytest` and the demos. |

## Work Breakdown Structure
1. **Docs & Tracking**
   - [x] Finalize plan and progress log templates.
   - [x] Document environment setup expectations.
2. **Pipeline UX**
   - [x] Implement CLI entrypoint (`corgi.cli:main`).
   - [x] Provide structured stdout for steps/evidence/answer.
   - [x] Allow optional JSON dump for downstream tooling.
3. **Interactive Demo**
   - [x] Build Gradio app harness (image upload + question textbox).
   - [ ] Stream progress (optional) and display textual reasoning/evidence.
   - [x] Handle model loading errors gracefully.
4. **Testing & Tooling**
   - [x] Add fixture-friendly helpers to avoid heavy model loads in tests.
   - [x] Write unit tests for CLI argument parsing + formatting.
   - [ ] Add regression test for pipeline serialization.
5. **Docs & Hand-off**
   - [ ] Update README/demo instructions.
   - [ ] Provide sample command sequences for CLI/Gradio.
   - [ ] Capture open risks & future enhancements.
6. **Deployment & Ops**
   - [x] Add Hugging Face Space entrypoint (`app.py`).
   - [x] Write deployment helper script (`scripts/push_space.sh`).
   - [ ] Add automated checklists/logs for Space updates.

## Risks & Mitigations
- **Model loading latency / VRAM** → expose config knobs and mention 4B fallback.
- **Parsing drift from Qwen outputs** → keep parser tolerant; add debug flag to dump raw responses.
- **Test runtime** → mock Qwen client via fixtures; avoid loading the real model in unit tests (a stub-client sketch follows this plan).

## Progress Tracking
- Refer to `PROGRESS_LOG.md` for dated status updates.
- Update milestone table whenever a deliverable completes.
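The “Test runtime” mitigation above calls for mocking the Qwen client. A minimal sketch of such a fixture-friendly stub, valid against the `SupportsQwenClient` protocol and dataclasses added in this commit (`corgi/pipeline.py`, `corgi/types.py`); the canned step, box, and answer values are invented:

```python
from PIL import Image

from corgi.pipeline import CoRGIPipeline
from corgi.types import GroundedEvidence, ReasoningStep


class StubQwenClient:
    """Returns canned outputs so unit tests never load the real model."""

    def structured_reasoning(self, image, question, max_steps):
        return [ReasoningStep(index=1, statement="Locate the dog.", needs_vision=True)]

    def extract_step_evidence(self, image, question, step, max_regions):
        return [GroundedEvidence(step_index=step.index, bbox=(0.1, 0.2, 0.5, 0.8), description="dog")]

    def synthesize_answer(self, image, question, steps, evidences):
        return "A corgi is sitting on the grass."


pipeline = CoRGIPipeline(vlm_client=StubQwenClient())
result = pipeline.run(image=Image.new("RGB", (64, 64)), question="What animal is shown?")
assert result.answer == "A corgi is sitting on the grass."
```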
QWEN_INFERENCE_NOTES.md
ADDED
@@ -0,0 +1,19 @@
# Qwen3-VL Cookbook Alignment

This project mirrors the official Qwen3-VL cookbook patterns (see `../Qwen3-VL/cookbooks`) when running the CoRGI pipeline with the real model.

## Key Parallels
- **Model + Processor loading**: We rely on `AutoModelForImageTextToText` and `AutoProcessor` exactly as described in the main README and cookbook notebooks such as `think_with_images.ipynb`.
- **Chat template**: `Qwen3VLClient` uses `processor.apply_chat_template(..., add_generation_prompt=True)` before calling `generate`, which matches the recommended multi-turn messaging flow.
- **Image transport**: Both the pipeline and demo scripts accept PIL images and ensure conversion to RGB prior to inference, mirroring cookbook utilities that normalize channels.
- **Max tokens & decoding**: Default `max_new_tokens=512` with deterministic greedy decoding (`do_sample=False`; `temperature` is only passed when sampling is enabled) matches cookbook demos favouring reproducible outputs for evaluation.
- **Single-model pipeline**: All stages (reasoning, ROI extraction, answer synthesis) are executed by the same Qwen3-VL instance, following the cookbook philosophy of leveraging the model’s intrinsic grounding capability without external detectors.

## Practical Tips for Local Inference
- Use the `pytorch` Conda env with the latest `transformers` (>=4.45) to access `AutoModelForImageTextToText` support, as advised in the cookbook README.
- When VRAM is limited, switch to `Qwen/Qwen3-VL-4B-Instruct` via `--model-id` or the `CORGI_QWEN_MODEL` environment variable; no other code changes are needed (see the sketch after these notes).
- The integration test (`corgi_tests/test_integration_qwen.py`) and demo (`examples/demo_qwen_corgi.py`) download the official demo image if `CORGI_DEMO_IMAGE` is not supplied, matching cookbook notebooks that reference the same asset URL.
- For reproducibility, set `HF_HOME` (or use the cookbook’s `snapshot_download`) to manage local caches and avoid repeated downloads.
- The `Qwen/Qwen3-VL-8B-Thinking` checkpoint often emits free-form “thinking” text instead of JSON; the pipeline now falls back to parsing those narratives for step and ROI extraction, and strips `<think>…</think>` scaffolding from final answers.

These notes ensure our CoRGI adaptation stays consistent with the official Qwen workflow while keeping the codebase modular for experimentation.
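A minimal sketch of the 4B fallback described above, built on `QwenGenerationConfig` and `Qwen3VLClient` from `corgi/qwen_client.py` (added in this commit). Whether `CORGI_QWEN_MODEL` is honoured depends on the calling script, so reading it here only illustrates the intended pattern:

```python
import os

from corgi import CoRGIPipeline
from corgi.qwen_client import Qwen3VLClient, QwenGenerationConfig

# Pick the smaller checkpoint when VRAM is tight; CORGI_QWEN_MODEL overrides it if set.
model_id = os.environ.get("CORGI_QWEN_MODEL", "Qwen/Qwen3-VL-4B-Instruct")
config = QwenGenerationConfig(model_id=model_id)

# Constructing the client downloads/loads the checkpoint.
pipeline = CoRGIPipeline(vlm_client=Qwen3VLClient(config=config))
```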
README.md
CHANGED
@@ -1,12 +1,13 @@
- ---
- title: Corgi Qwen3 Vl Demo
- emoji: 😻
- colorFrom: green
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
- pinned: false
- ---

-
+ # CoRGI Qwen3-VL Demo
+ This Space hosts the CoRGI reasoning pipeline backed by the Qwen/Qwen3-VL-8B-Thinking model.
+
+ ## Run Locally
+ ```
+ pip install -r requirements.txt
+ python examples/demo_qwen_corgi.py
+ ```
+
+ ## Notes
+ - The demo queues requests sequentially (ZeroGPU/cpu-basic hardware).
+ - Configure `CORGI_QWEN_MODEL` to switch to a different checkpoint.
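The CLI and Gradio entry points added in this commit can also be driven directly from Python; a sketch of the sample command sequences the plan still lists as open (the image path, question, and output path are placeholders):

```python
from corgi.cli import main
from corgi.gradio_app import launch_demo

# CLI: prints steps, evidence, and the final answer, and writes a JSON dump.
main([
    "--image", "demo.jpg",                      # placeholder path
    "--question", "What is the man holding?",   # placeholder question
    "--json-out", "result.json",
])

# Gradio: launches the same Blocks UI that app.py serves on the Space.
launch_demo(share=False)
```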
app.py
ADDED
@@ -0,0 +1,10 @@
"""Hugging Face Spaces entrypoint for the CoRGI Qwen3-VL demo."""

from corgi.gradio_app import build_demo


demo = build_demo()
demo.queue(concurrency_count=1)

if __name__ == "__main__":
    demo.launch()
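One caveat worth noting: `queue(concurrency_count=1)` follows the Gradio 3.x signature, while the Space metadata pins `sdk_version: 5.49.1`; in Gradio 4 and later the keyword was replaced by `default_concurrency_limit`. A minimal sketch of the equivalent call, assuming Gradio >= 4:

```python
from corgi.gradio_app import build_demo

demo = build_demo()
# Gradio >= 4 replaced `concurrency_count` with `default_concurrency_limit`.
demo.queue(default_concurrency_limit=1)

if __name__ == "__main__":
    demo.launch()
```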
corgi/__init__.py
ADDED
@@ -0,0 +1,11 @@
"""CoRGI pipeline package using Qwen3-VL."""

from .pipeline import CoRGIPipeline, PipelineResult
from .types import GroundedEvidence, ReasoningStep

__all__ = [
    "CoRGIPipeline",
    "PipelineResult",
    "GroundedEvidence",
    "ReasoningStep",
]
corgi/__pycache__/__init__.cpython-310.pyc
ADDED
Binary file (413 Bytes).
corgi/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (419 Bytes).
corgi/__pycache__/__init__.cpython-313.pyc
ADDED
Binary file (419 Bytes).
corgi/__pycache__/cli.cpython-312.pyc
ADDED
Binary file (6.41 kB).
corgi/__pycache__/cli.cpython-313.pyc
ADDED
Binary file (6.39 kB).
corgi/__pycache__/gradio_app.cpython-312.pyc
ADDED
Binary file (8.03 kB).
corgi/__pycache__/gradio_app.cpython-313.pyc
ADDED
Binary file (8.24 kB).
corgi/__pycache__/parsers.cpython-310.pyc
ADDED
Binary file (4.61 kB).
corgi/__pycache__/parsers.cpython-312.pyc
ADDED
Binary file (18.1 kB).
corgi/__pycache__/parsers.cpython-313.pyc
ADDED
Binary file (18.8 kB).
corgi/__pycache__/pipeline.cpython-310.pyc
ADDED
Binary file (3.13 kB).
corgi/__pycache__/pipeline.cpython-312.pyc
ADDED
Binary file (3.86 kB).
corgi/__pycache__/pipeline.cpython-313.pyc
ADDED
Binary file (3.97 kB).
corgi/__pycache__/qwen_client.cpython-312.pyc
ADDED
Binary file (9.01 kB).
corgi/__pycache__/qwen_client.cpython-313.pyc
ADDED
Binary file (9.13 kB).
corgi/__pycache__/types.cpython-310.pyc
ADDED
Binary file (2.16 kB).
corgi/__pycache__/types.cpython-312.pyc
ADDED
Binary file (2.67 kB).
corgi/__pycache__/types.cpython-313.pyc
ADDED
Binary file (2.77 kB).
corgi/cli.py
ADDED
@@ -0,0 +1,131 @@
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Callable, Optional, TextIO

from PIL import Image

from .pipeline import CoRGIPipeline
from .qwen_client import Qwen3VLClient, QwenGenerationConfig
from .types import GroundedEvidence, ReasoningStep

DEFAULT_MODEL_ID = "Qwen/Qwen3-VL-8B-Thinking"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="corgi-cli",
        description="Run the CoRGI reasoning pipeline over an image/question pair.",
    )
    parser.add_argument("--image", type=Path, required=True, help="Path to the input image (jpg/png/etc.)")
    parser.add_argument("--question", type=str, required=True, help="Visual question for the image")
    parser.add_argument("--max-steps", type=int, default=4, help="Maximum number of reasoning steps to request")
    parser.add_argument(
        "--max-regions",
        type=int,
        default=4,
        help="Maximum number of grounded regions per visual step",
    )
    parser.add_argument(
        "--model-id",
        type=str,
        default=None,
        help="Optional override for the Qwen3-VL model identifier",
    )
    parser.add_argument(
        "--json-out",
        type=Path,
        default=None,
        help="Optional path to write the pipeline result as JSON",
    )
    return parser


def _format_step(step: ReasoningStep) -> str:
    needs = "yes" if step.needs_vision else "no"
    suffix = f"; reason: {step.reason}" if step.reason else ""
    return f"[{step.index}] {step.statement} (needs vision: {needs}{suffix})"


def _format_evidence_item(evidence: GroundedEvidence) -> str:
    bbox = ", ".join(f"{coord:.2f}" for coord in evidence.bbox)
    parts = [f"Step {evidence.step_index} | bbox=({bbox})"]
    if evidence.description:
        parts.append(f"desc: {evidence.description}")
    if evidence.confidence is not None:
        parts.append(f"conf: {evidence.confidence:.2f}")
    return " | ".join(parts)


def _default_pipeline_factory(model_id: Optional[str]) -> CoRGIPipeline:
    config = QwenGenerationConfig(model_id=model_id or DEFAULT_MODEL_ID)
    client = Qwen3VLClient(config=config)
    return CoRGIPipeline(vlm_client=client)


def execute_cli(
    *,
    image_path: Path,
    question: str,
    max_steps: int,
    max_regions: int,
    model_id: Optional[str],
    json_out: Optional[Path],
    pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
    output_stream: TextIO | None = None,
) -> None:
    if output_stream is None:
        output_stream = sys.stdout
    factory = pipeline_factory or _default_pipeline_factory

    with Image.open(image_path) as img:
        image = img.convert("RGB")
    pipeline = factory(model_id)
    result = pipeline.run(
        image=image,
        question=question,
        max_steps=max_steps,
        max_regions=max_regions,
    )

    print(f"Question: {question}", file=output_stream)
    print("-- Steps --", file=output_stream)
    for step in result.steps:
        print(_format_step(step), file=output_stream)
    if not result.steps:
        print("(no reasoning steps returned)", file=output_stream)

    print("-- Evidence --", file=output_stream)
    if result.evidence:
        for evidence in result.evidence:
            print(_format_evidence_item(evidence), file=output_stream)
    else:
        print("(no visual evidence)", file=output_stream)

    print("-- Answer --", file=output_stream)
    print(f"Answer: {result.answer}", file=output_stream)

    if json_out is not None:
        json_out.parent.mkdir(parents=True, exist_ok=True)
        with json_out.open("w", encoding="utf-8") as handle:
            json.dump(result.to_json(), handle, ensure_ascii=False, indent=2)


def main(argv: Optional[list[str]] = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    execute_cli(
        image_path=args.image,
        question=args.question,
        max_steps=args.max_steps,
        max_regions=args.max_regions,
        model_id=args.model_id,
        json_out=args.json_out,
    )
    return 0


__all__ = ["build_parser", "execute_cli", "main"]
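The CLI formatting helpers are what the unit tests pin down; a small sketch of their output, using invented step and evidence values (`_format_step` and `_format_evidence_item` are private helpers, shown here only for illustration):

```python
from corgi.cli import _format_evidence_item, _format_step
from corgi.types import GroundedEvidence, ReasoningStep

step = ReasoningStep(
    index=1,
    statement="Find the traffic light.",
    needs_vision=True,
    reason="colour must be read from the image",
)
evidence = GroundedEvidence(
    step_index=1,
    bbox=(0.12, 0.05, 0.30, 0.40),
    description="red traffic light",
    confidence=0.91,
)

print(_format_step(step))
# [1] Find the traffic light. (needs vision: yes; reason: colour must be read from the image)
print(_format_evidence_item(evidence))
# Step 1 | bbox=(0.12, 0.05, 0.30, 0.40) | desc: red traffic light | conf: 0.91
```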
corgi/gradio_app.py
ADDED
@@ -0,0 +1,166 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

from PIL import Image

from .cli import DEFAULT_MODEL_ID
from .pipeline import CoRGIPipeline, PipelineResult
from .qwen_client import Qwen3VLClient, QwenGenerationConfig


@dataclass
class PipelineState:
    model_id: str
    pipeline: Optional[CoRGIPipeline]


def _default_factory(model_id: Optional[str]) -> CoRGIPipeline:
    config = QwenGenerationConfig(model_id=model_id or DEFAULT_MODEL_ID)
    return CoRGIPipeline(vlm_client=Qwen3VLClient(config=config))


def ensure_pipeline_state(
    previous: Optional[PipelineState],
    model_id: Optional[str],
    factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
) -> PipelineState:
    target_model = model_id or DEFAULT_MODEL_ID
    factory = factory or _default_factory
    if previous is not None and previous.model_id == target_model:
        return previous
    pipeline = factory(target_model)
    return PipelineState(model_id=target_model, pipeline=pipeline)


def format_result_markdown(result: PipelineResult) -> str:
    lines: list[str] = []
    lines.append("### Answer")
    lines.append(result.answer or "(no answer returned)")
    lines.append("")
    lines.append("### Reasoning Steps")
    if result.steps:
        for step in result.steps:
            needs = "yes" if step.needs_vision else "no"
            reason = f" — {step.reason}" if step.reason else ""
            lines.append(f"- **Step {step.index}**: {step.statement} _(needs vision: {needs})_{reason}")
    else:
        lines.append("- No reasoning steps returned.")
    lines.append("")
    lines.append("### Visual Evidence")
    if result.evidence:
        for ev in result.evidence:
            bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
            desc = ev.description or "(no description)"
            conf = f" — confidence {ev.confidence:.2f}" if ev.confidence is not None else ""
            lines.append(f"- Step {ev.step_index}: bbox=({bbox}) — {desc}{conf}")
    else:
        lines.append("- No visual evidence collected.")
    return "\n".join(lines)


def _run_pipeline(
    state: Optional[PipelineState],
    image: Image.Image | None,
    question: str,
    max_steps: int,
    max_regions: int,
    model_id: Optional[str],
    factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
) -> tuple[PipelineState, str]:
    if image is None:
        return state or PipelineState(model_id=model_id or DEFAULT_MODEL_ID, pipeline=None), "Please provide an image before running the demo."
    if not question.strip():
        return state or PipelineState(model_id=model_id or DEFAULT_MODEL_ID, pipeline=None), "Please enter a question before running the demo."
    new_state = ensure_pipeline_state(state if state and state.pipeline else None, model_id, factory)
    result = new_state.pipeline.run(
        image=image.convert("RGB"),
        question=question.strip(),
        max_steps=int(max_steps),
        max_regions=int(max_regions),
    )
    markdown = format_result_markdown(result)
    return new_state, markdown


def build_demo(
    pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
) -> "gradio.Blocks":
    try:
        import gradio as gr
    except ImportError as exc:  # pragma: no cover - exercised when gradio missing
        raise RuntimeError("Gradio is required to build the demo. Install gradio>=4.0.") from exc

    factory = pipeline_factory or _default_factory

    with gr.Blocks(title="CoRGI Qwen3-VL Demo") as demo:
        state = gr.State()  # stores PipelineState

        with gr.Row():
            with gr.Column(scale=1, min_width=320):
                image_input = gr.Image(label="Input image", type="pil")
                question_input = gr.Textbox(label="Question", placeholder="What is happening in the image?", lines=2)
                model_id_input = gr.Textbox(
                    label="Model ID",
                    value=DEFAULT_MODEL_ID,
                    placeholder="Leave blank to use default",
                )
                max_steps_slider = gr.Slider(
                    label="Max reasoning steps",
                    minimum=1,
                    maximum=6,
                    step=1,
                    value=4,
                )
                max_regions_slider = gr.Slider(
                    label="Max regions per step",
                    minimum=1,
                    maximum=6,
                    step=1,
                    value=4,
                )
                run_button = gr.Button("Run CoRGI")

            with gr.Column(scale=1, min_width=320):
                result_markdown = gr.Markdown(value="Upload an image and ask a question to begin.")

        def _on_submit(state_data, image, question, model_id, max_steps, max_regions):
            pipeline_state = state_data if isinstance(state_data, PipelineState) else None
            new_state, markdown = _run_pipeline(
                pipeline_state,
                image,
                question,
                int(max_steps),
                int(max_regions),
                model_id if model_id else None,
                factory,
            )
            return new_state, markdown

        run_button.click(
            fn=_on_submit,
            inputs=[state, image_input, question_input, model_id_input, max_steps_slider, max_regions_slider],
            outputs=[state, result_markdown],
        )

    return demo


def launch_demo(
    *,
    pipeline_factory: Callable[[Optional[str]], CoRGIPipeline] | None = None,
    **launch_kwargs,
) -> None:
    demo = build_demo(pipeline_factory=pipeline_factory)
    demo.launch(**launch_kwargs)


__all__ = [
    "PipelineState",
    "ensure_pipeline_state",
    "format_result_markdown",
    "build_demo",
    "launch_demo",
    "DEFAULT_MODEL_ID",
]
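`ensure_pipeline_state` only rebuilds the pipeline when the requested model id changes, which is what keeps repeated Space submissions from reloading the checkpoint. A small sketch with a counting stand-in factory (the factory and its `object()` return value are placeholders for a real `CoRGIPipeline`):

```python
from corgi.gradio_app import ensure_pipeline_state

calls = []

def counting_factory(model_id):
    # Stand-in for the real factory; records how often a pipeline is (re)built.
    calls.append(model_id)
    return object()  # placeholder for a CoRGIPipeline

state = ensure_pipeline_state(None, None, factory=counting_factory)      # built once
state = ensure_pipeline_state(state, None, factory=counting_factory)     # reused, no rebuild
state = ensure_pipeline_state(state, "Qwen/Qwen3-VL-4B-Instruct", factory=counting_factory)  # rebuilt

print(calls)  # ['Qwen/Qwen3-VL-8B-Thinking', 'Qwen/Qwen3-VL-4B-Instruct']
```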
corgi/parsers.py
ADDED
@@ -0,0 +1,390 @@
from __future__ import annotations

import json
import re
from typing import Any, Iterable, List

from .types import GroundedEvidence, ReasoningStep


_JSON_FENCE_RE = re.compile(r"```(?:json)?(.*?)```", re.DOTALL | re.IGNORECASE)
_STEP_MARKER_RE = re.compile(r"(?im)(?:^|\n)\s*(?:step\s*(\d+)|(\d+)[\.\)])\s*[:\-]?\s*")
_NEEDS_VISION_RE = re.compile(
    r"needs[\s_]*vision\s*[:\-]?\s*(?P<value>true|false|yes|no|required|not required|necessary|unnecessary)",
    re.IGNORECASE,
)
_REASON_RE = re.compile(r"reason\s*[:\-]\s*(?P<value>.+)", re.IGNORECASE)
_BOX_RE = re.compile(
    r"\[\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*\]"
)

_ORDINAL_WORD_MAP = {
    "first": 1,
    "second": 2,
    "third": 3,
    "fourth": 4,
    "fifth": 5,
    "sixth": 6,
    "seventh": 7,
    "eighth": 8,
    "ninth": 9,
    "tenth": 10,
}

_NUMBER_WORD_MAP = {
    "one": 1,
    "two": 2,
    "three": 3,
    "four": 4,
    "five": 5,
    "six": 6,
    "seven": 7,
    "eight": 8,
    "nine": 9,
    "ten": 10,
}

_ORDINAL_STEP_RE = re.compile(
    r"(?im)\b(?P<word>first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth)\s+step\b"
)
_WORD_STEP_RE = re.compile(
    r"(?im)\bstep\s+(?P<word>one|two|three|four|five|six|seven|eight|nine|ten)\b"
)

_META_TOKENS = {"maybe", "wait", "let's", "lets", "question", "protocol"}


def _to_bool(value: Any) -> bool:
    if isinstance(value, bool):
        return value
    if value is None:
        return False
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in {"true", "t", "yes", "y", "1"}:
            return True
        if lowered in {"false", "f", "no", "n", "0"}:
            return False
    return False


def _extract_json_strings(text: str) -> Iterable[str]:
    """Return candidate JSON payloads from the response text."""

    fenced = _JSON_FENCE_RE.findall(text)
    if fenced:
        for body in fenced:
            yield body.strip()
    stripped = text.strip()
    if stripped:
        yield stripped


def _load_first_json(text: str) -> Any:
    last_error = None
    for candidate in _extract_json_strings(text):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError as err:
            last_error = err
            continue
    if last_error:
        raise ValueError(f"Unable to parse JSON from response: {last_error}") from last_error
    raise ValueError("Empty response, cannot parse JSON.")


def _trim_reasoning_text(text: str) -> str:
    lowered = text.lower()
    for anchor in ("let's draft", "draft:", "structured steps", "final reasoning"):
        pos = lowered.rfind(anchor)
        if pos != -1:
            return text[pos:]
    return text


def _clean_sentence(text: str) -> str:
    return " ".join(text.strip().split())


def _normalize_step_markers(text: str) -> str:
    """Convert ordinal step markers into numeric form (e.g., 'First step' -> 'Step 1')."""

    def replace_ordinal(match: re.Match[str]) -> str:
        word = match.group("word").lower()
        num = _ORDINAL_WORD_MAP.get(word)
        return f"Step {num}" if num is not None else match.group(0)

    def replace_word_number(match: re.Match[str]) -> str:
        word = match.group("word").lower()
        num = _NUMBER_WORD_MAP.get(word)
        return f"Step {num}" if num is not None else match.group(0)

    normalized = _ORDINAL_STEP_RE.sub(replace_ordinal, text)
    normalized = _WORD_STEP_RE.sub(replace_word_number, normalized)
    return normalized


def _extract_statement(body: str) -> str | None:
    statement_match = re.search(r"statement\s*[:\-]\s*(.+)", body, re.IGNORECASE)
    candidate = statement_match.group(1) if statement_match else body
    # Remove trailing sections that describe vision or reason metadata.
    candidate = re.split(r"(?i)needs\s*vision|reason\s*[:\-]", candidate)[0]
    candidate = candidate.strip().strip(".")
    if not candidate:
        return None
    return _clean_sentence(candidate)


def _extract_needs_vision(body: str) -> bool:
    match = _NEEDS_VISION_RE.search(body)
    if not match:
        return True
    token = match.group("value").strip().lower()
    if token in {"not required", "unnecessary"}:
        return False
    if token in {"required", "necessary"}:
        return True
    return _to_bool(token)


def _extract_reason(body: str) -> str | None:
    match = _REASON_RE.search(body)
    if match:
        reason = match.group("value").strip()
        reason = re.split(r"(?i)needs\s*vision", reason)[0].strip()
        reason = reason.rstrip(".")
        return reason or None
    because_match = re.search(r"because\s+(.+?)(?:\.|$)", body, re.IGNORECASE)
    if because_match:
        reason = because_match.group(1).strip().rstrip(".")
        return reason or None
    return None


def _parse_step_block(index_guess: int, body: str) -> ReasoningStep | None:
    statement = _extract_statement(body)
    if not statement:
        return None
    needs_vision = _extract_needs_vision(body)
    reason = _extract_reason(body)
    index = index_guess if index_guess > 0 else 1
    return ReasoningStep(index=index, statement=statement, needs_vision=needs_vision, reason=reason)


def _parse_reasoning_from_text(response_text: str, max_steps: int) -> List[ReasoningStep]:
    text = _trim_reasoning_text(response_text)
    text = _normalize_step_markers(text)
    matches = list(_STEP_MARKER_RE.finditer(text))
    if not matches:
        return []
    steps_map: dict[int, ReasoningStep] = {}
    ordering: List[int] = []
    fallback_index = 1
    for idx, marker in enumerate(matches):
        start = marker.end()
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        body = text[start:end].strip()
        if not body:
            continue
        raw_index = marker.group(1) or marker.group(2)
        try:
            index_guess = int(raw_index) if raw_index else fallback_index
        except (TypeError, ValueError):
            index_guess = fallback_index
        if raw_index is None:
            fallback_index += 1
        step = _parse_step_block(index_guess, body)
        if step is None:
            continue
        if step.index not in steps_map:
            ordering.append(step.index)
            steps_map[step.index] = step
        if len(ordering) >= max_steps:
            break
    return [steps_map[idx] for idx in ordering[:max_steps]]


def _looks_like_meta_statement(statement: str) -> bool:
    lowered = statement.lower()
    if any(token in lowered for token in _META_TOKENS) and "step" in lowered:
        return True
    if lowered.startswith(("maybe", "wait", "let's", "lets")):
        return True
    if len(statement) > 260 and "step" in lowered:
        return True
    return False


def _prune_steps(steps: List[ReasoningStep]) -> List[ReasoningStep]:
    filtered: List[ReasoningStep] = []
    seen_statements: set[str] = set()
    for step in steps:
        normalized = step.statement.strip().lower()
        if _looks_like_meta_statement(step.statement):
            continue
        if normalized in seen_statements:
            continue
        seen_statements.add(normalized)
        filtered.append(step)
    return filtered or steps


def _extract_description(text: str, start_index: int) -> str | None:
    boundary = max(text.rfind("\n", 0, start_index), text.rfind(".", 0, start_index))
    if boundary == -1:
        boundary = 0
    snippet = text[boundary:start_index].strip(" \n.:–-")
    if not snippet:
        return None
    return _clean_sentence(snippet)


def _parse_roi_from_text(response_text: str, default_step_index: int) -> List[GroundedEvidence]:
    evidences: List[GroundedEvidence] = []
    seen: set[tuple[float, float, float, float]] = set()
    for match in _BOX_RE.finditer(response_text):
        coords_str = match.group(0).strip("[]")
        try:
            coords = [float(part.strip()) for part in coords_str.split(",")]
        except ValueError:
            continue
        if len(coords) != 4:
            continue
        try:
            bbox = _normalize_bbox(coords)
        except ValueError:
            continue
        key = tuple(round(c, 4) for c in bbox)
        if key in seen:
            continue
        description = _extract_description(response_text, match.start())
        evidences.append(
            GroundedEvidence(
                step_index=default_step_index,
                bbox=bbox,
                description=description,
                confidence=None,
                raw_source={"bbox": coords, "description": description},
            )
        )
        seen.add(key)
    return evidences


def parse_structured_reasoning(response_text: str, max_steps: int) -> List[ReasoningStep]:
    """Parse Qwen3-VL structured reasoning output into dataclasses."""

    try:
        payload = _load_first_json(response_text)
    except ValueError as json_error:
        steps = _parse_reasoning_from_text(response_text, max_steps=max_steps)
        if steps:
            return _prune_steps(steps)[:max_steps]
        raise json_error
    if not isinstance(payload, list):
        raise ValueError("Structured reasoning response must be a JSON list.")

    steps: List[ReasoningStep] = []
    for idx, item in enumerate(payload, start=1):
        if not isinstance(item, dict):
            continue
        statement = item.get("statement") or item.get("step") or item.get("text")
        if not isinstance(statement, str):
            continue
        statement = statement.strip()
        if not statement:
            continue
        step_index = item.get("index")
        if not isinstance(step_index, int):
            step_index = idx
        needs_vision = _to_bool(item.get("needs_vision") or item.get("requires_vision"))
        reason = item.get("reason") or item.get("justification")
        if isinstance(reason, str):
            reason = reason.strip() or None
        else:
            reason = None
        steps.append(ReasoningStep(index=step_index, statement=statement, needs_vision=needs_vision, reason=reason))
        if len(steps) >= max_steps:
            break
    steps = _prune_steps(steps)[:max_steps]
    if not steps:
        raise ValueError("No reasoning steps parsed from response.")
    return steps


def _normalize_bbox(bbox: Any) -> tuple[float, float, float, float]:
    if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
        raise ValueError(f"Bounding box must be a list of 4 numbers, got {bbox!r}")
    coords = []
    for raw in bbox:
        if isinstance(raw, str):
            raw = raw.strip()
            if not raw:
                raw = 0
            else:
                raw = float(raw)
        elif isinstance(raw, (int, float)):
            raw = float(raw)
        else:
            raw = 0.0
        coords.append(raw)
    scale = max(abs(v) for v in coords) if coords else 1.0
    if scale > 1.5:  # assume 0..1000 or pixel coordinates
        coords = [max(0.0, min(v / 1000.0, 1.0)) for v in coords]
    else:
        coords = [max(0.0, min(v, 1.0)) for v in coords]
    x1, y1, x2, y2 = coords
    x_min, x_max = sorted((x1, x2))
    y_min, y_max = sorted((y1, y2))
    return (x_min, y_min, x_max, y_max)


def parse_roi_evidence(response_text: str, default_step_index: int) -> List[GroundedEvidence]:
    """Parse ROI grounding output into evidence structures."""

    try:
        payload = _load_first_json(response_text)
    except ValueError:
        return _parse_roi_from_text(response_text, default_step_index=default_step_index)
    if not isinstance(payload, list):
        raise ValueError("ROI extraction response must be a JSON list.")

    evidences: List[GroundedEvidence] = []
    for item in payload:
        if not isinstance(item, dict):
            continue
        raw_bbox = item.get("bbox") or item.get("bbox_2d") or item.get("box")
        if raw_bbox is None:
            continue
        try:
            bbox = _normalize_bbox(raw_bbox)
        except ValueError:
            continue
        step_index = item.get("step") or item.get("step_index") or default_step_index
        if not isinstance(step_index, int):
            step_index = default_step_index
        description = item.get("description") or item.get("caption") or item.get("detail")
        if isinstance(description, str):
            description = description.strip() or None
        else:
            description = None
        confidence = item.get("confidence") or item.get("score") or item.get("probability")
        if isinstance(confidence, str):
            confidence = confidence.strip()
            confidence = float(confidence) if confidence else None
        elif isinstance(confidence, (int, float)):
            confidence = float(confidence)
        else:
            confidence = None
        evidences.append(
            GroundedEvidence(
                step_index=step_index,
                bbox=bbox,
                description=description,
                confidence=confidence,
                raw_source=item,
            )
        )
    return evidences
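`_normalize_bbox` accepts either normalized 0-1 coordinates or the 0-1000/pixel-style boxes Qwen tends to emit and clamps them into the unit square. A quick sketch through the public `parse_roi_evidence` entry point (the JSON reply is invented):

```python
from corgi.parsers import parse_roi_evidence

# Hypothetical grounding reply using the 0-1000 coordinate convention.
response = '[{"step": 2, "bbox": [120, 80, 540, 660], "description": "stop sign", "confidence": 0.88}]'

evidence = parse_roi_evidence(response, default_step_index=2)
print(evidence[0].bbox)         # (0.12, 0.08, 0.54, 0.66), rescaled to 0-1
print(evidence[0].description)  # stop sign
```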
corgi/pipeline.py
ADDED
@@ -0,0 +1,92 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Protocol

from PIL import Image

from .types import (
    GroundedEvidence,
    ReasoningStep,
    evidences_to_serializable,
    steps_to_serializable,
)


class SupportsQwenClient(Protocol):
    """Protocol describing the methods required from a Qwen3-VL client."""

    def structured_reasoning(self, image: Image.Image, question: str, max_steps: int) -> List[ReasoningStep]:
        ...

    def extract_step_evidence(
        self,
        image: Image.Image,
        question: str,
        step: ReasoningStep,
        max_regions: int,
    ) -> List[GroundedEvidence]:
        ...

    def synthesize_answer(
        self,
        image: Image.Image,
        question: str,
        steps: List[ReasoningStep],
        evidences: List[GroundedEvidence],
    ) -> str:
        ...


@dataclass(frozen=True)
class PipelineResult:
    """Aggregated output of the CoRGI pipeline."""

    question: str
    steps: List[ReasoningStep]
    evidence: List[GroundedEvidence]
    answer: str

    def to_json(self) -> dict:
        return {
            "question": self.question,
            "steps": steps_to_serializable(self.steps),
            "evidence": evidences_to_serializable(self.evidence),
            "answer": self.answer,
        }


class CoRGIPipeline:
    """Orchestrates the CoRGI reasoning pipeline using a Qwen3-VL client."""

    def __init__(self, vlm_client: SupportsQwenClient):
        if vlm_client is None:
            raise ValueError("A Qwen3-VL client instance must be provided.")
        self._vlm = vlm_client

    def run(
        self,
        image: Image.Image,
        question: str,
        max_steps: int = 4,
        max_regions: int = 4,
    ) -> PipelineResult:
        steps = self._vlm.structured_reasoning(image=image, question=question, max_steps=max_steps)
        evidences: List[GroundedEvidence] = []
        for step in steps:
            if not step.needs_vision:
                continue
            step_evs = self._vlm.extract_step_evidence(
                image=image,
                question=question,
                step=step,
                max_regions=max_regions,
            )
            if not step_evs:
                continue
            evidences.extend(step_evs[:max_regions])
        answer = self._vlm.synthesize_answer(image=image, question=question, steps=steps, evidences=evidences)
        return PipelineResult(question=question, steps=steps, evidence=evidences, answer=answer)


__all__ = ["CoRGIPipeline", "PipelineResult"]
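The project plan still lists a regression test for pipeline serialization; `PipelineResult.to_json` is the surface such a test would pin down. A sketch with invented values:

```python
import json

from corgi.pipeline import PipelineResult
from corgi.types import GroundedEvidence, ReasoningStep

result = PipelineResult(
    question="What colour is the bus?",
    steps=[ReasoningStep(index=1, statement="Check the bus colour.", needs_vision=True)],
    evidence=[GroundedEvidence(step_index=1, bbox=(0.2, 0.3, 0.7, 0.9), description="red double-decker")],
    answer="The bus is red.",
)

# Everything in to_json() is plain dict/list/str/float, so json.dumps works directly.
print(json.dumps(result.to_json(), indent=2))
```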
corgi/qwen_client.py
ADDED
|
@@ -0,0 +1,176 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
from dataclasses import dataclass
|
| 4 |
+
from typing import List, Optional
|
| 5 |
+
|
| 6 |
+
import torch
|
| 7 |
+
from PIL import Image
|
| 8 |
+
from transformers import AutoModelForImageTextToText, AutoProcessor
|
| 9 |
+
|
| 10 |
+
from .parsers import parse_roi_evidence, parse_structured_reasoning
|
| 11 |
+
from .types import GroundedEvidence, ReasoningStep
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
DEFAULT_REASONING_PROMPT = (
|
| 15 |
+
"You are a careful multimodal reasoner following the CoRGI protocol. "
|
| 16 |
+
"Given the question and the image, produce a JSON array of reasoning steps. "
|
| 17 |
+
"Each item must contain the keys: index (1-based integer), statement (concise sentence), "
|
| 18 |
+
"needs_vision (boolean true if the statement requires visual verification), and reason "
|
| 19 |
+
"(short phrase explaining why visual verification is or is not required). "
|
| 20 |
+
"Limit the number of steps to {max_steps}. Respond with JSON only; start the reply with '[' and end with ']'. "
|
| 21 |
+
"Do not add any commentary or prose outside of the JSON."
|
| 22 |
+
)
|
| 23 |
+
|
| 24 |
+
DEFAULT_GROUNDING_PROMPT = (
|
| 25 |
+
"You are validating the following reasoning step:\n"
|
| 26 |
+
"{step_statement}\n"
|
| 27 |
+
"Return a JSON array with up to {max_regions} region candidates that help verify the step. "
|
| 28 |
+
"Each object must include: step (integer), bbox (list of four numbers x1,y1,x2,y2, "
|
| 29 |
+
"either normalized 0-1 or scaled 0-1000), description (short textual evidence), "
|
| 30 |
+
"and confidence (0-1). Use [] if no relevant region exists. "
|
| 31 |
+
"Respond with JSON only; do not include explanations outside the JSON array."
|
| 32 |
+
)
|
| 33 |
+
|
| 34 |
+
DEFAULT_ANSWER_PROMPT = (
|
| 35 |
+
"You are finalizing the answer using verified evidence. "
|
| 36 |
+
"Question: {question}\n"
|
| 37 |
+
"Structured reasoning steps:\n"
|
| 38 |
+
"{steps}\n"
|
| 39 |
+
"Verified evidence items:\n"
|
| 40 |
+
"{evidence}\n"
|
| 41 |
+
"Respond with a concise final answer sentence grounded in the evidence. "
|
| 42 |
+
"If unsure, say you are uncertain. Do not include <think> tags or internal monologue."
|
| 43 |
+
)
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def _format_steps_for_prompt(steps: List[ReasoningStep]) -> str:
|
| 47 |
+
return "\n".join(
|
| 48 |
+
f"{step.index}. {step.statement} (needs vision: {step.needs_vision})"
|
| 49 |
+
for step in steps
|
| 50 |
+
)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
def _format_evidence_for_prompt(evidences: List[GroundedEvidence]) -> str:
    if not evidences:
        return "No evidence collected."
    lines = []
    for ev in evidences:
        desc = ev.description or "No description"
        bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
        conf = f"{ev.confidence:.2f}" if ev.confidence is not None else "n/a"
        lines.append(f"Step {ev.step_index}: bbox=({bbox}), conf={conf}, desc={desc}")
    return "\n".join(lines)


def _strip_think_content(text: str) -> str:
    if not text:
        return ""
    cleaned = text
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>", 1)[-1]
    cleaned = cleaned.replace("<think>", "")
    return cleaned.strip()


@dataclass
class QwenGenerationConfig:
    model_id: str = "Qwen/Qwen3-VL-8B-Thinking"
    max_new_tokens: int = 512
    temperature: float | None = None
    do_sample: bool = False


class Qwen3VLClient:
    """Wrapper around transformers Qwen3-VL chat API for CoRGI pipeline."""

    def __init__(
        self,
        config: Optional[QwenGenerationConfig] = None,
    ) -> None:
        self.config = config or QwenGenerationConfig()
        torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
        self._model = AutoModelForImageTextToText.from_pretrained(
            self.config.model_id,
            torch_dtype=torch_dtype,
            device_map="auto",
        )
        self._processor = AutoProcessor.from_pretrained(self.config.model_id)

    def _chat(
        self,
        image: Image.Image,
        prompt: str,
        max_new_tokens: Optional[int] = None,
    ) -> str:
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": prompt},
                ],
            }
        ]
        chat_prompt = self._processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
        )
        inputs = self._processor(
            text=[chat_prompt],
            images=[image],
            return_tensors="pt",
        ).to(self._model.device)
        gen_kwargs = {
            "max_new_tokens": max_new_tokens or self.config.max_new_tokens,
            "do_sample": self.config.do_sample,
        }
        if self.config.do_sample and self.config.temperature is not None:
            gen_kwargs["temperature"] = self.config.temperature
        output_ids = self._model.generate(**inputs, **gen_kwargs)
        prompt_length = inputs.input_ids.shape[1]
        generated_tokens = output_ids[:, prompt_length:]
        response = self._processor.batch_decode(
            generated_tokens,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )[0]
        return response.strip()

    def structured_reasoning(self, image: Image.Image, question: str, max_steps: int) -> List[ReasoningStep]:
        prompt = DEFAULT_REASONING_PROMPT.format(max_steps=max_steps) + f"\nQuestion: {question}"
        response = self._chat(image=image, prompt=prompt)
        return parse_structured_reasoning(response, max_steps=max_steps)

    def extract_step_evidence(
        self,
        image: Image.Image,
        question: str,
        step: ReasoningStep,
        max_regions: int,
    ) -> List[GroundedEvidence]:
        prompt = DEFAULT_GROUNDING_PROMPT.format(
            step_statement=step.statement,
            max_regions=max_regions,
        )
        response = self._chat(image=image, prompt=prompt, max_new_tokens=256)
        evidences = parse_roi_evidence(response, default_step_index=step.index)
        return evidences[:max_regions]

    def synthesize_answer(
        self,
        image: Image.Image,
        question: str,
        steps: List[ReasoningStep],
        evidences: List[GroundedEvidence],
    ) -> str:
        prompt = DEFAULT_ANSWER_PROMPT.format(
            question=question,
            steps=_format_steps_for_prompt(steps),
            evidence=_format_evidence_for_prompt(evidences),
        )
        response = self._chat(image=image, prompt=prompt, max_new_tokens=256)
        return _strip_think_content(response)


__all__ = ["Qwen3VLClient", "QwenGenerationConfig"]
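
For orientation, a minimal usage sketch of the client above (not part of the commit). The image path and question are placeholders, and the staging mirrors what `CoRGIPipeline` in `corgi/pipeline.py` is expected to orchestrate: structured reasoning, step-wise grounding, then answer synthesis.

```python
# Minimal sketch, assuming a local image file and enough GPU/CPU memory for the checkpoint.
from PIL import Image

from corgi.qwen_client import Qwen3VLClient, QwenGenerationConfig

client = Qwen3VLClient(QwenGenerationConfig(max_new_tokens=512))  # loads the full Qwen3-VL model
image = Image.open("demo.jpeg").convert("RGB")                    # placeholder path
question = "Is anyone in the image wearing a white watch?"

# 1) Structured reasoning steps.
steps = client.structured_reasoning(image=image, question=question, max_steps=4)

# 2) Grounded evidence, only for steps flagged as needing vision.
evidences = []
for step in steps:
    if step.needs_vision:
        evidences.extend(
            client.extract_step_evidence(image=image, question=question, step=step, max_regions=4)
        )

# 3) Answer synthesis over steps and collected evidence.
print(client.synthesize_answer(image=image, question=question, steps=steps, evidences=evidences))
```
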
corgi/types.py
ADDED
@@ -0,0 +1,61 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class ReasoningStep:
    """Represents a single structured reasoning step."""

    index: int
    statement: str
    needs_vision: bool
    reason: Optional[str] = None


@dataclass(frozen=True)
class GroundedEvidence:
    """Evidence item grounded to a region of interest in the image."""

    step_index: int
    bbox: BBox
    description: Optional[str] = None
    confidence: Optional[float] = None
    raw_source: Optional[Dict[str, object]] = None


def steps_to_serializable(steps: List[ReasoningStep]) -> List[Dict[str, object]]:
    """Helper to convert steps into JSON-friendly dictionaries."""

    return [
        {
            "index": step.index,
            "statement": step.statement,
            "needs_vision": step.needs_vision,
            **({"reason": step.reason} if step.reason is not None else {}),
        }
        for step in steps
    ]


def evidences_to_serializable(evidences: List[GroundedEvidence]) -> List[Dict[str, object]]:
    """Helper to convert evidences into JSON-friendly dictionaries."""

    serializable: List[Dict[str, object]] = []
    for ev in evidences:
        item: Dict[str, object] = {
            "step_index": ev.step_index,
            "bbox": list(ev.bbox),
        }
        if ev.description is not None:
            item["description"] = ev.description
        if ev.confidence is not None:
            item["confidence"] = ev.confidence
        if ev.raw_source is not None:
            item["raw_source"] = ev.raw_source
        serializable.append(item)
    return serializable
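
A quick illustration of the serialization helpers above; the step and evidence values below are invented purely for the example.

```python
import json

from corgi.types import (
    GroundedEvidence,
    ReasoningStep,
    evidences_to_serializable,
    steps_to_serializable,
)

# Invented values, for illustration only.
steps = [
    ReasoningStep(index=1, statement="Check the left wrist of each person.", needs_vision=True),
]
evidences = [
    GroundedEvidence(step_index=1, bbox=(0.12, 0.40, 0.28, 0.55), description="white watch", confidence=0.88),
]

payload = {
    "steps": steps_to_serializable(steps),
    "evidence": evidences_to_serializable(evidences),
}
# bbox tuples become plain lists and None-valued fields are omitted.
print(json.dumps(payload, indent=2))
```
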
examples/demo_qwen_corgi.py
ADDED
@@ -0,0 +1,85 @@
#!/usr/bin/env python
"""Run CoRGI pipeline on the Qwen3-VL demo image and question.

Usage:
    python examples/demo_qwen_corgi.py [--model-id Qwen/Qwen3-VL-8B-Thinking]

If the demo image cannot be downloaded automatically, set the environment
variable `CORGI_DEMO_IMAGE` to a local file path.
"""

from __future__ import annotations

import argparse
import os
from io import BytesIO
from pathlib import Path
from urllib.request import urlopen

from PIL import Image

from corgi.pipeline import CoRGIPipeline
from corgi.qwen_client import Qwen3VLClient, QwenGenerationConfig

DEMO_IMAGE_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
DEMO_QUESTION = "How many people are there in the image? Is there any one who is wearing a white watch?"


def fetch_demo_image() -> Image.Image:
    if path := os.getenv("CORGI_DEMO_IMAGE"):
        return Image.open(path).convert("RGB")
    with urlopen(DEMO_IMAGE_URL) as resp:  # nosec B310 - trusted URL from official demo asset
        data = resp.read()
    return Image.open(BytesIO(data)).convert("RGB")


def format_steps(pipeline_result) -> str:
    lines = ["Reasoning steps:"]
    for step in pipeline_result.steps:
        needs = "yes" if step.needs_vision else "no"
        reason = f" (reason: {step.reason})" if step.reason else ""
        lines.append(f" [{step.index}] {step.statement} — needs vision: {needs}{reason}")
    return "\n".join(lines)


def format_evidence(pipeline_result) -> str:
    lines = ["Visual evidence:"]
    if not pipeline_result.evidence:
        lines.append(" (no evidence returned)")
        return "\n".join(lines)
    for ev in pipeline_result.evidence:
        bbox = ", ".join(f"{coord:.2f}" for coord in ev.bbox)
        desc = ev.description or "(no description)"
        conf = f", conf={ev.confidence:.2f}" if ev.confidence is not None else ""
        lines.append(f" Step {ev.step_index}: bbox=({bbox}), desc={desc}{conf}")
    return "\n".join(lines)


def main() -> int:
    parser = argparse.ArgumentParser(description="Run CoRGI pipeline with the real Qwen3-VL model.")
    parser.add_argument("--model-id", default="Qwen/Qwen3-VL-8B-Thinking", help="Hugging Face model id for Qwen3-VL")
    parser.add_argument("--max-steps", type=int, default=4)
    parser.add_argument("--max-regions", type=int, default=4)
    args = parser.parse_args()

    image = fetch_demo_image()
    client = Qwen3VLClient(QwenGenerationConfig(model_id=args.model_id))
    pipeline = CoRGIPipeline(client)

    result = pipeline.run(
        image=image,
        question=DEMO_QUESTION,
        max_steps=args.max_steps,
        max_regions=args.max_regions,
    )

    print(f"Question: {DEMO_QUESTION}")
    print(format_steps(result))
    print(format_evidence(result))
    print("Answer:", result.answer)

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
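
Usage note: as the module docstring says, the script falls back to a local asset when the demo URL is unreachable, so an offline run would look like `CORGI_DEMO_IMAGE=/path/to/demo.jpeg python examples/demo_qwen_corgi.py --max-steps 4 --max-regions 4` (flags shown are the script's own defaults).
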
requirements.txt
ADDED
@@ -0,0 +1,7 @@
accelerate>=0.34
transformers>=4.45
pillow
torch
gradio>=4.44
hydra-core
antlr4-python3-runtime
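
The Space installs this file automatically at build time; for a local checkout the same environment comes from `pip install -r requirements.txt`, with a CUDA-enabled `torch` build recommended since the client only falls back to float32 on CPU.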