Enhance model card for Uni-CoT with metadata, overview, and usage (#2)

- Enhance model card for Uni-CoT with metadata, overview, and usage (3ab5abfbd4d065dc29ddd01c3c8961999fa4248c)

Co-authored-by: Niels Rogge

Files changed (1)

README.md +186 -3

@@ -1,3 +1,186 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
+---
+<p align="center">
+  <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/logo.png" alt="Uni-CoT" width="400"/>
+</p>
+# Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
+This repository contains the **UniCoT-7B-MoT** model, a unified Chain-of-Thought (CoT) framework for multimodal reasoning across text and vision. It was introduced in the paper [Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision](https://huggingface.co/papers/2508.05606).
+**Project Page**: [https://sais-fuxi.github.io/projects/uni-cot/](https://sais-fuxi.github.io/projects/uni-cot/)
+**Code**: [https://github.com/Fr0zenCrane/UniCoT](https://github.com/Fr0zenCrane/UniCoT)
+<p align="center">
+  <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/teaser.png" width="900"/>
+</p>
+## Overview
+Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. Uni-CoT extends these principles to vision-language reasoning, enabling coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states.
+The Uni-CoT framework adopts a novel two-level hierarchical reasoning architecture:
+1.  **Macro-Level CoT**: Decomposes a complex task into simpler subtasks and synthesizes their outcomes to derive the final answer. This includes strategies like Sequential, Parallel, and Progressive Refinement Decomposition.
+2.  **Micro-Level CoT**: Focuses on executing individual subtasks, incorporating a *Self-Check (Self-Reflection) Mechanism* to ensure stable and high-quality results.
+This design significantly reduces computational overhead, allowing Uni-CoT to perform scalable and coherent multimodal reasoning. It aims to solve complex multimodal tasks, including:
+*   🎨 Reliable image generation and editing
+*   🔍 Visual and physical reasoning
+*   🧩 Visual planning
+*   📖 Multimodal story understanding
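The two-level structure described above can be pictured as a simple control loop. The following is only a toy sketch of that control flow, not the model's implementation: `run_two_level_cot`, `decompose`, `execute`, `self_check`, and `max_retries` are hypothetical names, and the stub functions stand in for what the model itself would do. It covers only the sequential decomposition strategy.

```python
from typing import Callable, List


def run_two_level_cot(
    task: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], str],
    self_check: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Toy control flow: macro-level decomposition, micro-level execution with self-check."""
    results = []
    for subtask in decompose(task):  # macro level: sequential decomposition
        result = execute(subtask)  # micro level: first attempt
        retries = 0
        # Self-check: re-run the subtask until it passes or the retry budget is spent.
        while not self_check(subtask, result) and retries < max_retries:
            result = execute(subtask)
            retries += 1
        results.append(result)
    return results


# Hypothetical stubs, used only to exercise the loop.
def split_on_then(task: str) -> List[str]:
    return [part.strip() for part in task.split(" then ")]


def fake_execute(subtask: str) -> str:
    return f"done: {subtask}"


def always_ok(subtask: str, result: str) -> bool:
    return True


demo_results = run_two_level_cot(
    "draw a cup then fill it with tea", split_on_then, fake_execute, always_ok
)
```

Bounding the retries keeps a subtask that never passes its check from looping forever; the last attempt is kept in that case.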
+### 🧠 Reasoning Pipeline
+<p align="center">
+  <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/pipeline.png" width="900"/>
+</p>
+## Quickstart
+### Installation
+The environment setup of Uni-CoT is consistent with its base model, [Bagel](https://github.com/ByteDance-Seed/Bagel).
+```bash
+git clone https://github.com/Fr0zenCrane/UniCoT.git
+cd UniCoT
+conda create -n unicot python=3.10 -y
+conda activate unicot
+pip install -r requirements.txt
+pip install flash_attn==2.5.8 --no-build-isolation
+```
+### Model Download
+You can download the [checkpoint](https://huggingface.co/Fr0zencr4nE/UniCoT-7B-MoT) directly from Hugging Face or use the following script:
+```python
+from huggingface_hub import snapshot_download
+save_dir = "models/UniCoT-7B-MoT"
+repo_id = "Fr0zencr4nE/UniCoT-7B-MoT"
+cache_dir = save_dir + "/cache"
+snapshot_download(cache_dir=cache_dir,
+  local_dir=save_dir,
+  repo_id=repo_id,
+  local_dir_use_symlinks=False,
+  resume_download=True,
+  allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
+)
+```
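After the download finishes, a quick sanity check can confirm that weight and config files actually landed in the target directory. This is a minimal sketch using only the standard library; `summarize_checkpoint` is a hypothetical helper, and it assumes the files of interest sit at the top level of `save_dir` (subfolders such as the cache directory are ignored).

```python
from collections import Counter
from pathlib import Path


def summarize_checkpoint(save_dir):
    """Count the top-level files in save_dir by file extension."""
    root = Path(save_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    # Only direct children are counted, so nested cache folders are skipped.
    return Counter(p.suffix for p in root.iterdir() if p.is_file())


# Example usage (path taken from the download script above):
# counts = summarize_checkpoint("models/UniCoT-7B-MoT")
# if counts[".safetensors"] == 0:
#     print("No .safetensors files found; the download may be incomplete.")
```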
+### Self-check Reasoning
+To perform evaluation or general inference using UniCoT-7B-MoT, you need at least one GPU with 40GB or more VRAM.
+#### Evaluation on WISE benchmark
+To reproduce the results on the WISE benchmark, you can use the script `./scripts/run_wise_self_reflection.sh`. Specify your local UniCoT-7B-MoT checkpoint and the output directory using `--model_path` and `--outdir`.
+```bash
+gpu_num=8
+for i in $(seq 0 $((gpu_num-1)));
+do
+    CUDA_VISIBLE_DEVICES=$i python inference_mdp_self_reflection_wise.py \
+        --group_id $i \
+        --group_num $gpu_num \
+        --model_path "Fr0zencr4nE/UniCoT-7B-MoT" \
+        --data_path "./eval/gen/wise/final_data.json" \
+        --outdir "./results" \
+        --cfg_text_scale 4 > process_log_$i.log 2>&1 &
+done
+wait
+echo "All background processes finished."
+```
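The loop above launches one process per GPU and passes each a `--group_id` and `--group_num`. How the inference script divides the data between groups is not shown here; the sketch below illustrates one common scheme, a strided split, purely as an assumption. `shard` is a hypothetical helper, not a function from the repository.

```python
def shard(items, group_id, group_num):
    """Return the slice of items that worker group_id would handle out of group_num workers."""
    if group_num <= 0 or not 0 <= group_id < group_num:
        raise ValueError("group_id must satisfy 0 <= group_id < group_num")
    # Strided split: worker i takes items i, i + group_num, i + 2 * group_num, ...
    return items[group_id::group_num]


# Ten hypothetical prompts spread over four workers.
prompts = [f"prompt {n}" for n in range(10)]
shards = [shard(prompts, i, 4) for i in range(4)]
```

With a strided split every item is assigned to exactly one worker, and shard sizes differ by at most one, which keeps the per-GPU workload roughly balanced.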
+#### General Inference
+For general inference, prepare your prompts by formatting them into a `.txt` file, with one prompt per line (e.g., `test_prompts.txt`). Then, use the script `./scripts/run_user_self_reflection.sh` to generate images from your prompts with the added benefit of the self-reflection mechanism.
+```bash
+gpu_num=8
+for i in $(seq 0 $((gpu_num-1)));
+do
+    CUDA_VISIBLE_DEVICES=$i python inference_mdp_self_reflection.py \
+        --group_id $i \
+        --group_num $gpu_num \
+        --model_path "Fr0zencr4nE/UniCoT-7B-MoT" \
+        --data_path "./test_prompts.txt" \
+        --outdir "./results" \
+        --cfg_text_scale 4 > process_log_$i.log 2>&1 &
+done
+wait
+echo "All background processes finished."
+```
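The prompt file expected by the general-inference script is plain text with one prompt per line. A minimal sketch for producing such a file is shown below; `write_prompts` is a hypothetical helper and the example prompts are placeholders. Internal newlines are collapsed so that each prompt stays on a single line.

```python
from pathlib import Path


def write_prompts(prompts, path="test_prompts.txt"):
    """Write one prompt per line, skipping empty entries. Returns the number written."""
    # Collapse all whitespace (including newlines) so each prompt occupies one line.
    cleaned = [" ".join(p.split()) for p in prompts if p.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Example usage with placeholder prompts:
# write_prompts([
#     "A glass of water left outside on a freezing night",
#     "A traditional festival lantern",
# ])
```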
+## Preliminary Results
+### Qualitative Results for Image Generation
+<p align="left">
+  <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/qualitative_results_generation.png" width="800"/>
+</p>
+### Qualitative Results for Image Editing
+<p align="left">
+  <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/qualitative_results_editing.png" width="800"/>
+</p>
+### Quantitative Results on WISE
+We first conduct experiments on the [WISE](https://github.com/PKU-YuanGroup/WISE) dataset to evaluate the reasoning capabilities of our method. As shown in the table below, our model achieves state-of-the-art (SOTA) performance among existing open-source unified models. Our results are averaged over five independent runs to ensure robustness and reliability.
+|               | Culture↑ | Time↑   | Space↑  | Biology↑ | Physics↑ | Chemistry↑ | Overall↑ |
+|---------------|----------|---------|---------|----------|----------|------------|----------|
+| Janus         | 0.16     | 0.26    | 0.35    | 0.28     | 0.30     | 0.14       | 0.23     |
+| MetaQuery     | 0.56     | 0.55    | 0.62    | 0.49     | 0.63     | 0.41       | 0.55     |
+| Bagel-Think   | 0.76     | 0.69    | 0.75    | 0.65     | 0.75     | 0.58       | 0.70     |
+| **Uni-CoT**   | **0.76**±0.009 | **0.70**±0.0256 | **0.76**±0.006 | **0.73**±0.021 | **0.81**±0.018 | **0.73**±0.020 | **0.75**±0.013 |
+| *GPT-4o*      | *0.81*   | *0.71*  | *0.89*  | *0.83*   | *0.79*   | *0.74*     | *0.80*   |
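To make the comparison with the base model easier to read, the per-category difference between the Uni-CoT and Bagel-Think rows can be computed directly. The sketch below uses only the mean scores from the WISE table above; the variable names are illustrative.

```python
categories = ["Culture", "Time", "Space", "Biology", "Physics", "Chemistry", "Overall"]

# Mean scores copied from the WISE table (Bagel-Think and Uni-CoT rows).
bagel_think = [0.76, 0.69, 0.75, 0.65, 0.75, 0.58, 0.70]
uni_cot = [0.76, 0.70, 0.76, 0.73, 0.81, 0.73, 0.75]

# Absolute gain of Uni-CoT over Bagel-Think, rounded to the table's precision.
gains = {c: round(u - b, 2) for c, b, u in zip(categories, bagel_think, uni_cot)}

# Category with the largest gain, ignoring the aggregate "Overall" column.
largest_gain = max((c for c in categories if c != "Overall"), key=lambda c: gains[c])

for category, gain in gains.items():
    print(f"{category:<10} {gain:+.2f}")
```

On these numbers the gains are concentrated in Chemistry, Biology, and Physics, while Culture, Time, and Space change by at most 0.01.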
+Furthermore, we apply our self-check mechanism to the images generated by the original Bagel model with think mode, aiming to evaluate our method’s ability to calibrate erroneous outputs.
+The results in the table below demonstrate that our model effectively refines the imperfect outputs generated by Bagel.
+|                     | Culture↑ | Time↑ | Space↑ | Biology↑ | Physics↑ | Chemistry↑ | Overall↑ |
+|---------------------|----------|-------|--------|----------|----------|------------|----------|
+| Bagel-Think         | 0.76     | 0.69  | 0.75   | 0.65     | 0.75     | 0.58       | 0.70     |
+| Bagel-Think+Uni-CoT | 0.75     | 0.70  | 0.75   | 0.71     | 0.74     | 0.69       | 0.73     |
+| **Uni-CoT**         | **0.76**±0.009 | **0.70**±0.0256 | **0.76**±0.006 | **0.73**±0.021 | **0.81**±0.018 | **0.73**±0.020 | **0.75**±0.013 |
+| *GPT-4o*            | *0.81*   | *0.71* | *0.89* | *0.83*   | *0.79*   | *0.74*     | *0.80*   |
+### Quantitative Results on [KRIS Bench](https://github.com/mercurystraw/Kris_Bench)
+On the KRIS benchmark, Uni-CoT achieves the highest overall score among the compared models other than GPT-4o, surpassing even the closed-source Gemini 2.0.
+| Model           | Attribute Perception | Spatial Perception | Temporal Perception | Factual Avg | Social Science | Natural Science | Conceptual Avg | Logical Reasoning | Instruction Decomposition | Procedural Avg | Overall Score |
+|----------------|----------------------|---------------------|----------------------|-------------|----------------|------------------|----------------|--------------------|-----------------------------|----------------|----------------|
+| Gemini 2.0 (Google)        | 66.33               | 63.33              | 63.92               | 65.26      | 68.19         | 56.94           | 59.65         | 54.13              | 71.67                       | 62.90          | 62.41           |
+| Step 3∅ vision (StepFun)   | 69.67               | 61.08              | 63.25               | 66.70      | 66.88         | 60.88           | 62.32         | 49.06              | 54.92                       | 51.99          | 61.43           |
+| Doubao (ByteDance)         | 70.92               | 59.17              | 40.58               | 63.30      | 65.50         | 61.19           | 62.23         | 47.75              | 60.58                       | 54.17          | 60.70           |
+| BAGEL (ByteDance)          | 64.27               | 62.42              | 42.45               | 60.26      | 55.40         | 56.01           | 55.86         | 52.54              | 50.56                       | 51.69          | 56.21           |
+| BAGEL-Think (ByteDance)| 67.42               | 68.33              | 58.67               | 66.18      | 63.55         | 61.40           | 61.92         | 48.12              | 50.22                       | 49.02          | 60.18        |
+| **Uni-CoT**       | **72.76**               | **72.87**              | **67.10**               | **71.85**      | **70.81**         | **66.00**           | **67.16**         | **53.43**              | **73.93**                       | **63.68**          | **68.00**         |
+| *GPT-4o* (OpenAI)        | *83.17*               | *79.08*              | *68.25*               | *79.80*      | *85.50*         | *80.06*           | *81.37*         | *71.56*              | *85.08*                       | *78.32*          | *80.09*         |
+## Citation
+If you find Uni-CoT useful for your research, please consider citing the paper:
+```bibtex
+@misc{qin2025unicot,
+      title={Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision},
+      author={Luozheng Qin and Jia Gong and Yuqing Sun and Tianjiao Li and Mengping Yang and Xiaomeng Yang and Chao Qu and Zhiyu Tan and Hao Li},
+      year={2025},
+      eprint={2508.05606},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2508.05606},
+}
+```
+## Acknowledgement
+We acknowledge the contributions of the following projects that inspired and supported Uni-CoT:
+- [Bagel](https://github.com/ByteDance-Seed/Bagel) proposed by the ByteDance-Seed team.
+- [WISE](https://github.com/PKU-YuanGroup/WISE) proposed by PKU-YuanGroup.
+- [KRIS-Bench](https://github.com/mercurystraw/Kris_Bench) proposed by StepFun.
+- [RISE-Bench](https://github.com/PhoenixZ810/RISEBench) proposed by Shanghai AI Lab.