Fr0zencr4nE and nielsr (HF Staff) committed
Commit d8bdc4d · verified · 1 Parent(s): 6f448f6

Enhance model card for Uni-CoT with metadata, overview, and usage (#2)


- Enhance model card for Uni-CoT with metadata, overview, and usage (3ab5abfbd4d065dc29ddd01c3c8961999fa4248c)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +186 -3
README.md CHANGED
@@ -1,3 +1,186 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: any-to-any
+ library_name: transformers
+ ---
+
+ <p align="center">
+ <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/logo.png" alt="Uni-CoT" width="400"/>
+ </p>
+
+ # Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
+
+ This repository contains the **UniCoT-7B-MoT** model, a unified Chain-of-Thought (CoT) framework for multimodal reasoning across text and vision. It was introduced in the paper [Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision](https://huggingface.co/papers/2508.05606).
+
+ **Project Page**: [https://sais-fuxi.github.io/projects/uni-cot/](https://sais-fuxi.github.io/projects/uni-cot/)
+ **Code**: [https://github.com/Fr0zenCrane/UniCoT](https://github.com/Fr0zenCrane/UniCoT)
+
+ <p align="center">
+ <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/teaser.png" width="900"/>
+ </p>
+
+ ## Overview
+ Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. Uni-CoT extends these principles to vision-language reasoning, enabling coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states.
+
+ The Uni-CoT framework adopts a novel two-level hierarchical reasoning architecture:
+
+ 1. **Macro-Level CoT**: Decomposes a complex task into simpler subtasks and synthesizes their outcomes to derive the final answer. This includes strategies like Sequential, Parallel, and Progressive Refinement Decomposition.
+ 2. **Micro-Level CoT**: Focuses on executing individual subtasks, incorporating a *Self-Check (Self-Reflection) Mechanism* to ensure stable and high-quality results.
+
+ This design significantly reduces computational overhead, allowing Uni-CoT to perform scalable and coherent multimodal reasoning (a minimal illustrative sketch of this control flow follows the task list below). It aims to solve complex multimodal tasks, including:
+ * 🎨 Reliable image generation and editing
+ * 🔍 Visual and physical reasoning
+ * 🧩 Visual planning
+ * 📖 Multimodal story understanding
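+
+ As a rough illustration of this two-level control flow, the hypothetical Python sketch below mimics the macro-level decomposition and the micro-level execute/self-check loop. The `macro_plan`, `micro_execute`, `self_check`, and `synthesize` functions are illustrative placeholders, not the actual Uni-CoT API.
+
+ ```python
+ # Minimal, illustrative sketch of the macro/micro CoT hierarchy with a self-check loop.
+ # All functions below are hypothetical placeholders, NOT the real Uni-CoT interfaces.
+
+ def macro_plan(task: str) -> list[str]:
+     """Macro-level CoT: decompose the task into simpler subtasks (placeholder)."""
+     return [f"subtask {i} of: {task}" for i in range(2)]
+
+ def micro_execute(subtask: str) -> str:
+     """Micro-level CoT: execute one subtask, e.g. generate or edit an image (placeholder)."""
+     return f"result({subtask})"
+
+ def self_check(subtask: str, result: str) -> bool:
+     """Self-check / self-reflection: accept or reject a subtask result (placeholder)."""
+     return True
+
+ def synthesize(results: list[str]) -> str:
+     """Macro-level CoT: combine subtask outcomes into the final answer (placeholder)."""
+     return " | ".join(results)
+
+ def uni_cot(task: str, max_retries: int = 2) -> str:
+     results = []
+     for subtask in macro_plan(task):        # macro level: sequential decomposition
+         for _ in range(max_retries + 1):    # micro level: execute, then self-check
+             result = micro_execute(subtask)
+             if self_check(subtask, result):
+                 break                       # keep the first result that passes the check
+         results.append(result)
+     return synthesize(results)
+
+ print(uni_cot("draw a red cube, then recolor it blue"))
+ ```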
+
+ ### 🧠 Reasoning Pipeline
+ <p align="center">
+ <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/pipeline.png" width="900"/>
+ </p>
+
+ ## Quickstart
+
+ ### Installation
+
+ The environment setup of Uni-CoT is consistent with its base model, [Bagel](https://github.com/ByteDance-Seed/Bagel).
+
+ ```bash
+ git clone https://github.com/Fr0zenCrane/UniCoT.git
+ cd UniCoT
+ conda create -n unicot python=3.10 -y
+ conda activate unicot
+ pip install -r requirements.txt
+ pip install flash_attn==2.5.8 --no-build-isolation
+ ```
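+
+ Optionally, as a quick environment sanity check (not part of the original setup steps), you can confirm that PyTorch sees a GPU and that the FlashAttention build imports cleanly:
+
+ ```python
+ # Optional sanity check for the environment created above.
+ import torch
+ import flash_attn  # raises ImportError if the flash_attn build failed
+
+ print("CUDA available:", torch.cuda.is_available())
+ if torch.cuda.is_available():
+     print("GPU:", torch.cuda.get_device_name(0))
+ print("flash_attn version:", flash_attn.__version__)
+ ```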
+
+ ### Model Download
+
+ You can download the [checkpoint](https://huggingface.co/Fr0zencr4nE/UniCoT-7B-MoT) directly from the Hugging Face Hub or use the following script:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ save_dir = "models/UniCoT-7B-MoT"
+ repo_id = "Fr0zencr4nE/UniCoT-7B-MoT"
+ cache_dir = save_dir + "/cache"
+
+ # Download config, weights, and docs into save_dir (other repo files are skipped).
+ snapshot_download(
+     repo_id=repo_id,
+     cache_dir=cache_dir,
+     local_dir=save_dir,
+     local_dir_use_symlinks=False,  # ignored (deprecated) on recent huggingface_hub versions
+     resume_download=True,          # likewise deprecated; downloads resume by default
+     allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
+ )
+ ```
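+
+ Once the download finishes, a quick optional check (not part of the original script) that the weight files actually landed in `save_dir` could look like this; the inference scripts below can then point `--model_path` at this local directory instead of the Hub repo id.
+
+ ```python
+ # Optional: confirm that the checkpoint files were downloaded into save_dir.
+ from pathlib import Path
+
+ save_dir = Path("models/UniCoT-7B-MoT")
+ weights = sorted(save_dir.glob("*.safetensors")) + sorted(save_dir.glob("*.bin"))
+ total_gb = sum(f.stat().st_size for f in weights) / 1024**3
+
+ print(f"{len(weights)} weight file(s), {total_gb:.1f} GB total")
+ for f in weights:
+     print(" -", f.name)
+ ```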
+
+ ### Self-check Reasoning
+
+ To perform evaluation or general inference with UniCoT-7B-MoT, you need at least one GPU with 40 GB or more of VRAM.
+
+ #### Evaluation on the WISE benchmark
+ To reproduce the results on the WISE benchmark, use the script `./scripts/run_wise_self_reflection.sh`. Specify your local UniCoT-7B-MoT checkpoint and output directory with `--model_path` and `--outdir`.
+
+ ```bash
+ # Shard the WISE prompts across GPUs and run one self-reflection worker per GPU.
+ gpu_num=8
+
+ for i in $(seq 0 $((gpu_num-1)));
+ do
+     CUDA_VISIBLE_DEVICES=$i python inference_mdp_self_reflection_wise.py \
+         --group_id $i \
+         --group_num $gpu_num \
+         --model_path "Fr0zencr4nE/UniCoT-7B-MoT" \
+         --data_path "./eval/gen/wise/final_data.json" \
+         --outdir "./results" \
+         --cfg_text_scale 4 > process_log_$i.log 2>&1 &
+ done
+
+ wait
+ echo "All background processes finished."
+ ```
+
+ #### General Inference
+ For general inference, prepare your prompts by formatting them into a `.txt` file, with one prompt per line (e.g., `test_prompts.txt`). Then, use the script `./scripts/run_user_self_reflection.sh` to generate images from your prompts with the added benefit of the self-reflection mechanism.
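+
+ For example, a minimal prompts file could be created like this (the prompts themselves are only illustrative):
+
+ ```python
+ # Write one prompt per line; the file name matches the --data_path used in the script below.
+ prompts = [
+     "A red panda reading a book under a maple tree in autumn",
+     "A glass of water tipping over in zero gravity",
+ ]
+ with open("test_prompts.txt", "w", encoding="utf-8") as f:
+     f.write("\n".join(prompts) + "\n")
+ ```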
+
+ ```bash
+ # Shard the prompts in test_prompts.txt across GPUs and run one worker per GPU.
+ gpu_num=8
+
+ for i in $(seq 0 $((gpu_num-1)));
+ do
+     CUDA_VISIBLE_DEVICES=$i python inference_mdp_self_reflection.py \
+         --group_id $i \
+         --group_num $gpu_num \
+         --model_path "Fr0zencr4nE/UniCoT-7B-MoT" \
+         --data_path "./test_prompts.txt" \
+         --outdir "./results" \
+         --cfg_text_scale 4 > process_log_$i.log 2>&1 &
+ done
+
+ wait
+ echo "All background processes finished."
+ ```
+
+ ## Preliminary Results
+ ### Qualitative Results for Image Generation
+ <p align="left">
+ <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/qualitative_results_generation.png" width="800"/>
+ </p>
+
+ ### Qualitative Results for Image Editing
+ <p align="left">
+ <img src="https://github.com/Fr0zenCrane/UniCoT/raw/main/assets/qualitative_results_editing.png" width="800"/>
+ </p>
+
+ ### Quantitative Results on WISE
+ We first conduct experiments on the [WISE](https://github.com/PKU-YuanGroup/WISE) dataset to evaluate the reasoning capabilities of our method. As shown in the table below, our model achieves state-of-the-art (SOTA) performance among existing open-source unified models. Our results are averaged over five independent runs to ensure robustness and reliability.
+
+ | | Culture↑ | Time↑ | Space↑ | Biology↑ | Physics↑ | Chemistry↑ | Overall↑ |
+ |---------------|----------|---------|---------|----------|----------|------------|----------|
+ | Janus | 0.16 | 0.26 | 0.35 | 0.28 | 0.30 | 0.14 | 0.23 |
+ | MetaQuery | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 |
+ | Bagel-Think | 0.76 | 0.69 | 0.75 | 0.65 | 0.75 | 0.58 | 0.70 |
+ | **Uni-CoT** | **0.76**±0.009 | **0.70**±0.0256 | **0.76**±0.006 | **0.73**±0.021 | **0.81**±0.018 | **0.73**±0.020 | **0.75**±0.013 |
+ | *GPT-4o* | *0.81* | *0.71* | *0.89* | *0.83* | *0.79* | *0.74* | *0.80* |
+
+ Furthermore, we apply our self-check mechanism to the images generated by the original Bagel model in think mode, aiming to evaluate our method’s ability to calibrate erroneous outputs. The results in the table below demonstrate that our model effectively refines the imperfect outputs generated by Bagel.
+
+ | | Culture↑ | Time↑ | Space↑ | Biology↑ | Physics↑ | Chemistry↑ | Overall↑ |
+ |---------------|----------|---------|---------|----------|----------|------------|----------|
+ | Bagel-Think | 0.76 | 0.69 | 0.75 | 0.65 | 0.75 | 0.58 | 0.70 |
+ | Bagel-Think+Uni-CoT | 0.75 | 0.70 | 0.75 | 0.71 | 0.74 | 0.69 | 0.73 |
+ | **Uni-CoT** | **0.76**±0.009 | **0.70**±0.0256 | **0.76**±0.006 | **0.73**±0.021 | **0.81**±0.018 | **0.73**±0.020 | **0.75**±0.013 |
+ | *GPT-4o* | *0.81* | *0.71* | *0.89* | *0.83* | *0.79* | *0.74* | *0.80* |
+
+ ### Quantitative Results on [KRIS Bench](https://github.com/mercurystraw/Kris_Bench)
+ We also achieve state-of-the-art (SOTA) performance on the KRIS benchmark, even surpassing the closed-source model Gemini 2.0.
+
+ | Model | Attribute Perception | Spatial Perception | Temporal Perception | Factual Avg | Social Science | Natural Science | Conceptual Avg | Logical Reasoning | Instruction Decomposition | Procedural Avg | Overall Score |
+ |----------------|----------------------|---------------------|----------------------|-------------|----------------|------------------|----------------|--------------------|-----------------------------|----------------|----------------|
+ | Gemini 2.0 (Google) | 66.33 | 63.33 | 63.92 | 65.26 | 68.19 | 56.94 | 59.65 | 54.13 | 71.67 | 62.90 | 62.41 |
+ | Step 3∅ vision (StepFun) | 69.67 | 61.08 | 63.25 | 66.70 | 66.88 | 60.88 | 62.32 | 49.06 | 54.92 | 51.99 | 61.43 |
+ | Doubao (ByteDance) | 70.92 | 59.17 | 40.58 | 63.30 | 65.50 | 61.19 | 62.23 | 47.75 | 60.58 | 54.17 | 60.70 |
+ | BAGEL (ByteDance) | 64.27 | 62.42 | 42.45 | 60.26 | 55.40 | 56.01 | 55.86 | 52.54 | 50.56 | 51.69 | 56.21 |
+ | BAGEL-Think (ByteDance) | 67.42 | 68.33 | 58.67 | 66.18 | 63.55 | 61.40 | 61.92 | 48.12 | 50.22 | 49.02 | 60.18 |
+ | **Uni-CoT** | **72.76** | **72.87** | **67.10** | **71.85** | **70.81** | **66.00** | **67.16** | **53.43** | **73.93** | **63.68** | **68.00** |
+ | *GPT-4o* (OpenAI) | *83.17* | *79.08* | *68.25* | *79.80* | *85.50* | *80.06* | *81.37* | *71.56* | *85.08* | *78.32* | *80.09* |
+
+ ## Citation
+
+ If you find Uni-CoT useful for your research, please consider citing the paper:
+
+ ```bibtex
+ @misc{qin2025unicot,
+       title={Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision},
+       author={Luozheng Qin and Jia Gong and Yuqing Sun and Tianjiao Li and Mengping Yang and Xiaomeng Yang and Chao Qu and Zhiyu Tan and Hao Li},
+       year={2025},
+       eprint={2508.05606},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2508.05606},
+ }
+ ```
+
+ ## Acknowledgement
+
+ We acknowledge the contributions of the following projects that inspired and supported Uni-CoT:
+ - [Bagel](https://github.com/ByteDance-Seed/Bagel) proposed by the ByteDance-Seed team.
+ - [WISE](https://github.com/PKU-YuanGroup/WISE) proposed by PKU-YuanGroup.
+ - [KRIS-Bench](https://github.com/mercurystraw/Kris_Bench) proposed by StepFun.
+ - [RISE-Bench](https://github.com/PhoenixZ810/RISEBench) proposed by Shanghai AI Lab.