yyyyyxie committed on
Commit
a48fe64
·
verified ·
1 Parent(s): fbacf34

Update/Upload model card for LoRA

  1. README.md +261 -3
README.md CHANGED
@@ -1,3 +1,261 @@
1
- ---
2
- license: cc-by-nc-2.0
3
- ---
1
+ ---
2
+ license: cc-by-nc-2.0
3
+ tags:
4
+ - scene-text-synthesis
5
+ - multilingual
6
+ - diffusion
7
+ - dit
8
+ - ocr-free
9
+ - textflux
10
+ - flux
11
+ # - text-to-image
12
+ # - generated_image_text
13
+ library_name: diffusers
14
+ pipeline_tag: text-to-image
15
+ base_model:
16
+ - black-forest-labs/FLUX.1-Fill-dev
17
+ ---
18
+
19
+ # TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis
20
+
21
+ <div style="display: flex; justify-content: center; align-items: center;">
22
+ <a href="https://arxiv.org/abs/2505.17778">
23
+ <img src='https://img.shields.io/badge/arXiv-2505.17778-red?style=flat&logo=arXiv&logoColor=red' alt='arxiv'>
24
+ </a>
25
+ <a href='https://huggingface.co/yyyyyxie/textflux'>
26
+ <img src='https://img.shields.io/badge/Hugging Face-ckpts-orange?style=flat&logo=HuggingFace&logoColor=orange' alt='huggingface'>
27
+ </a>
28
+ <a href="https://github.com/yyyyyxie/textflux">
29
+ <img src='https://img.shields.io/badge/GitHub-Repo-blue?style=flat&logo=GitHub' alt='GitHub'>
30
+ </a>
31
+ <a href="https://huggingface.co/yyyyyxie/textflux" style="margin: 0 2px;">
32
+ <img src='https://img.shields.io/badge/Demo-Gradio-gold?style=flat&logo=Gradio&logoColor=red' alt='Demo'>
33
+ </a>
34
+ <a href='https://yyyyyxie.github.io/textflux-site/'>
35
+ <img src='https://img.shields.io/badge/Webpage-Project-silver?style=flat&logo=&logoColor=orange' alt='webpage'>
36
+ </a>
37
+ <a href="https://modelscope.cn/models/xieyu20001003/textflux">
38
+ <img src="https://img.shields.io/badge/🤖_ModelScope-ckpts-ffbd45.svg" alt="ModelScope">
39
+ </a>
40
+ </div>
41
+ <p align="left">
42
+ <strong>English</strong> | <a href="./README_CN.md"><strong>中文简体</strong></a>
43
+ </p>
44
+
45
+ **TextFlux** is an **OCR-free framework** using a Diffusion Transformer (DiT, based on [FLUX.1-Fill-dev](https://github.com/black-forest-labs/flux)) for high-fidelity multilingual scene text synthesis. It simplifies the learning task by providing direct visual glyph guidance through spatial concatenation of rendered glyphs with the scene image, enabling the model to focus on contextual reasoning and visual fusion.
46
+
47
+ ## Key Features
48
+
49
+ * **OCR-Free:** Simplified architecture without OCR encoders.
50
+ * **High-Fidelity & Contextual Styles:** Precise rendering, stylistically consistent with scenes.
51
+ * **Multilingual & Low-Resource:** Strong performance across languages, adapts to new languages with minimal data (e.g., <1,000 samples).
52
+ * **Zero-Shot Generalization:** Renders characters unseen during training.
53
+ * **Controllable Multi-Line Text:** Flexible multi-line synthesis with line-level control.
54
+ * **Data Efficient:** Uses a fraction of data (e.g., ~1%) compared to other methods.
55
+
56
+ <div align="center">
57
+ <img src="https://image-transfer-season.oss-cn-qingdao.aliyuncs.com/pictures/abstract_fig.png" width="100%" height="100%"/>
58
+ </div>
59
+
60
+
61
+ ## Updates
62
+
63
+ - **`2025/08/02`**: Our full-parameter [**TextFlux-beta**](https://huggingface.co/yyyyyxie/textflux-beta) and [**TextFlux-LoRA-beta**](https://huggingface.co/yyyyyxie/textflux-lora-beta) weights are now available! They improve single-line text generation accuracy by **10.9** and **11.2** percentage points, respectively 👋!
64
+ - **`2025/08/02`**: Our [**Training Datasets**](https://huggingface.co/datasets/yyyyyxie/textflux-anyword) and [**Testing Datasets**](https://huggingface.co/datasets/yyyyyxie/textflux-test-datasets) are now available 👋!
65
+ - **`2025/08/01`**: Our [**Eval Scripts**](https://huggingface.co/yyyyyxie/textflux) are now available 👋!
66
+ - **`2025/05/27`**: Our [**Full-Param Weights**](https://huggingface.co/yyyyyxie/textflux) and [**LoRA Weights**](https://huggingface.co/yyyyyxie/textflux-lora) are now available 👋!
67
+ - **`2025/05/25`**: Our [**Paper on ArXiv**](https://arxiv.org/abs/2505.17778) is available 👋!
68
+
69
+
70
+
71
+ ## TextFlux-beta
72
+
73
+ We are excited to release [**TextFlux-beta**](https://huggingface.co/yyyyyxie/textflux-beta) and [**TextFlux-LoRA-beta**](https://huggingface.co/yyyyyxie/textflux-lora-beta), new versions of our model specifically optimized for single-line text editing.
74
+
75
+ ### Key Advantages
76
+
77
+ - **Significantly improves the quality** of single-line text rendering.
78
+ - **Increases inference speed** for single-line text by approximately **1.4x**.
79
+ - **Dramatically enhances the accuracy** of small text synthesis.
80
+
81
+ ### How It Works
82
+
83
+ Considering that single-line editing is a primary use case for many users and generally yields more stable, high-quality results, we have released new weights optimized for this scenario.
84
+
85
+ Unlike the original model, which renders glyphs onto a full-size mask, the beta version uses a **single-line image strip** as the glyph condition. This reduces unnecessary computational overhead and provides a more stable, higher-quality supervisory signal, which leads directly to the improvements in both single-line and small-text rendering (see the example [here](https://github.com/yyyyyxie/textflux/blob/main/resource/demo_singleline.png)).
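To make the reduced overhead concrete, the following back-of-the-envelope sketch counts transformer tokens for both glyph-condition layouts. It assumes the glyph condition is stacked vertically on the scene image, that each 16x16 pixel patch becomes one token (8x VAE downsampling followed by 2x2 latent packing), and a hypothetical 64-pixel strip height. None of these values are taken from the TextFlux code, so treat the result as an illustration only.

```python
# Illustrative only: layout, patch size, and strip height are assumptions,
# not values taken from the TextFlux repository.
PIXELS_PER_TOKEN_SIDE = 16  # assumed: 8x VAE downsampling, then 2x2 latent packing


def token_count(height: int, width: int) -> int:
    """Number of image tokens for a height x width input (sizes divisible by 16)."""
    return (height // PIXELS_PER_TOKEN_SIDE) * (width // PIXELS_PER_TOKEN_SIDE)


def concat_tokens(scene_h: int, scene_w: int, glyph_h: int) -> int:
    """Tokens after stacking a glyph condition of height glyph_h onto the scene."""
    return token_count(scene_h + glyph_h, scene_w)


scene_h, scene_w = 512, 512  # hypothetical scene size
full_mask = concat_tokens(scene_h, scene_w, glyph_h=scene_h)  # full-size glyph mask
strip = concat_tokens(scene_h, scene_w, glyph_h=64)           # single-line strip

print(full_mask, strip, round(full_mask / strip, 2))
```

Fewer tokens mean less attention and MLP work per denoising step. The measured speedup depends on real input sizes and hardware, so this toy count is not expected to reproduce it.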
86
+
87
+
88
+ To use these new models, please refer to the updated files: `demo.py`, `run_inference.py`, and `run_inference_lora.py`. While the beta models retain the ability to generate multi-line text, we **highly recommend** using them for single-line tasks to achieve the best performance and stability.
89
+
90
+ ### Performance
91
+
92
+ The table below shows that the beta models improve single-line text editing accuracy (SeqAcc-Editing) by approximately **11 points** and run about **1.4 times** faster than the previous versions. The [**AMO Sampler**](https://github.com/hxixixh/amo-release) contributed approximately 3 points of this gain. The test dataset is [**ReCTS editing**](https://huggingface.co/datasets/yyyyyxie/textflux-test-datasets).
93
+
94
+ | Method | SeqAcc-Editing (%)↑ | NED (%)↑ | FID ↓ | LPIPS ↓ | Inference Speed (s/img)↓ |
95
+ | ------------------ | :-----------------: | :------: | :------: | :-------: | :----------------------: |
96
+ | TextFlux-LoRA | 37.2 | 58.2 | 4.93 | 0.063 | 16.8 |
97
+ | TextFlux | 40.6 | 60.7 | 4.84 | 0.062 | 15.6 |
98
+ | TextFlux-LoRA-beta | 48.4 | 70.5 | 4.69 | 0.062 | 12.0 |
99
+ | TextFlux-beta | **51.5** | **72.9** | **4.59** | **0.061** | **10.9** |
100
+
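The headline numbers can be re-derived from the table. The short sketch below copies the SeqAcc-Editing and inference-time columns into a dictionary (the variable and function names are illustrative) and computes the gain in percentage points and the speedup factor of each beta model over its predecessor.

```python
# Values copied from the table above: (SeqAcc-Editing %, seconds per image).
results = {
    "TextFlux-LoRA": (37.2, 16.8),
    "TextFlux": (40.6, 15.6),
    "TextFlux-LoRA-beta": (48.4, 12.0),
    "TextFlux-beta": (51.5, 10.9),
}


def compare(old: str, new: str) -> tuple[float, float]:
    """Return (accuracy gain in percentage points, speedup factor) of new over old."""
    old_acc, old_time = results[old]
    new_acc, new_time = results[new]
    return round(new_acc - old_acc, 1), round(old_time / new_time, 2)


for old, new in [("TextFlux", "TextFlux-beta"), ("TextFlux-LoRA", "TextFlux-LoRA-beta")]:
    gain, speedup = compare(old, new)
    print(f"{new}: +{gain} points, {speedup}x faster than {old}")
```

The gains are differences between two percentages, so they are percentage points rather than relative improvements.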
101
+
102
+
103
+ ## Setup
104
+
105
+ 1. **Clone/Download:** Get the necessary code and model weights.
106
+
107
+
108
+ 2. **Dependencies:**
109
+
110
+ ```bash
111
+ git clone https://github.com/yyyyyxie/textflux.git
112
+ cd textflux
113
+ conda create -n textflux python==3.11.4 -y
114
+ conda activate textflux
115
+ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
116
+
117
+ pip install -r requirements.txt
118
+ cd diffusers
119
+ pip install -e .
120
+ # Ensure gradio == 3.50.1
121
+ ```
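Because the demo expects a specific Gradio release, it can help to verify the environment before launching anything. The snippet below is a small, optional check built on the standard library's `importlib.metadata`; the helper name `check_pin` is hypothetical and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package: str, expected: str) -> str:
    """Describe whether `package` is installed at exactly the `expected` version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package}: not installed (expected {expected})"
    if installed == expected:
        return f"{package}: OK ({installed})"
    return f"{package}: found {installed}, expected {expected}"


# The pinned version comes from the comment in the setup commands above.
print(check_pin("gradio", "3.50.1"))
```

Run it inside the activated `textflux` environment, since it reports on whichever interpreter executes it.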
122
+
123
+
124
+
125
+ ## Gradio Demo
126
+
127
+ The demo provides a "Custom Mode" (upload a scene image, draw masks, and enter text for automatic template generation) and a "Normal Mode" (for pre-combined inputs).
128
+
129
+ ```bash
130
+ # Ensure gradio == 3.50.1
131
+ python demo.py
132
+ ```
133
+
134
+
135
+
136
+ ## Training
137
+
138
+ This guide provides instructions for training and fine-tuning the **TextFlux** models.
139
+
140
+ -----
141
+
142
+ ### Multi-line Training (Reproducing Paper Results)
143
+
144
+ Follow these steps to reproduce the multi-line text generation results from the original paper.
145
+
146
+ 1. **Prepare the Dataset**
147
+ Download the [**Multi-line**](https://huggingface.co/datasets/yyyyyxie/textflux-multi-line) dataset and organize it using the following directory structure:
148
+
149
+ ```
150
+ |- ./datasets
151
+ |- multi-lingual
152
+ | |- processed_mlt2017
153
+ | |- processed_ReCTS_train_images
154
+ | |- processed_totaltext
155
+ | ....
156
+ ```
157
+
158
+ 2. **Run the Training Script**
159
+ Execute the appropriate training script. The `train.sh` script is for standard training, while `train_lora.sh` is for training with LoRA.
160
+
161
+ ```bash
162
+ # For standard training
163
+ bash scripts/train.sh
164
+ ```
165
+
166
+ or
167
+
168
+ ```bash
169
+ # For LoRA training
170
+ bash scripts/train_lora.sh
171
+ ```
172
+
173
+ *Note: Ensure you are using the commands and configurations within the script designated for **multi-line** training.*
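Before launching a long training run, it may be worth confirming that the dataset landed where the tree above says it should. The following sketch checks for the three subdirectories named explicitly in that tree. The function name is hypothetical, the elided `....` entries are left unchecked, and the training scripts may read their paths from their own configuration.

```python
from pathlib import Path

# Subdirectories listed explicitly in the directory tree above.
EXPECTED_MULTILINE = [
    "multi-lingual/processed_mlt2017",
    "multi-lingual/processed_ReCTS_train_images",
    "multi-lingual/processed_totaltext",
]


def missing_dirs(root: str, expected: list[str]) -> list[str]:
    """Return the expected subdirectories that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("./datasets", EXPECTED_MULTILINE)
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

The same helper can be pointed at a different list of folders to check other dataset layouts.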
174
+
175
+ -----
176
+
177
+
178
+ ### Single-line Training
179
+
180
+ To create our TextFlux-beta weights optimized for the single-line task, we fine-tuned our pre-trained multi-line models. Specifically, we loaded the weights from the [**TextFlux**](https://huggingface.co/yyyyyxie/textflux) and [**TextFlux-LoRA**](https://huggingface.co/yyyyyxie/textflux-lora) models and continued training for an additional 10,000 steps on a single-line dataset.
181
+
182
+ If you wish to replicate this process, you can follow these steps:
183
+
184
+ 1. **Prepare the Dataset**
185
+ First, download the [**Single-line**](https://huggingface.co/datasets/yyyyyxie/textflux-anyword) dataset and arrange it as follows:
186
+
187
+ ```
188
+ |- ./datasets
189
+ |- anyword
190
+ | |- ReCTS
191
+ | |- TotalText
192
+ | |- ArT
193
+ | ...
194
+ ....
195
+ ```
196
+
197
+ 2. **Run the Fine-tuning Script**
198
+ Ensure your script is configured to load the weights from a pre-trained multi-line model, and then execute the fine-tuning command.
199
+
200
+ ```bash
201
+ # For standard fine-tuning
202
+ bash scripts/train.sh
203
+ ```
204
+
205
+ or
206
+
207
+ ```bash
208
+ # For LoRA fine-tuning
209
+ bash scripts/train_lora.sh
210
+ ```
211
+
212
+
213
+
214
+ ## Evaluation
215
+
216
+ First, use the `scripts/batch_eval.sh` script to perform batch inference on the images in the test set.
217
+
218
+ ```bash
219
+ bash scripts/batch_eval.sh
220
+ ```
221
+
222
+ Once inference is complete, use `eval/eval_ocr.sh` to evaluate the OCR accuracy and `eval/eval_fid_lpips.sh` to evaluate FID and LPIPS scores.
223
+
224
+ ```bash
225
+ bash eval/eval_ocr.sh
226
+ ```
227
+
228
+ ```bash
229
+ bash eval/eval_fid_lpips.sh
230
+ ```
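For readers unfamiliar with the OCR metrics in the performance table, here is a minimal pure-Python sketch of sequence accuracy and normalized edit distance (NED). It assumes SeqAcc is the share of exact string matches and that NED is reported as `1 - distance / max(len)`, so that higher is better, consistent with the arrows in the table. The repository's `eval/eval_ocr.sh` may differ in normalization, case handling, or the OCR engine used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned_similarity(pred: str, target: str) -> float:
    """Assumed definition: 1 - distance / max length (1.0 means identical)."""
    longest = max(len(pred), len(target))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, target) / longest


def evaluate(preds: list[str], targets: list[str]) -> tuple[float, float]:
    """Return (sequence accuracy, mean NED similarity), both in percent."""
    pairs = list(zip(preds, targets))
    seq_acc = sum(p == t for p, t in pairs) / len(pairs)
    ned = sum(ned_similarity(p, t) for p, t in pairs) / len(pairs)
    return 100 * seq_acc, 100 * ned


# Hypothetical OCR outputs versus ground-truth strings.
print(evaluate(["HELLO", "W0RLD"], ["HELLO", "WORLD"]))
```

Sequence accuracy only rewards perfect strings, while NED gives partial credit for near misses, which is why the NED column is consistently higher than SeqAcc in the table.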
231
+
232
+
233
+
234
+ ## TODO
235
+
236
+ - [x] Release the training datasets and testing datasets
237
+ - [x] Release the training scripts
238
+ - [x] Release the eval scripts
239
+ - [ ] Support ComfyUI
240
+
241
+
242
+
243
+ ## Acknowledgement
244
+
245
+ Our code is adapted from [Diffusers](https://github.com/huggingface/diffusers). We adopt [FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev) as the base model. Thanks to all the contributors for the helpful discussions! We also sincerely thank the authors of the following code repositories for their valuable work: [AnyText](https://github.com/tyxsspa/AnyText), [AMO](https://github.com/hxixixh/amo-release).
246
+
247
+
248
+
249
+ ## Citation
250
+
251
+ ```bibtex
252
+ @misc{xie2025textfluxocrfreeditmodel,
253
+ title={TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis},
254
+ author={Yu Xie and Jielei Zhang and Pengyu Chen and Ziyue Wang and Weihang Wang and Longwen Gao and Peiyi Li and Huyang Sun and Qiang Zhang and Qian Qiao and Jiaqing Fan and Zhouhui Lian},
255
+ year={2025},
256
+ eprint={2505.17778},
257
+ archivePrefix={arXiv},
258
+ primaryClass={cs.CV},
259
+ url={https://arxiv.org/abs/2505.17778},
260
+ }
261
+ ```