---
license: cc-by-nc-2.0  # or another license of your choice, e.g. mit, cc-by-sa-4.0
tags:
- scene-text-synthesis
- multilingual
- diffusion
- dit
- ocr-free
- textflux
- flux  # if your model is based on FLUX
# - text-to-image  # a general computer-vision tag
# - generated_image_text  # a more specific tag
library_name: diffusers  # because Diffusers is mentioned
pipeline_tag: text-to-image  # or a more specific task tag
base_model:
- black-forest-labs/FLUX.1-Fill-dev
# datasets:  # optionally list the main training datasets, even if they are not yet public
# - your-custom-training-dataset-name
# metrics:  # if you have evaluation metrics
# - fid
# - ocr_accuracy
# model-index:  # helps Hugging Face index the model and its results
# - name: TextFlux  # your model name
#   results:
#   - task:
#       type: text-to-image  # task type
#       name: Scene Text Synthesis  # specific task name
#     dataset:  # dataset used for evaluation
#       name: your-evaluation-dataset
#       type: scene_text_images
#     metrics:  # evaluation metrics
#     - name: OCR Accuracy
#       value: 90.5  # example
#       type: ocr_accuracy
#     - name: FID
#       value: 30.2  # example
#       type: fid
---

# TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

**TextFlux** is an **OCR-free framework** using a Diffusion Transformer (DiT, based on [FLUX.1-Fill-dev](https://github.com/black-forest-labs/flux)) for high-fidelity multilingual scene text synthesis. It simplifies the learning task by providing direct visual glyph guidance through spatial concatenation of rendered glyphs with the scene image, enabling the model to focus on contextual reasoning and visual fusion.

## Key Features

* **OCR-Free:** Simplified architecture without OCR encoders.
* **High-Fidelity & Contextual Styles:** Precise rendering, stylistically consistent with scenes.
* **Multilingual & Low-Resource:** Strong performance across languages, adapts to new languages with minimal data (e.g., <1,000 samples).
* **Zero-Shot Generalization:** Renders characters unseen during training.
* **Controllable Multi-Line Text:** Flexible multi-line synthesis with line-level control.
* **Data Efficient:** Uses a fraction of data (e.g., ~1%) compared to other methods.
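
The spatial concatenation of rendered glyphs with the scene image, described in the introduction, can be illustrated with a minimal Pillow sketch. This is not the repository's implementation: the function names `render_glyph_template` and `build_concat_input`, the vertical layout (glyph template on top, scene below), the white-on-black glyph colors, and the default font are all assumptions made for illustration.

```python
from PIL import Image, ImageDraw, ImageFont


def render_glyph_template(size, boxes, texts, font=None):
    """Draw each text line in white on a black canvas at its target box.

    Hypothetical helper: the real template layout may differ.
    """
    font = font or ImageFont.load_default()  # assumption: any font covering the text
    template = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(template)
    for (left, top, _right, _bottom), text in zip(boxes, texts):
        draw.text((left, top), text, fill="white", font=font)
    return template


def build_concat_input(scene, boxes, texts, font=None):
    """Stack the glyph template above the scene and build a matching mask.

    Returns (image, mask), both of size (width, 2 * height). The mask is
    white only inside the text boxes of the scene half, so an inpainting
    model would repaint those regions while seeing the glyphs as context.
    """
    width, height = scene.size
    template = render_glyph_template(scene.size, boxes, texts, font)

    # Mask for the scene half: white where text should be synthesized.
    scene_mask = Image.new("L", scene.size, 0)
    mask_draw = ImageDraw.Draw(scene_mask)
    for box in boxes:
        mask_draw.rectangle(box, fill=255)

    # Assumed layout: template on top, scene at the bottom.
    image = Image.new("RGB", (width, 2 * height))
    image.paste(template, (0, 0))
    image.paste(scene, (0, height))

    # The template half of the mask stays black (kept untouched).
    mask = Image.new("L", (width, 2 * height), 0)
    mask.paste(scene_mask, (0, height))
    return image, mask
```

Under these assumptions, the resulting image and mask pair resembles what an inpainting pipeline such as the FLUX.1-Fill-dev base model consumes. The demo's "Custom Mode" generates and concatenates the template automatically, so this sketch only shows the idea.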

## Updates

- **`2025/05/27`**: Our [**Full-Param Weights**](https://huggingface.co/yyyyyxie/textflux) and [**LoRA Weights**](https://huggingface.co/yyyyyxie/textflux-lora) are now available 🤗!
- **`2025/05/25`**: Our [**Paper on ArXiv**](https://arxiv.org/abs/2505.17778) is available 🥳!

## Setup

1. **Clone/Download:** Get the necessary code and model weights.
2. **Dependencies:**

   ```bash
   conda create -n textflux python==3.11.4 -y
   conda activate textflux
   pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install -r requirements.txt
   # Ensure diffusers >= 0.32.1
   ```

## Gradio Demo

The demo provides a "Normal Mode" (for pre-combined inputs) and a "Custom Mode" (upload a scene, draw masks, and input text for automatic template generation and concatenation).

```bash
python demo.py
```

## Acknowledgement

Our code is modified based on [Diffusers](https://github.com/huggingface/diffusers). We adopt [black-forest-labs/FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev) as the base model. Thanks to all the contributors for the helpful discussions!

## License

The use of this model, TextFlux, is governed by the **FLUX.1 [dev] Non-Commercial License Agreement** (or the specific version applicable to FLUX.1-Fill-dev, upon which TextFlux is based).

## Citation

```bibtex
@misc{xie2025textfluxocrfreeditmodel,
      title={TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis},
      author={Yu Xie and Jielei Zhang and Pengyu Chen and Ziyue Wang and Weihang Wang and Longwen Gao and Peiyi Li and Huyang Sun and Qiang Zhang and Qian Qiao and Jiaqing Fan and Zhouhui Lian},
      year={2025},
      eprint={2505.17778},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17778},
}
```