---
license: apache-2.0
language:
- en
base_model:
- stabilityai/stable-diffusion-xl-base-1.0
library_name: diffusers
---
<div align="center">
<h1>RaCig: A RAG-based Character-Consistent Story Image Generation Model</h1>
<a href='https://huggingface.co/ZuluVision/RaCig'><img src='https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue'></a>
<a href='https://pan.baidu.com/s/1Vt2meAg5DkjUXktY_H6eNg?pwd=ympj'><img src='https://img.shields.io/badge/Baidu_Netdisk-Dataset-green?logo=baidu'></a>
</div>
### 1. Multi-character image generation with rich motion
<div align="center">
<img src="assets/teaser.png" alt="Teaser Image" width="700"/>
</div>
### 2. Model structure preview
<div align="center">
<img src="assets/model_structure.png" alt="Model Structure" width="700"/>
</div>
## Overview
RaCig generates images from text prompts and reference images of the characters in the story (referred to as "Characters"). It combines several models and techniques:
* Text-to-image retrieval (using CLIP)
* IP-Adapter for incorporating reference image features (face and body/clothes)
* ControlNet for pose/skeleton guidance
* Action Direction DINO for action direction recognition
* A pipeline (`RaCigPipeline`) to orchestrate the generation process.
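
The retrieval step in the list above can be illustrated with a minimal sketch. It ranks gallery images by cosine similarity between a text embedding and image embeddings. The toy 3-dimensional vectors, the file names, and the helper `retrieve_top_k` are assumptions for illustration; in practice the embeddings would come from CLIP, and the actual RaCig retrieval code may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, image_embeddings, k=1):
    """Return the k image ids most similar to the query (hypothetical helper)."""
    scored = [
        (cosine_similarity(query_embedding, emb), image_id)
        for image_id, emb in image_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:k]]


# Toy vectors standing in for CLIP image embeddings (assumed values).
gallery = {
    "running.png": [0.9, 0.1, 0.0],
    "sitting.png": [0.0, 1.0, 0.1],
    "waving.png": [0.1, 0.2, 0.9],
}
# Toy vector standing in for the CLIP embedding of a text prompt.
text_query = [0.8, 0.2, 0.1]

print(retrieve_top_k(text_query, gallery, k=2))
```
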
The pipeline can handle multiple Characters in a single scene, each defined by a name, a gender, and reference images (face and clothes).
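
As a rough illustration of the per-character inputs described above, the sketch below groups a name, a gender, and the two reference images into one record. The class `CharacterSpec`, the helper `characters_in_prompt`, the name-matching rule, and the file paths are all hypothetical; they are not taken from the RaCig code base.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CharacterSpec:
    """Hypothetical container for one character; not part of the RaCig code."""
    name: str
    gender: str
    face_image: Path     # reference image for the face
    clothes_image: Path  # reference image for body/clothes


def characters_in_prompt(prompt, characters):
    """Return the characters whose names occur in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [c for c in characters if c.name.lower() in lowered]


# Example cast with assumed reference image paths.
cast = [
    CharacterSpec("Alice", "female", Path("refs/alice_face.png"), Path("refs/alice_clothes.png")),
    CharacterSpec("Bob", "male", Path("refs/bob_face.png"), Path("refs/bob_clothes.png")),
]

print([c.name for c in characters_in_prompt("Alice waves at Bob in the park", cast)])
```

Matching names by substring is only one simple way to decide which characters a prompt mentions; it is used here to keep the sketch short.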
## 📦 Installation
1. **Clone the repository:**
```bash
git clone https://github.com/ZulutionAI/RaCig
cd RaCig
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
3. **Download necessary models and retrieval datasets:**
Models: https://huggingface.co/ZuluVision/RaCig
Place the models under `./models/` as follows:
```
./models/
├── action_direction_dino/
│   └── checkpoint_best_regular.pth
├── controlnet/
│   └── model.safetensors
├── image_encoder/
│   ├── config.json
│   ├── model.safetensors
│   └── pytorch_model.bin
├── ipa_weights/
│   ├── ip-adapter-plus-face_sdxl_vit-h.bin
│   └── ip-adapter-plus_sdxl_vit-h.bin
└── sdxl/
    └── dreamshaper.safetensors
```
Retrieval datasets: https://pan.baidu.com/s/1Vt2meAg5DkjUXktY_H6eNg?pwd=ympj
```
./data
├── MSDBv2_v7
├── Reelshot_retrieval
└── retrieve_info
```
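
Before running inference, it can help to confirm that the downloads landed where the directory trees above expect them. The following is a minimal sketch of such a check; the function `find_missing` is a hypothetical helper rather than part of the repository, and the file list is copied from the trees above.

```python
from pathlib import Path

# Expected files, copied from the ./models/ tree above.
EXPECTED_MODEL_FILES = [
    "action_direction_dino/checkpoint_best_regular.pth",
    "controlnet/model.safetensors",
    "image_encoder/config.json",
    "image_encoder/model.safetensors",
    "image_encoder/pytorch_model.bin",
    "ipa_weights/ip-adapter-plus-face_sdxl_vit-h.bin",
    "ipa_weights/ip-adapter-plus_sdxl_vit-h.bin",
    "sdxl/dreamshaper.safetensors",
]
# Expected top-level entries, copied from the ./data tree above.
EXPECTED_DATA_ENTRIES = ["MSDBv2_v7", "Reelshot_retrieval", "retrieve_info"]


def find_missing(models_root="./models", data_root="./data"):
    """Return the expected paths that do not exist yet (hypothetical helper)."""
    expected = [Path(models_root) / rel for rel in EXPECTED_MODEL_FILES]
    expected += [Path(data_root) / name for name in EXPECTED_DATA_ENTRIES]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing entries:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected model and dataset entries are present.")
```

Run it from the repository root so that the relative `./models` and `./data` paths resolve correctly.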
## 💻 Usage
### Inference
1. **Run Inference:**
```bash
python inference.py
```
2. Generated images, retrieved images, and skeleton visualizations will be saved in the `output/` directory by default.
### Gradio
```bash
python run_gradio.py
```
For more detailed instructions, see [Gradio Interface Instructions (EN)](docs/gradio_instruction_en.md) or [Gradio Interface Instructions (中文)](docs/gradio_instruction_cn.md).
## 🛠️ Training
1. We train only the ControlNet so that it handles the fused feature map better. After IP-Adapter information is injected, the fused feature map makes it hard for the ControlNet to constrain the pose, so we lightly fine-tune the ControlNet.
2. We fine-tune it on the retrieval dataset, which is organized as shown in the Installation section.
```bash
bash train.sh
```
## Contributing
## ❤️ Acknowledgements
This project is based on the work of the following open-source projects and contributors:
* [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter) - Image Prompt Adapter developed by Tencent AI Lab
* [xiaohu2015](https://github.com/xiaohu2015)