RaCig: A RAG-based Character-Consistent Story Image Generation Model

1. Multi-character image generation with rich motion

Teaser Image

2. Model structure preview

Model Structure

πŸ“– Overview

RaCig generates images from textual prompts and character reference images (referred to as "Characters"). It leverages several models and techniques, including:

  • Text-to-image retrieval (using CLIP)
  • IP-Adapter for incorporating reference image features (face and body/clothes)
  • ControlNet for pose/skeleton guidance
  • Action Direction DINO for action direction recognition
  • A pipeline (RaCigPipeline) to orchestrate the generation process.

The pipeline can handle multiple characters in a single scene, each defined by a name, gender, and reference images (face and clothes).
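As a sketch of how such a character specification might be assembled (the `Character` structure and its field names here are illustrative assumptions, not the repository's actual API):

```python
from dataclasses import dataclass

# Hypothetical character spec: field names are assumptions for
# illustration, not RaCig's actual API.
@dataclass
class Character:
    name: str
    gender: str
    face_ref: str     # path to the face reference image
    clothes_ref: str  # path to the body/clothes reference image

# A single scene can mix several characters, each with its own references:
scene = {
    "prompt": "two friends sparring in a dojo",
    "characters": [
        Character("Aya", "female", "refs/aya_face.png", "refs/aya_clothes.png"),
        Character("Ben", "male", "refs/ben_face.png", "refs/ben_clothes.png"),
    ],
}
print([c.name for c in scene["characters"]])  # ['Aya', 'Ben']
```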

πŸ“¦ Installation

  1. Clone the repository:

    git clone https://github.com/ZulutionAI/RaCig.git
    cd RaCig
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download necessary models and retrieval datasets:

    Models: https://huggingface.co/ZuluVision/RaCig

    Place the downloaded models under `./models/` as follows:

    ./models/
    β”œβ”€β”€ action_direction_dino/
    β”‚   └── checkpoint_best_regular.pth
    β”œβ”€β”€ controlnet/
    β”‚   └── model.safetensors
    β”œβ”€β”€ image_encoder/
    β”‚   β”œβ”€β”€ config.json
    β”‚   β”œβ”€β”€ model.safetensors
    β”‚   └── pytorch_model.bin
    β”œβ”€β”€ ipa_weights/
    β”‚   β”œβ”€β”€ ip-adapter-plus-face_sdxl_vit-h.bin
    β”‚   └── ip-adapter-plus_sdxl_vit-h.bin
    └── sdxl/
        └── dreamshaper.safetensors
    

    Retrieval datasets: https://huggingface.co/datasets/ZuluVision/RaCig-Data

    Place the datasets under `./data` as follows:

    ./data
    β”œβ”€β”€ MSDBv2_v7
    β”œβ”€β”€ Reelshot_retrieval
    └── retrieve_info
    

πŸ’» Usage

Inference

  1. Run Inference:
    python inference.py
    
  2. Generated images, retrieved images, and skeleton visualizations are saved to the output/ directory by default.

Gradio

python run_gradio.py

For more detailed instructions, see Gradio Interface Instructions (EN) or Gradio Interface Instructions (δΈ­ζ–‡).

πŸ› οΈ Training

  1. We train only the ControlNet, so that it better recognizes the feature map. (After IP information is injected, the fused feature map makes it hard for the ControlNet to constrain the pose, so we lightly finetune it.)

  2. We finetune it on the retrieval dataset, organized as shown above.

bash train.sh
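A minimal sketch of the parameter-freezing setup described above, training only the ControlNet while the base model stays frozen (illustrative: the module names and toy layers are assumptions, and the actual training loop is driven by `train.sh`):

```python
import torch.nn as nn

# Toy stand-in for the full model: a frozen SDXL UNet plus a trainable
# ControlNet. The real modules are diffusion networks, not Linear layers.
model = nn.ModuleDict({
    "unet": nn.Linear(8, 8),        # stands in for the frozen base model
    "controlnet": nn.Linear(8, 8),  # the only part we finetune
})

# Freeze everything, then re-enable gradients for ControlNet parameters only.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("controlnet")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the controlnet weight and bias remain trainable
```

An optimizer built from only the trainable parameters (e.g. `filter(lambda p: p.requires_grad, model.parameters())`) then updates just the ControlNet.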

🀝 Contributing

❀️ Acknowledgements

This project is based on the work of the following open-source projects and contributors:
