---
title: "MakeAnything"
emoji: "🤖"
colorFrom: "red"
colorTo: "yellow"
sdk: "gradio"
sdk_version: "3.6"
app_file: gradio_app.py
pinned: false
---

# MakeAnything

<a href="https://arxiv.org/abs/2502.01572"><img src="https://img.shields.io/badge/arXiv-2502.01572-A42C25.svg" alt="arXiv"></a>
<a href="https://huggingface.co/showlab/makeanything"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
<a href="https://huggingface.co/datasets/showlab/makeanything/"><img src="https://img.shields.io/badge/🤗_HuggingFace-Dataset-ffbd45.svg" alt="HuggingFace"></a>

<br>

<img src='./images/teaser.png' width='100%' />

## Configuration

### 1. **Environment setup**
```bash
git clone https://github.com/showlab/MakeAnything.git
cd MakeAnything

conda create -n makeanything python=3.11.10
conda activate makeanything
```

### 2. **Requirements installation**
```bash
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install --upgrade -r requirements.txt

accelerate config
```

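`accelerate config` walks you through an interactive questionnaire. If you prefer to skip the prompts, `accelerate` can also write a default single-machine configuration; this is a convenience, not a requirement of this repo:

```bash
# Write a default accelerate config without the interactive prompts
accelerate config default
```
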
## Asymmetric LoRA

### 1. Weights

You can download the trained checkpoints of the Asymmetric LoRA for inference. Below are the details of the available models:

| **Model** | **Description** | **Resolution** |
|:-:|:-:|:-:|
| [asylora_9f_general](https://huggingface.co/showlab/makeanything/blob/main/asymmetric_lora/asymmetric_lora_9f_general.safetensors) | Asymmetric LoRA fine-tuned on all 9-frame datasets. *Index of lora_up*: `1:LEGO` `2:Cook` `3:Painting` `4:Icon` `5:Landscape illustration` `6:Portrait` `7:Transformer` `8:Sand art` `9:Illustration` `10:Sketch` | 1056,1056 |
| [asylora_4f_general](https://huggingface.co/showlab/makeanything/blob/main/asymmetric_lora/asymmetric_lora_4f_general.safetensors) | Asymmetric LoRA fine-tuned on all 4-frame datasets. *Index of lora_up (1~10 same as 9f)*: `11:Clay toys` `12:Clay sculpture` `13:Zbrush Modeling` `14:Wood sculpture` `15:Ink painting` `16:Pencil sketch` `17:Fabric toys` `18:Oil painting` `19:Jade Carving` `20:Line draw` `21:Emoji` | 1024,1024 |

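As one way to fetch a checkpoint from the command line (assuming `huggingface_hub` is installed; the local directory is your choice):

```bash
# Download the 9-frame Asymmetric LoRA checkpoint into ./checkpoints
huggingface-cli download showlab/makeanything \
  asymmetric_lora/asymmetric_lora_9f_general.safetensors \
  --local-dir ./checkpoints
```
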
### 2. Training

<span id="dataset_setting"></span>

#### 2.1 Settings for dataset

The training process relies on a paired dataset consisting of text captions and images. Each dataset folder contains both `.caption` and `.png` files, where each caption file's name corresponds directly to its image's name. Here is an example of an organized dataset:

```
dataset/
├── portrait_001.png
├── portrait_001.caption
├── portrait_002.png
├── portrait_002.caption
├── lego_001.png
├── lego_001.caption
```

The `.caption` files contain a **single line** of text that serves as the prompt for generating the corresponding image. The prompt **must specify the index of the lora_up** used for that training sample in the Asymmetric LoRA. The format is `--lora_up <index>`, where `<index>` is the index of the B matrix in the Asymmetric LoRA corresponding to the sample's training domain; indices **start from 1**, not 0.

For example, a `.caption` file for a portrait painting sequence might look as follows:

```caption
3*3 of 9 sub-images, step-by-step portrait painting process, 1 girl --lora_up 6
```

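Since every image needs a caption carrying the correct `--lora_up` index, a quick sanity check can help; the following is a minimal sketch assuming the flat `dataset/` layout shown above:

```bash
# Report images without a matching caption, and captions missing --lora_up
for img in dataset/*.png; do
  cap="${img%.png}.caption"
  [ -f "$cap" ] || { echo "missing caption: $img"; continue; }
  grep -q -- "--lora_up" "$cap" || echo "missing --lora_up: $cap"
done
```
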
Then, you should organize your **dataset configuration file**, written in `TOML`. Here is an example:

```toml
[general]
enable_bucket = false

[[datasets]]
resolution = 1056
batch_size = 1

[[datasets.subsets]]
image_dir = '/path/to/dataset/'
caption_extension = '.caption'
num_repeats = 1
```

It is recommended to set the batch size to 1 and the resolution to 1024 (4-frame) or 1056 (9-frame).

#### 2.2 Start training

We have provided a template script for training the Asymmetric LoRA in `scripts/asylora_train.sh`. Simply replace the corresponding paths with yours to start training. Note that `lora_ups_num` in the script is the total number of B matrices in the Asymmetric LoRA, which you specify for training.

```bash
chmod +x scripts/asylora_train.sh
scripts/asylora_train.sh
```

Additionally, if you are **using our dataset for training** directly, note that the `.caption` files in our released dataset do not include the `--lora_up <index>` field. You will need to update the `.caption` files with the appropriate `--lora_up <index>` values before starting training.

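For example, to tag every caption in a hypothetical `dataset/portrait/` folder with the Portrait index from the 9-frame table above:

```bash
# Append " --lora_up 6" (Portrait) to the single caption line of each file
sed -i 's/$/ --lora_up 6/' dataset/portrait/*.caption
```
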
### 3. Inference

We have also provided a template script for Asymmetric LoRA inference in `scripts/asylora_inference.sh`. Once training is done, replace the file paths, fill in your prompt, and run inference. Note that `lora_up_cur` in the script is the index of the B matrix to be used for inference.

```bash
chmod +x scripts/asylora_inference.sh
scripts/asylora_inference.sh
```

## Recraft Model

### 1. Weights

You can download the trained checkpoints of the Recraft model for inference. Below are the details of the available models:

| **Model** | **Description** | **Resolution** |
|:-:|:-:|:-:|
| [recraft_9f_lego](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_9f_lego.safetensors) | Recraft model trained on the `LEGO` dataset. Supports `9-frame` generation. | 1056,1056 |
| [recraft_9f_portrait](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_9f_portrait.safetensors) | Recraft model trained on the `Portrait` dataset. Supports `9-frame` generation. | 1056,1056 |
| [recraft_9f_sketch](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_9f_sketch.safetensors) | Recraft model trained on the `Sketch` dataset. Supports `9-frame` generation. | 1056,1056 |
| [recraft_4f_wood_sculpture](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_4f_wood_sculpture.safetensors) | Recraft model trained on the `Wood sculpture` dataset. Supports `4-frame` generation. | 1024,1024 |

### 2. Training

#### 2.1 Obtain standard LoRA

In the second training phase, image-to-sequence generation with the Recraft model requires a **standard LoRA architecture** that is merged into flux.1 before Recraft training. Therefore, the first step is to decompose the Asymmetric LoRA into the original LoRA format.

To obtain one, you can either **train a standard LoRA directly** (see the optional method below) or use the script template we provide in `scripts/asylora_split.sh` for **splitting the Asymmetric LoRA**. The script extracts the required B matrices from the Asymmetric LoRA model; specifically, `LORA_UP` in the script specifies the index of the B matrix you wish to extract as the standard LoRA.

```bash
chmod +x scripts/asylora_split.sh
scripts/asylora_split.sh
```

#### (Optional) Train standard LoRA

You can also **directly train a standard LoRA** for the Recraft process, eliminating the need to decompose the Asymmetric LoRA. In our project, we have included the standard LoRA training code from [kohya-ss/sd-scripts](https://github.com/kohya-ss/sd-scripts): `flux_train_network.py` for training and `flux_minimal_inference.py` for inference. You can refer to the related documentation for guidance on training.

Alternatively, other training platforms such as [kijai/ComfyUI-FluxTrainer](https://github.com/kijai/ComfyUI-FluxTrainer) are also viable. These platforms provide tools that facilitate training and inference of LoRA models for the Recraft process.

#### 2.2 Merge LoRA to flux.1

Once you have a standard LoRA, use our `scripts/lora_merge.sh` template script to merge it into the flux.1 checkpoint for subsequent Recraft training. Note that the merged model may take up **around 50GB** of disk space.

```bash
chmod +x scripts/lora_merge.sh
scripts/lora_merge.sh
```

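Since the merged checkpoint is large, it is worth confirming free space at your (hypothetical) output location first:

```bash
# The merged flux.1 checkpoint may be ~50GB; check available disk space
df -h /path/to/output/dir
```
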
#### 2.3 Settings for training

The dataset for Recraft training follows the same organization format as the dataset for the Asymmetric LoRA, as described in [Asymmetric LoRA 2.1 Settings for dataset](#dataset_setting). A `TOML` configuration file is also required to organize the dataset. Below is a template for the dataset configuration file:

```toml
[general]
flip_aug = false
color_aug = false
keep_tokens_separator = "|||"
shuffle_caption = false
caption_tag_dropout_rate = 0
caption_extension = ".caption"

[[datasets]]
batch_size = 1
enable_bucket = true
resolution = [1024, 1024]

[[datasets.subsets]]
image_dir = "/path/to/dataset/"
num_repeats = 1
```

Note that for training with 4-frame step sequences, the resolution must be set to `1024`. For training with 9-frame sequences, the resolution should be `1056`.

For the sampling phase of the Recraft training process, we need to prepare two text files: `sample_images.txt` and `sample_prompts.txt`. These files store the sampled condition images and their corresponding prompts, respectively. Below are the templates for both files:

**sample_images.txt**
```txt
/path/to/image_1.png
/path/to/image_2.png
```

**sample_prompts.txt**
```txt
image_1_prompt_content
image_2_prompt_content
```

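One way to assemble the two files, assuming your condition images sit in a hypothetical `conditions/` folder; the prompts are placeholders you then edit by hand, keeping line *i* of the prompts paired with line *i* of the images:

```bash
# One image path per line
ls conditions/*.png > sample_images.txt
# Start from placeholder prompts with the same number of lines, then edit each
sed 's/.*/EDIT_ME: prompt for this image/' sample_images.txt > sample_prompts.txt
```
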
#### 2.4 Recraft training

We have provided a template script for training the Recraft model in `scripts/recraft_train.sh`. Simply replace the corresponding paths with yours to start training. Note that `frame_num` in the script must be `4` (for 1024 resolution) or `9` (for 1056 resolution).

```bash
chmod +x scripts/recraft_train.sh
scripts/recraft_train.sh
```

### 3. Inference

We have also provided a template script for Recraft model inference in `scripts/recraft_inference.sh`. Once training is done, replace the file paths, fill in your prompt, and run inference.

```bash
chmod +x scripts/recraft_inference.sh
scripts/recraft_inference.sh
```

## Datasets

We have uploaded our datasets to [Hugging Face](https://huggingface.co/datasets/showlab/makeanything/). The datasets include both 4-frame and 9-frame sequence images, covering a total of 21 domains of procedural sequences. For MakeAnything training, each domain consists of **50 sequences**, at a resolution of either **1024 (4-frame)** or **1056 (9-frame)**. Additionally, we provide an extensive collection of SVG and Sketch datasets for further research and experimentation.

Note that **9-frame sequences are arranged in an S-shaped pattern**, whereas **4-frame sequences follow a ɔ-shaped pattern**.

<details>
<summary>Click to view the dataset domains</summary>
<br>

| Domain | Quantity | Domain | Quantity |
|:--------:|:----------:|:--------:|:----------:|
| LEGO | 50 | Cook | 50 |
| Painting | 50 | Icon | 50+1.4k |
| Landscape Illustration | 50 | Portrait | 50+2k |
| Transformer | 50 | Sand Art | 50 |
| Illustration | 50 | Sketch | 50+9k |
| Clay Toys | 50 | Clay Sculpture | 50 |
| ZBrush Modeling | 50 | Wood Sculpture | 50 |
| Ink Painting | 50 | Pencil Sketch | 50 |
| Fabric Toys | 50 | Oil Painting | 50 |
| Jade Carving | 50 | Line Draw | 50 |
| Emoji | 50+12k | | |

</details>

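To grab the data from the command line (assuming `huggingface_hub` is installed; the local directory is your choice):

```bash
# Download the full dataset repository into ./makeanything_data
huggingface-cli download showlab/makeanything --repo-type dataset \
  --local-dir ./makeanything_data
```
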
## Results

### Text-to-Sequence Generation (LoRA & Asymmetric LoRA)
<img src='./images/t2i.png' width='100%' />

### Image-to-Sequence Generation (Recraft Model)
<img src='./images/i2i.png' width='100%' />

### Generalization on Unseen Domains
<img src='./images/oneshot.png' width='100%' />

## Citation

```bibtex
@inproceedings{Song2025MakeAnythingHD,
  title={MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation},
  author={Yiren Song and Cheng Liu and Mike Zheng Shou},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:276107845}
}
```