yiren98 committed on
Commit 12ae7b3 · verified · 1 Parent(s): 36ed92b

Update README.md

Files changed (1)
  1. README.md +10 -239
README.md CHANGED
@@ -1,239 +1,10 @@
- # MakeAnything
-
- > **MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation**
- > <br>
- > [Yiren Song](https://scholar.google.com.hk/citations?user=L2YS0jgAAAAJ),
- > [Cheng Liu](https://scholar.google.com.hk/citations?hl=zh-CN&user=TvdVuAYAAAAJ),
- > and
- > [Mike Zheng Shou](https://sites.google.com/view/showlab)
- > <br>
- > [Show Lab](https://sites.google.com/view/showlab), National University of Singapore
- > <br>
-
- <a href="https://arxiv.org/abs/2502.01572"><img src="https://img.shields.io/badge/arXiv-2502.01572-A42C25.svg" alt="arXiv"></a>
- <a href="https://huggingface.co/showlab/makeanything"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
- <a href="https://huggingface.co/datasets/showlab/makeanything/"><img src="https://img.shields.io/badge/🤗_HuggingFace-Dataset-ffbd45.svg" alt="HuggingFace"></a>
-
- <br>
-
- <img src='./images/teaser.png' width='100%' />
-
-
- ## Configuration
- ### 1. **Environment setup**
- ```bash
- git clone https://github.com/showlab/MakeAnything.git
- cd MakeAnything
-
- conda create -n makeanything python=3.11.10
- conda activate makeanything
- ```
- ### 2. **Requirements installation**
- ```bash
- pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
- pip install --upgrade -r requirements.txt
-
- accelerate config
- ```
-
- ## Asymmetric LoRA
- ### 1. Weights
- You can download the trained checkpoints of the Asymmetric LoRA & LoRA for inference. Details of the available models are listed below:
-
- | **Model** | **Description** | **Resolution** |
- |:-:|:-:|:-:|
- | [asylora_9f_general](https://huggingface.co/showlab/makeanything/blob/main/asymmetric_lora/asymmetric_lora_9f_general.safetensors) | Asymmetric LoRA fine-tuned on all 9-frame datasets. *Index of lora_up*: `1:LEGO` `2:Cook` `3:Painting` `4:Icon` `5:Landscape illustration` `6:Portrait` `7:Transformer` `8:Sand art` `9:Illustration` `10:Sketch` | 1056,1056 |
- | [asylora_4f_general](https://huggingface.co/showlab/makeanything/blob/main/asymmetric_lora/asymmetric_lora_4f_general.safetensors) | Asymmetric LoRA fine-tuned on all 4-frame datasets. *Index of lora_up: (1~10 same as 9f)* `11:Clay toys` `12:Clay sculpture` `13:Zbrush Modeling` `14:Wood sculpture` `15:Ink painting` `16:Pencil sketch` `17:Fabric toys` `18:Oil painting` `19:Jade Carving` `20:Line draw` `21:Emoji` | 1024,1024 |
-
- ### 2. Training
- <span id="dataset_setting"></span>
- #### 2.1 Settings for dataset
- Training relies on a paired dataset consisting of text captions and images. Each dataset folder contains both `.caption` and `.png` files, where each caption file's name corresponds directly to its image's name. Here is an example of an organized dataset:
-
- ```
- dataset/
- ├── portrait_001.png
- ├── portrait_001.caption
- ├── portrait_002.png
- ├── portrait_002.caption
- ├── lego_001.png
- ├── lego_001.caption
- ```
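As a quick sanity check on this layout, the sketch below (a hypothetical helper, not part of the repository) lists any image that is missing its matching `.caption` file before you launch training:

```python
from pathlib import Path

def find_unpaired_images(dataset_dir):
    """Return .png files in dataset_dir that have no matching .caption file."""
    root = Path(dataset_dir)
    return [p.name for p in sorted(root.glob("*.png"))
            if not p.with_suffix(".caption").exists()]
```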
-
- Each `.caption` file contains a **single line** of text that serves as the prompt for generating the corresponding image. The prompt **must specify the index of the lora_up** used for that particular training sample in the Asymmetric LoRA. The format is `--lora_up <index>`, where `<index>` is the index of the B matrix in the Asymmetric LoRA corresponding to the domain used in training; indices **start from 1**, not 0.
-
- For example, a `.caption` file for a portrait painting sequence might look as follows:
-
- ```caption
- 3*3 of 9 sub-images, step-by-step portrait painting process, 1 girl --lora_up 6
- ```
-
- Then, organize your **dataset configuration file**, written in `TOML`. Here is an example:
-
- ```toml
- [general]
- enable_bucket = false
-
- [[datasets]]
- resolution = 1056
- batch_size = 1
-
- [[datasets.subsets]]
- image_dir = '/path/to/dataset/'
- caption_extension = '.caption'
- num_repeats = 1
- ```
-
- It is recommended to set the batch size to 1 and the resolution to 1024 (4-frames) or 1056 (9-frames).
-
- #### 2.2 Start training
- We have provided a template file for training the Asymmetric LoRA in `scripts/asylora_train.sh`. Simply replace the corresponding paths with your own to start training. Note that `lora_ups_num` in the script is the total number of B matrices you specified for the Asymmetric LoRA during training.
-
- ```bash
- chmod +x scripts/asylora_train.sh
- scripts/asylora_train.sh
- ```
-
- Additionally, if you are directly **using our dataset for training**, note that the `.caption` files in our released dataset do not specify the `--lora_up <index>` field. You will need to update the `.caption` files with the appropriate `--lora_up <index>` values before starting training.
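A one-off pass such as the sketch below (a hypothetical helper; the `--lora_up` flag is the caption format described above) can stamp the domain index onto every caption in a domain folder:

```python
from pathlib import Path

def add_lora_up_index(dataset_dir, index):
    """Append ' --lora_up <index>' to each single-line .caption file that
    does not already carry the flag. Indices start from 1, one per domain."""
    for cap in sorted(Path(dataset_dir).glob("*.caption")):
        text = cap.read_text().strip()
        if "--lora_up" not in text:
            cap.write_text(f"{text} --lora_up {index}\n")
```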
-
- ### 3. Inference
- We have also provided a template file for running inference with the Asymmetric LoRA in `scripts/asylora_inference.sh`. Once training is done, replace the file paths, fill in your prompt, and run inference. Note that `lora_up_cur` in the script is the index of the B matrix to be used for inference.
-
- ```bash
- chmod +x scripts/asylora_inference.sh
- scripts/asylora_inference.sh
- ```
-
-
- ## Recraft Model
- ### 1. Weights
- You can download the trained checkpoints of the Recraft Model for inference. Details of the available models are listed below:
- | **Model** | **Description** | **Resolution** |
- |:-:|:-:|:-:|
- | [recraft_9f_lego](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_9f_lego.safetensors) | Recraft Model trained on the `LEGO` dataset. Supports `9-frame` generation. | 1056,1056 |
- | [recraft_9f_portrait](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_9f_portrait.safetensors) | Recraft Model trained on the `Portrait` dataset. Supports `9-frame` generation. | 1056,1056 |
- | [recraft_9f_sketch](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_9f_sketch.safetensors) | Recraft Model trained on the `Sketch` dataset. Supports `9-frame` generation. | 1056,1056 |
- | [recraft_4f_wood_sculpture](https://huggingface.co/showlab/makeanything/blob/main/recraft/recraft_4f_wood_sculpture.safetensors) | Recraft Model trained on the `Wood sculpture` dataset. Supports `4-frame` generation. | 1024,1024 |
-
- ### 2. Training
- #### 2.1 Obtain standard LoRA
- The second training phase, image-to-sequence generation with the Recraft model, requires a **standard LoRA architecture** merged into flux.1 before Recraft training can begin. The first step is therefore to decompose the Asymmetric LoRA into the original LoRA format.
-
- To achieve this, either **train a standard LoRA directly** (optional method below) or use the script template we provide in `scripts/asylora_split.sh` for **splitting the Asymmetric LoRA**. The script extracts the required B matrices from the Asymmetric LoRA model; specifically, `LORA_UP` in the script specifies the index of the B matrix you wish to extract for use as the original LoRA.
-
- ```bash
- chmod +x scripts/asylora_split.sh
- scripts/asylora_split.sh
- ```
-
- #### (Optional) Train standard LoRA
- You can also **directly train a standard LoRA** for the Recraft process, eliminating the need to decompose the Asymmetric LoRA. In our project, we have included the standard LoRA training code from [kohya-ss/sd-scripts](https://github.com/kohya-ss/sd-scripts) in the files `flux_train_network.py` for training and `flux_minimal_inference.py` for inference. Refer to the related documentation for guidance on training.
-
- Alternatively, other training platforms such as [kijai/ComfyUI-FluxTrainer](https://github.com/kijai/ComfyUI-FluxTrainer) are also viable options. These platforms provide tools to facilitate the training and inference of LoRA models for the Recraft process.
-
- #### 2.2 Merge LoRA to flux.1
- Now that you have obtained a standard LoRA, use our `scripts/lora_merge.sh` template script to merge the LoRA into the flux.1 checkpoints for further Recraft training. Note that the merged model may take up **around 50GB** of storage space.
-
- ```bash
- chmod +x scripts/lora_merge.sh
- scripts/lora_merge.sh
- ```
- #### 2.3 Settings for training
-
- The dataset structure for Recraft training follows the same organization format as the dataset for the Asymmetric LoRA, described in [Asymmetric LoRA 2.1 Settings for dataset](#dataset_setting). A `TOML` configuration file is also required to organize and configure the dataset. Below is a template for the dataset configuration file:
-
- ```toml
- [general]
- flip_aug = false
- color_aug = false
- keep_tokens_separator = "|||"
- shuffle_caption = false
- caption_tag_dropout_rate = 0
- caption_extension = ".caption"
-
- [[datasets]]
- batch_size = 1
- enable_bucket = true
- resolution = [1024, 1024]
-
- [[datasets.subsets]]
- image_dir = "/path/to/dataset/"
- num_repeats = 1
- ```
-
- Note that for training with 4-frame step sequences, the resolution must be set to `1024`. For training with 9-frame sequences, the resolution should be `1056`.
165
-
166
- For the sampling phase of the Recraft training process, we need to organize two text files: `sample_images.txt` and `sample_prompts.txt`. These files will store the sampled condition images and their corresponding prompts, respectively. Below are the templates for both files:
167
-
168
- **sample_images.txt**
169
- ```txt
170
- /path/to/image_1.png
171
- /path/to/image_2.png
172
- ```
173
-
174
- **sample_prompts.txt**
175
- ```txt
176
- image_1_prompt_content
177
- image_2_prompt_content
178
- ```
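Keeping the two files line-aligned matters, since line *i* of one pairs with line *i* of the other. A sketch (hypothetical helper, not part of the repository) that writes both from a list of (image, prompt) pairs:

```python
from pathlib import Path

def write_sample_files(pairs, out_dir="."):
    """Write sample_images.txt and sample_prompts.txt from (image_path, prompt)
    pairs, keeping the two files in matching line order."""
    out = Path(out_dir)
    (out / "sample_images.txt").write_text(
        "\n".join(img for img, _ in pairs) + "\n")
    (out / "sample_prompts.txt").write_text(
        "\n".join(prompt for _, prompt in pairs) + "\n")
```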
- #### 2.4 Recraft training
- We have provided a template file for training the Recraft Model in `scripts/recraft_train.sh`. Simply replace the corresponding paths with your own to start training. Note that `frame_num` in the script must be `4` (for 1024 resolution) or `9` (for 1056 resolution).
-
- ```bash
- chmod +x scripts/recraft_train.sh
- scripts/recraft_train.sh
- ```
-
- ### 3. Inference
- We have also provided a template file for running inference with the Recraft Model in `scripts/recraft_inference.sh`. Once training is done, replace the file paths, fill in your prompt, and run inference.
-
- ```bash
- chmod +x scripts/recraft_inference.sh
- scripts/recraft_inference.sh
- ```
-
- ## Datasets
-
- We have uploaded our datasets to [Hugging Face](https://huggingface.co/datasets/showlab/makeanything/). The datasets include both 4-frame and 9-frame sequence images, covering a total of 21 domains of procedural sequences. For MakeAnything training, each domain consists of **50 sequences**, with resolutions of either **1024 (4-frame)** or **1056 (9-frame)**. Additionally, we provide an extensive collection of SVG and Sketch datasets for further research and experimentation.
-
- Note that the arrangement of **9-frame sequences follows an S-shape pattern**, whereas **4-frame sequences follow a ɔ-shape pattern**.
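For the 9-frame case, the S-shape ordering can be expressed as a mapping from step index to grid cell. The sketch below assumes the path starts at the top-left and snakes row by row (the 4-frame ɔ-shape follows its own path and is not covered here):

```python
def s_shape_cells(n_rows, n_cols):
    """Map step index -> (row, col) for an S-shaped (boustrophedon) layout:
    even rows run left-to-right, odd rows right-to-left."""
    cells = []
    for i in range(n_rows * n_cols):
        row, offset = divmod(i, n_cols)
        col = offset if row % 2 == 0 else n_cols - 1 - offset
        cells.append((row, col))
    return cells
```

For a 3×3 sequence this visits the top row left-to-right, the middle row right-to-left, and the bottom row left-to-right again.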
-
- <details>
- <summary>Click to preview the datasets</summary>
- <br>
-
- | Domain | Preview | Quantity | Domain | Preview | Quantity |
- |:--------:|:---------:|:----------:|:--------:|:---------:|:----------:|
- | LEGO | ![LEGO Preview](./images/datasets/lego.png) | 50 | Cook | ![Cook Preview](./images/datasets/cook.png) | 50 |
- | Painting | ![Painting Preview](./images/datasets/painting.png) | 50 | Icon | ![Icon Preview](./images/datasets/icon.png) | 50+1.4k |
- | Landscape Illustration | ![Landscape Illustration Preview](./images/datasets/landscape.png) | 50 | Portrait | ![Portrait Preview](./images/datasets/portrait.png) | 50+2k |
- | Transformer | ![Transformer Preview](./images/datasets/transformer.png) | 50 | Sand Art | ![Sand Art Preview](./images/datasets/sandart.png) | 50 |
- | Illustration | ![Illustration Preview](./images/datasets/illustration.png) | 50 | Sketch | ![Sketch Preview](./images/datasets/sketch.png) | 50+9k |
- | Clay Toys | ![Clay Toys Preview](./images/datasets/claytoys.png) | 50 | Clay Sculpture | ![Clay Sculpture Preview](./images/datasets/claysculpture.png) | 50 |
- | ZBrush Modeling | ![ZBrush Modeling Preview](./images/datasets/zbrush.png) | 50 | Wood Sculpture | ![Wood Sculpture Preview](./images/datasets/woodsculpture.png) | 50 |
- | Ink Painting | ![Ink Painting Preview](./images/datasets/inkpainting.png) | 50 | Pencil Sketch | ![Pencil Sketch Preview](./images/datasets/pencilsketch.png) | 50 |
- | Fabric Toys | ![Fabric Toys Preview](./images/datasets/fabrictoys.png) | 50 | Oil Painting | ![Oil Painting Preview](./images/datasets/oilpainting.png) | 50 |
- | Jade Carving | ![Jade Carving Preview](./images/datasets/jadecarving.png) | 50 | Line Draw | ![Line Draw Preview](./images/datasets/linedraw.png) | 50 |
- | Emoji | ![Emoji Preview](./images/datasets/emoji.png) | 50+12k | | | |
-
- </details>
-
- ## Results
- ### Text-to-Sequence Generation (LoRA & Asymmetric LoRA)
- <img src='./images/t2i.png' width='100%' />
-
- ### Image-to-Sequence Generation (Recraft Model)
- <img src='./images/i2i.png' width='100%' />
-
- ### Generalization on Unseen Domains
- <img src='./images/oneshot.png' width='100%' />
-
- ## Citation
- ```
- @inproceedings{Song2025MakeAnythingHD,
-   title={MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation},
-   author={Yiren Song and Cheng Liu and Mike Zheng Shou},
-   year={2025},
-   url={https://api.semanticscholar.org/CorpusID:276107845}
- }
- ```
 
+ ---
+ title: "MakeAnything"
+ emoji: "🤖"
+ colorFrom: "red"
+ colorTo: "yellow"
+ sdk: "gradio"
+ sdk_version: "3.6"
+ app_file: gradio_app.py
+ pinned: false
+ ---