arxiv:2502.01572

MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation

Published on Feb 3

Submitted by Yiren Song on Feb 4


Authors:

Yiren Song, Cheng Liu, Mike Zheng Shou

Abstract

A diffusion transformer-based framework, MakeAnything, generates coherent procedural tutorials across domains using a large dataset and asymmetric low-rank adaptation, outperforming existing methods.

AI-generated summary

A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DiT), which leverages fine-tuning to activate the in-context capabilities of DiT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
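The asymmetric LoRA idea described above (frozen encoder, tuned decoder) can be illustrated with a toy linear layer. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say which matrices play the encoder and decoder roles, so the code assumes the LoRA down-projection `A` acts as the frozen "encoder" and the up-projection `B` as the trainable "decoder". The class name `AsymmetricLoRALinear`, its methods, the squared-error loss, and all sizes are hypothetical and chosen purely for illustration.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class AsymmetricLoRALinear:
    """Toy layer computing y = W x + B (A x).

    Assumption (not stated in the abstract): A is the frozen "encoder"
    and B is the trainable "decoder". W is the frozen base weight.
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        # Frozen base weight, standing in for a pretrained layer.
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # Frozen down-projection ("encoder"): initialized once, never updated.
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        # Trainable up-projection ("decoder"): starts at zero, so the
        # adapted layer initially matches the base layer exactly.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        h = matvec(self.A, x)          # low-rank code of the input
        base = matvec(self.W, x)       # frozen base output
        delta = matvec(self.B, h)      # task-specific correction
        return [b + d for b, d in zip(base, delta)]

    def sgd_step(self, x, target, lr=0.1):
        """One gradient step on B only, for L = 0.5 * ||y - target||^2."""
        h = matvec(self.A, x)
        y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # dL/dB[i][j] = err[i] * h[j]; W and A receive no update.
        for i, e in enumerate(err):
            for j, hj in enumerate(h):
                self.B[i][j] -= lr * e * hj
        return 0.5 * sum(e * e for e in err)   # loss before the update


# Usage: fit a single (input, target) pair while W and A stay fixed.
layer = AsymmetricLoRALinear(d_in=4, d_out=3, rank=2)
x = [1.0, 0.5, -0.5, 2.0]
target = [0.0, 1.0, -1.0]
losses = [layer.sgd_step(x, target) for _ in range(50)]
```

In this toy setup only `B` changes during training, so the number of tuned parameters is `d_out * rank` rather than `d_out * d_in`, while the frozen `W` and `A` keep whatever general behavior they started with.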
