Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
Abstract
We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
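To make the spatial color encoding concrete, here is a minimal NumPy sketch of one plausible instantiation: each part ID is mapped to a distinct RGB color from a fixed palette, and the discrete mask is recovered from the (possibly noisy) generated image by nearest-color assignment. The palette, function names, and decoding rule are illustrative assumptions, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def encode_parts_to_rgb(part_mask: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map an integer part-ID mask (H, W) to a continuous RGB-like image (H, W, 3).

    `palette` holds one RGB color in [0, 1] per part ID; ID 0 is treated as background.
    """
    return palette[part_mask]  # simple lookup: (H, W) -> (H, W, 3)

def decode_rgb_to_parts(rgb: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Recover a discrete part mask from a (possibly noisy) RGB-encoded image
    by nearest-color assignment, i.e. straightforward post-processing."""
    # Distance from every pixel to every palette color: (H, W, K)
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None, :, :], axis=-1)
    return dists.argmin(axis=-1)  # (H, W) integer part IDs

# Usage: a toy 4x4 mask with three parts and a hypothetical palette.
palette = np.array([[0.0, 0.0, 0.0],   # background
                    [1.0, 0.2, 0.2],   # part 1
                    [0.2, 1.0, 0.2],   # part 2
                    [0.2, 0.2, 1.0]])  # part 3
mask = np.random.randint(0, 4, size=(4, 4))
rgb = encode_parts_to_rgb(mask, palette)
rgb_noisy = np.clip(rgb + 0.05 * np.random.randn(*rgb.shape), 0.0, 1.0)
assert (decode_rgb_to_parts(rgb_noisy, palette) == mask).all()
```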
Community
Recently, together with collaborators at UIUC and Stability AI, we completed a new research project called Stable Part Diffusion 4D (SP4D).
arXiv: https://arxiv.org/abs/2509.10687
Project Page: https://stablepartdiffusion4d.github.io/
Why this work?
In animation and 3D content creation, making a character move requires rigging (skeleton binding) and part decomposition.
But existing methods face two big issues:
- AutoRig methods rely on very limited 3D datasets, so they generalize poorly.
- Part segmentation methods usually split regions by semantics/appearance (e.g., "head," "leg"), which don't align with true kinematic structures and often lack temporal or multi-view consistency.
So we asked:
Can large-scale 2D prior knowledge be leveraged to solve 3D kinematic decomposition?
If so, not only could part segmentation become more stable, but this idea could also extend to automatic rigging.
Our Method
This led us to SP4D: the first multi-view video diffusion framework for kinematic part decomposition.
- From just a video or a single image, SP4D generates multi-view RGB sequences + kinematic part decompositions.
- These results can then be "lifted" into 3D to produce animatable meshes (with skeletons and skinning).
- Technically, it uses a dual-branch diffusion model, a BiDiFuse cross-branch module, and a contrastive consistency loss, and is trained on the new KinematicParts20K dataset (20K rigged objects); a hedged sketch of one possible form of that loss is shown right after this list.
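The contrastive part consistency loss is only named above; as an illustration, the PyTorch sketch below implements one plausible InfoNCE-style variant that averages part-branch features into per-part prototypes for two views (or frames) and asks matching parts to be nearest neighbors. The shapes, temperature value, and prototype construction are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def part_consistency_loss(feat_a, feat_b, parts_a, parts_b, num_parts, tau=0.07):
    """Pull per-part mean embeddings of the same part together across two
    views/frames and push different parts apart (InfoNCE over part prototypes)."""
    def part_prototypes(feat, parts):
        c = feat.shape[0]
        flat, ids = feat.reshape(c, -1), parts.reshape(-1)
        protos = []
        for p in range(num_parts):
            mask = ids == p
            protos.append(flat[:, mask].mean(dim=1) if mask.any()
                          else feat.new_zeros(c))
        return F.normalize(torch.stack(protos), dim=1)  # (num_parts, c)

    za = part_prototypes(feat_a, parts_a)
    zb = part_prototypes(feat_b, parts_b)
    logits = za @ zb.t() / tau                          # part-to-part similarities
    targets = torch.arange(num_parts, device=logits.device)
    # Part p in view A should be most similar to part p in view B.
    return F.cross_entropy(logits, targets)

# Usage on toy tensors: 16-dim features on a 32x32 grid with 6 parts.
C, H, W, K = 16, 32, 32, 6
feat_a, feat_b = torch.randn(C, H, W), torch.randn(C, H, W)
parts_a, parts_b = torch.randint(0, K, (H, W)), torch.randint(0, K, (H, W))
loss = part_consistency_loss(feat_a, feat_b, parts_a, parts_b, num_parts=K)
```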
Key Results
- Segmentation Accuracy (mIoU): SP4D = 0.68 vs. ~0.15–0.22 for baselines.
- Consistency (ARI): SP4D = 0.60, while SAM2 only achieves 0.05.
- User Study: SP4D scored 4.26/5 on clarity, consistency, and animation suitability, far higher than competing methods.
- Rigging Precision: 72.7, significantly outperforming existing AutoRig approaches.
In short: cleaner, more stable results that better drive animation.
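For context on what these numbers measure, the snippet below sketches how mIoU and ARI are commonly computed on predicted versus ground-truth part maps (using scikit-learn for ARI). The toy data and the exact averaging protocol are assumptions; the paper's evaluation setup may differ.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_parts: int) -> float:
    """Mean intersection-over-union over part labels present in the ground truth."""
    ious = []
    for p in range(num_parts):
        pred_p, gt_p = pred == p, gt == p
        if gt_p.sum() == 0:
            continue  # skip parts absent from the ground truth
        union = np.logical_or(pred_p, gt_p).sum()
        ious.append(np.logical_and(pred_p, gt_p).sum() / union)
    return float(np.mean(ious))

# Toy example on random 64x64 part maps with 5 parts.
rng = np.random.default_rng(0)
gt = rng.integers(0, 5, size=(64, 64))
pred = gt.copy()
pred[:8] = rng.integers(0, 5, size=(8, 64))  # corrupt a few rows

print("mIoU:", mean_iou(pred, gt, num_parts=5))
# ARI is label-permutation invariant, so it measures grouping consistency
# without requiring predicted part IDs to match ground-truth IDs.
print("ARI: ", adjusted_rand_score(gt.ravel(), pred.ravel()))
```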
Applications
SP4D has strong potential in animation, gaming, digital humans, AR/VR, and robotics simulation.
More importantly, it validates a new paradigm:
Using large-scale 2D priors to solve fundamental 3D kinematic challenges.
In the future, it may be possible to generate fully animatable 3D characters from just a single image or short video.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control (2025)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy (2025)
- Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation (2025)
- Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation (2025)
- GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation (2025)
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos (2025)
- Compositional Video Synthesis by Temporal Object-Centric Learning (2025)