Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
Abstract
We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
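To make the spatial color encoding concrete, here is a minimal NumPy sketch of one plausible instantiation: each part ID is mapped to a distinct RGB color from a fixed palette, and the discrete mask is recovered from the (possibly noisy) generated image by nearest-color assignment. The palette, function names, and decoding rule are illustrative assumptions, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def encode_parts_to_rgb(part_mask: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map an integer part-ID mask (H, W) to a continuous RGB-like image (H, W, 3).

    `palette` holds one RGB color in [0, 1] per part ID; ID 0 is treated as background.
    """
    return palette[part_mask]  # simple lookup: (H, W) -> (H, W, 3)

def decode_rgb_to_parts(rgb: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Recover a discrete part mask from a (possibly noisy) RGB-encoded image
    by nearest-color assignment, i.e. straightforward post-processing."""
    # Distance from every pixel to every palette color: (H, W, K)
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None, :, :], axis=-1)
    return dists.argmin(axis=-1)  # (H, W) integer part IDs

# Usage: a toy 4x4 mask with three parts and a hypothetical palette.
palette = np.array([[0.0, 0.0, 0.0],   # background
                    [1.0, 0.2, 0.2],   # part 1
                    [0.2, 1.0, 0.2],   # part 2
                    [0.2, 0.2, 1.0]])  # part 3
mask = np.random.randint(0, 4, size=(4, 4))
rgb = encode_parts_to_rgb(mask, palette)
rgb_noisy = np.clip(rgb + 0.05 * np.random.randn(*rgb.shape), 0.0, 1.0)
assert (decode_rgb_to_parts(rgb_noisy, palette) == mask).all()
```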
Community
Recently, together with collaborators at UIUC and Stability AI, we completed a new research project called Stable Part Diffusion 4D (SP4D).
arXiv: https://arxiv.org/abs/2509.10687
Project Page: https://stablepartdiffusion4d.github.io/
Why this work?
In animation and 3D content creation, making a character move requires rigging (skeleton binding) and part decomposition.
But existing methods face two big issues:
- AutoRig methods rely on very limited 3D datasets, so they generalize poorly.
- Part segmentation methods usually split regions by semantics/appearance (e.g., "head," "leg"), which don't align with true kinematic structures and often lack temporal or multi-view consistency.
So we asked:
Can large-scale 2D prior knowledge be leveraged to solve 3D kinematic decomposition?
If so, not only could part segmentation become more stable, but this idea could also extend to automatic rigging.
Our Method
This led us to SP4D: the first multi-view video diffusion framework for kinematic part decomposition.
- From just a video or a single image, SP4D generates multi-view RGB sequences + kinematic part decompositions.
- These results can then be "lifted" into 3D to produce animatable meshes (with skeletons and skinning).
- Technically, it uses a dual-branch diffusion model, a BiDiFuse cross-branch module, and a contrastive consistency loss, and is trained on the new KinematicParts20K dataset (20K rigged objects); a hedged sketch of one possible form of that loss is shown right after this list.
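The contrastive part consistency loss is only named above; as an illustration, the PyTorch sketch below implements one plausible InfoNCE-style variant that averages part-branch features into per-part prototypes for two views (or frames) and asks matching parts to be nearest neighbors. The shapes, temperature value, and prototype construction are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def part_consistency_loss(feat_a, feat_b, parts_a, parts_b, num_parts, tau=0.07):
    """Pull per-part mean embeddings of the same part together across two
    views/frames and push different parts apart (InfoNCE over part prototypes)."""
    def part_prototypes(feat, parts):
        c = feat.shape[0]
        flat, ids = feat.reshape(c, -1), parts.reshape(-1)
        protos = []
        for p in range(num_parts):
            mask = ids == p
            protos.append(flat[:, mask].mean(dim=1) if mask.any()
                          else feat.new_zeros(c))
        return F.normalize(torch.stack(protos), dim=1)  # (num_parts, c)

    za = part_prototypes(feat_a, parts_a)
    zb = part_prototypes(feat_b, parts_b)
    logits = za @ zb.t() / tau                          # part-to-part similarities
    targets = torch.arange(num_parts, device=logits.device)
    # Part p in view A should be most similar to part p in view B.
    return F.cross_entropy(logits, targets)

# Usage on toy tensors: 16-dim features on a 32x32 grid with 6 parts.
C, H, W, K = 16, 32, 32, 6
feat_a, feat_b = torch.randn(C, H, W), torch.randn(C, H, W)
parts_a, parts_b = torch.randint(0, K, (H, W)), torch.randint(0, K, (H, W))
loss = part_consistency_loss(feat_a, feat_b, parts_a, parts_b, num_parts=K)
```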
Key Results
- Segmentation Accuracy (mIoU): SP4D = 0.68 vs. ~0.15–0.22 for baselines.
- Consistency (ARI): SP4D = 0.60, while SAM2 only achieves 0.05.
- User Study: SP4D scored 4.26/5 on clarity, consistency, and animation suitability, far higher than competing methods.
- Rigging Precision: 72.7, significantly outperforming existing AutoRig approaches.
In short: cleaner, more stable results that better drive animation.
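For context on what these numbers measure, the snippet below sketches how mIoU and ARI are commonly computed on predicted versus ground-truth part maps (using scikit-learn for ARI). The toy data and the exact averaging protocol are assumptions; the paper's evaluation setup may differ.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_parts: int) -> float:
    """Mean intersection-over-union over part labels present in the ground truth."""
    ious = []
    for p in range(num_parts):
        pred_p, gt_p = pred == p, gt == p
        if gt_p.sum() == 0:
            continue  # skip parts absent from the ground truth
        union = np.logical_or(pred_p, gt_p).sum()
        ious.append(np.logical_and(pred_p, gt_p).sum() / union)
    return float(np.mean(ious))

# Toy example on random 64x64 part maps with 5 parts.
rng = np.random.default_rng(0)
gt = rng.integers(0, 5, size=(64, 64))
pred = gt.copy()
pred[:8] = rng.integers(0, 5, size=(8, 64))  # corrupt a few rows

print("mIoU:", mean_iou(pred, gt, num_parts=5))
# ARI is label-permutation invariant, so it measures grouping consistency
# without requiring predicted part IDs to match ground-truth IDs.
print("ARI: ", adjusted_rand_score(gt.ravel(), pred.ravel()))
```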
Applications
SP4D has strong potential in animation, gaming, digital humans, AR/VR, and robotics simulation.
More importantly, it validates a new paradigm:
Using large-scale 2D priors to solve fundamental 3D kinematic challenges.
In the future, it may be possible to generate fully animatable 3D characters from just a single image or short video.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control (2025)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy (2025)
- Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation (2025)
- Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation (2025)
- GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation (2025)
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos (2025)
- Compositional Video Synthesis by Temporal Object-Centric Learning (2025)