FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
Abstract
Two merging strategies and a diffusion-based decoder improve 3D Human Mesh Recovery by reducing computational cost and slightly enhancing performance.
Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
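To make the layer-selection idea concrete, here is a minimal sketch of one way an error-constrained search could work. It is not the authors' algorithm: the abstract only says that ECLM merges layers with minimal impact on MPJPE. The greedy procedure, the function name `select_layers_to_merge`, the `evaluate_mpjpe` callback, and the `max_increase_mm` budget are all assumptions made for illustration.

```python
def select_layers_to_merge(num_layers, evaluate_mpjpe, max_increase_mm):
    """Greedily pick layers whose merging keeps MPJPE within a budget.

    Hypothetical sketch, not the paper's procedure.
    evaluate_mpjpe: callable taking a frozenset of merged layer indices
                    and returning the resulting MPJPE (assumed, in mm).
    max_increase_mm: allowed MPJPE increase over the unmerged baseline.
    """
    baseline = evaluate_mpjpe(frozenset())

    # Cost of merging each layer on its own, relative to the baseline.
    single_cost = {
        layer: evaluate_mpjpe(frozenset({layer})) - baseline
        for layer in range(num_layers)
    }

    merged = set()
    # Try the cheapest layers first; keep a layer only if the combined
    # error of everything merged so far stays within the budget.
    for layer in sorted(single_cost, key=single_cost.get):
        candidate = frozenset(merged | {layer})
        if evaluate_mpjpe(candidate) - baseline <= max_increase_mm:
            merged.add(layer)
    return sorted(merged)


# Toy evaluator (made-up numbers): each merged layer adds a fixed error.
_TOY_COSTS = [0.1, 2.0, 0.3, 0.2]

def toy_mpjpe(merged_layers):
    return 50.0 + sum(_TOY_COSTS[i] for i in merged_layers)
```

With the toy evaluator and a budget of 1.0 mm, `select_layers_to_merge(4, toy_mpjpe, 1.0)` keeps the three cheap layers and rejects layer 1. In practice the evaluator would run the merged model on a validation set, which is far more expensive than this stand-in.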
Community
TL;DR: FastHMR introduces two merging strategies, Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe), to reduce computational cost and redundancy in transformer-based 3D Human Mesh Recovery. ECLM selectively merges layers with minimal impact on MPJPE, while Mask-ToMe merges background tokens that contribute little to the prediction. A diffusion-based decoder further enhances performance by using temporal context and pose priors. The method achieves up to 2.3x faster inference while slightly improving accuracy across benchmarks.
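As a rough illustration of merging background tokens, the sketch below keeps every foreground token and collapses all background tokens into a single mean token. This is a simplified stand-in, not the paper's Mask-ToMe: the summary does not say how the mask is obtained or how tokens are paired for merging. The function name `merge_background_tokens` and the boolean `fg_mask` input are assumptions.

```python
def merge_background_tokens(tokens, fg_mask):
    """Keep foreground tokens; average all background tokens into one.

    Hypothetical simplification of mask-guided token merging.
    tokens:  list of N token vectors (each a list of floats).
    fg_mask: list of N booleans, True where the token covers the person.
    """
    if len(tokens) != len(fg_mask):
        raise ValueError("tokens and fg_mask must have the same length")

    foreground = [list(t) for t, keep in zip(tokens, fg_mask) if keep]
    background = [t for t, keep in zip(tokens, fg_mask) if not keep]

    # Nothing to merge if every token is foreground.
    if not background:
        return foreground

    dim = len(background[0])
    mean_token = [
        sum(t[d] for t in background) / len(background) for d in range(dim)
    ]
    # The merged background token is appended after the foreground tokens.
    return foreground + [mean_token]
```

For N tokens with B background tokens, the sequence shrinks from N to N - B + 1, and because self-attention cost grows quadratically with sequence length, this is where the savings would come from.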