Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

    ICCV 2025

The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other. This approach enhances all three tasks simultaneously. Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.

Key contributions:

  1. MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
    • Download pre-trained weights below
  2. BBox-MaskPose (BMP): method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
    • Try the demo!
  3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
    • Download pre-trained weights below
  4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.

arXiv           GitHub repository           Project Website

For more details, see the GitHub repository.
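The self-improving loop described above can be sketched as a simple iterative procedure. The sketch below is a toy illustration of the idea, not the repository's actual API; all function names (`detect`, `estimate_pose`, `refine_mask`) are hypothetical placeholders, and the "image" is a 1-D list of intensities for brevity.

```python
def apply_masks(image, masks):
    """Zero out already-explained pixels (toy 1-D 'image' of intensities)."""
    out = list(image)
    for mask in masks:
        for i in mask:
            out[i] = 0
    return out

def bmp_loop(image, detect, estimate_pose, refine_mask, num_iters=3):
    """Run detection, mask-conditioned pose estimation, and
    pose-conditioned segmentation in a loop, masking out found
    instances so the next detection pass can surface occluded bodies."""
    instances = []
    current = list(image)
    for _ in range(num_iters):
        new = detect(current)               # sees only unmasked pixels
        if not new:
            break
        for inst in new:
            inst["pose"] = estimate_pose(image, inst["mask"])
            inst["mask"] = refine_mask(image, inst["pose"])
        instances.extend(new)
        current = apply_masks(current, [d["mask"] for d in new])
    return instances
```

Each pass conditions pose on the current mask and the mask on the refined pose, then removes the explained pixels so later passes find instances that the first detection missed.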

πŸ“ Models List

  1. ViTPose-b multi-dataset
  2. MaskPose-b
  3. fine-tuned RTMDet-l

See details of each model below.


1. ViTPose-B [multi-dataset]

  • Model type: ViT-b backbone with multi-layer decoder
  • Input: RGB images (192x256)
  • Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMPose

Training Details

What's new? ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose. The authors originally trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.
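For reference, keypoint coordinates are recovered from the 48x64 heatmaps by locating each heatmap's peak and scaling it back to the 192x256 input crop. Below is a minimal argmax-only sketch; real decoding pipelines additionally apply sub-pixel refinement, so treat this as illustrative rather than the exact MMPose implementation.

```python
def decode_heatmap(heatmap, in_w=192, in_h=256):
    """Map the peak of one heatmap (64 rows x 48 cols) back to
    input-crop coordinates. Argmax only; no sub-pixel refinement."""
    hm_h, hm_w = len(heatmap), len(heatmap[0])
    score, x, y = max((v, x, y)
                      for y, row in enumerate(heatmap)
                      for x, v in enumerate(row))
    # Scale heatmap indices to the input-crop resolution (factor 4 here).
    return x * in_w / hm_w, y * in_h / hm_h, score
```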


2. MaskPose-B

  • Model type: ViT-b backbone with multi-layer decoder
  • Input: RGB images (192x256) + estimated instance segmentation
  • Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMPose

Training Details

What's new? Compared to ViTPose, MaskPose takes instance segmentation as an additional input and is even better at distinguishing instances in multi-body scenes. It adds no computational overhead compared to ViTPose.


3. fine-tuned RTMDet-L

  • Model type: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
  • Input: RGB images
  • Output: Detected instances -- bbox, instance mask and class for each
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMDetection

Training Details

What's new? RTMDet fine-tuned to ignore masked-out instances, designed for iterative detection. Especially effective in multi-body scenes where people in the background would otherwise go undetected.
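A minimal sketch of the 'ignoring holes' idea: during an iterative pass, candidate boxes that mostly cover already masked-out area are discarded, so the detector spends its budget on not-yet-explained regions. The threshold and function names below are illustrative placeholders, not values from the paper.

```python
def overlap_with_holes(box, holes):
    """Fraction of a box (x0, y0, x1, y1) covered by axis-aligned
    'hole' rectangles, i.e. regions masked out in earlier passes."""
    x0, y0, x1, y1 = box
    area = max(0, x1 - x0) * max(0, y1 - y0)
    if area == 0:
        return 0.0
    covered = 0
    for hx0, hy0, hx1, hy1 in holes:
        iw = max(0, min(x1, hx1) - max(x0, hx0))
        ih = max(0, min(y1, hy1) - max(y0, hy0))
        covered += iw * ih
    return min(covered / area, 1.0)

def filter_new_detections(boxes, holes, max_overlap=0.8):
    """Keep only boxes that are not mostly inside already-found masks."""
    return [b for b in boxes if overlap_with_holes(b, holes) < max_overlap]
```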

πŸ“„ Citation

If you use our work, please cite:

@InProceedings{Purkrabek2025ICCV,
  author={Purkrabek, Miroslav and Matas, Jiri},
  title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle}, 
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025},
  month={October},
}

πŸ§‘β€πŸ’» Authors
