Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

    ICCV 2025

The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other. This approach enhances all three tasks simultaneously. Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.

Key contributions:

  1. MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
    • Download pre-trained weights below
  2. BBox-MaskPose (BMP): method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
    • Try the demo!
  3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
    • Download pre-trained weights below
  4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.

arXiv           GitHub repository           Project Website

For more details, see the GitHub repository.
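The self-improving loop described above can be sketched as a simple iterative procedure. The sketch below is a toy illustration of the idea, not the repository's actual API; all function names (`detect`, `estimate_pose`, `refine_mask`) are hypothetical placeholders, and the "image" is a 1-D list of intensities for brevity.

```python
def apply_masks(image, masks):
    """Zero out already-explained pixels (toy 1-D 'image' of intensities)."""
    out = list(image)
    for mask in masks:
        for i in mask:
            out[i] = 0
    return out

def bmp_loop(image, detect, estimate_pose, refine_mask, num_iters=3):
    """Run detection, mask-conditioned pose estimation, and
    pose-conditioned segmentation in a loop, masking out found
    instances so the next detection pass can surface occluded bodies."""
    instances = []
    current = list(image)
    for _ in range(num_iters):
        new = detect(current)               # sees only unmasked pixels
        if not new:
            break
        for inst in new:
            inst["pose"] = estimate_pose(image, inst["mask"])
            inst["mask"] = refine_mask(image, inst["pose"])
        instances.extend(new)
        current = apply_masks(current, [d["mask"] for d in new])
    return instances
```

Each pass conditions pose on the current mask and the mask on the refined pose, then removes the explained pixels so later passes find instances that the first detection missed.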

πŸ“ Models List

  1. ViTPose-b multi-dataset
  2. MaskPose-b
  3. fine-tuned RTMDet-l

See details of each model below.


1. ViTPose-B [multi-dataset]

  • Model type: ViT-b backbone with multi-layer decoder
  • Input: RGB images (192x256)
  • Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMPose

Training Details

What's new? ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose. The authors originally trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.
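For reference, keypoint coordinates are recovered from the 48x64 heatmaps by locating each heatmap's peak and scaling it back to the 192x256 input crop. Below is a minimal argmax-only sketch; real decoding pipelines additionally apply sub-pixel refinement, so treat this as illustrative rather than the exact MMPose implementation.

```python
def decode_heatmap(heatmap, in_w=192, in_h=256):
    """Map the peak of one heatmap (64 rows x 48 cols) back to
    input-crop coordinates. Argmax only; no sub-pixel refinement."""
    hm_h, hm_w = len(heatmap), len(heatmap[0])
    score, x, y = max((v, x, y)
                      for y, row in enumerate(heatmap)
                      for x, v in enumerate(row))
    # Scale heatmap indices to the input-crop resolution (factor 4 here).
    return x * in_w / hm_w, y * in_h / hm_h, score
```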


2. MaskPose-B

  • Model type: ViT-b backbone with multi-layer decoder
  • Input: RGB images (192x256) + estimated instance segmentation
  • Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMPose

Training Details

What's new? Compared to ViTPose, MaskPose takes instance segmentation as an additional input and is even better at distinguishing instances in multi-body scenes. It adds no computational overhead compared to ViTPose.


3. fine-tuned RTMDet-L

  • Model type: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
  • Input: RGB images
  • Output: Detected instances -- bbox, instance mask and class for each
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMDetection

Training Details

What's new? RTMDet fine-tuned to ignore masked-out instances, designed for iterative detection. Especially effective in multi-body scenes where people in the background would otherwise go undetected.
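A minimal sketch of the 'ignoring holes' idea: during an iterative pass, candidate boxes that mostly cover already masked-out area are discarded, so the detector spends its budget on not-yet-explained regions. The threshold and function names below are illustrative placeholders, not values from the paper.

```python
def overlap_with_holes(box, holes):
    """Fraction of a box (x0, y0, x1, y1) covered by axis-aligned
    'hole' rectangles, i.e. regions masked out in earlier passes."""
    x0, y0, x1, y1 = box
    area = max(0, x1 - x0) * max(0, y1 - y0)
    if area == 0:
        return 0.0
    covered = 0
    for hx0, hy0, hx1, hy1 in holes:
        iw = max(0, min(x1, hx1) - max(x0, hx0))
        ih = max(0, min(y1, hy1) - max(y0, hy0))
        covered += iw * ih
    return min(covered / area, 1.0)

def filter_new_detections(boxes, holes, max_overlap=0.8):
    """Keep only boxes that are not mostly inside already-found masks."""
    return [b for b in boxes if overlap_with_holes(b, holes) < max_overlap]
```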

πŸ“„ Citation

If you use our work, please cite:

@InProceedings{Purkrabek2025ICCV,
  author={Purkrabek, Miroslav and Matas, Jiri},
  title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle}, 
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025},
  month={October},
}

πŸ§‘β€πŸ’» Authors
