Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
ICCV 2025
Key contributions:
- MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
- Download pre-trained weights below
- BBox-MaskPose (BMP): method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
- Try the demo!
- Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
- Download pre-trained weights below
- Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
For more details, see the GitHub repository.
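The BBox-MaskPose loop above can be sketched as follows. This is a minimal illustration of the "virtuous circle" idea only; `detect`, `estimate_pose`, and `refine_mask` are hypothetical stand-ins for the actual detector, MaskPose, and mask refinement in the repository.

```python
# Hedged sketch of the iterative BBox-MaskPose (BMP) loop:
# detect bodies, estimate their poses, refine their masks, then
# re-run detection while ignoring already-covered instances.

def detect(image, ignore_masks):
    """Stub detector: returns instances not yet covered by ignore_masks."""
    remaining = [inst for inst in image if inst["id"] not in ignore_masks]
    return remaining[:1]  # one new instance per round, for illustration

def estimate_pose(image, instance):
    return {"keypoints": instance["id"]}  # stub pose estimate

def refine_mask(image, pose):
    return pose["keypoints"]  # stub: mask identifier derived from the pose

def bmp_loop(image, max_rounds=3):
    poses, ignore_masks = [], set()
    for _ in range(max_rounds):
        detections = detect(image, ignore_masks)
        if not detections:
            break  # no new bodies found; the circle closes
        for inst in detections:
            pose = estimate_pose(image, inst)
            poses.append(pose)
            # feed the refined mask back so the next round ignores this body
            ignore_masks.add(refine_mask(image, pose))
    return poses
```

The key design point is the feedback edge: masks produced from poses are fed back into detection, which is why the fine-tuned RTMDet below is trained to ignore masked-out regions.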
Models List
- ViTPose-b multi-dataset
- MaskPose-b
- fine-tuned RTMDet-l
See details of each model below.
1. ViTPose-B [multi-dataset]
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256)
- Output: Keypoint coordinates (48x64 heatmap per keypoint, 21 keypoints)
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 210
- Batch size: 64
- Learning rate: 5e-5
- Hardware: 4x NVIDIA A-100
What's new? ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose. The original authors previously trained the model in a multi-dataset setup; this is a reproduction compatible with MMPose 2.0.
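The model outputs one 48x64 heatmap per keypoint; a minimal way to decode such heatmaps into input-crop coordinates is a per-channel argmax, sketched below. This is a simplification: the actual MMPose codec may apply sub-pixel refinement (e.g. UDP or DARK), so exact coordinates can differ.

```python
# Minimal argmax decoding of keypoint heatmaps (K, 64, 48) into
# (x, y) coordinates in the 192x256 input crop.
import numpy as np

def decode_heatmaps(heatmaps, input_size=(192, 256)):
    """heatmaps: (K, H, W) array -> (K, 2) keypoint (x, y) in input pixels."""
    K, H, W = heatmaps.shape
    coords = np.zeros((K, 2))
    for k in range(K):
        idx = np.argmax(heatmaps[k])      # flat index of the heatmap peak
        y, x = divmod(idx, W)             # recover (row, col) of the peak
        # scale from heatmap resolution back to the input crop resolution
        coords[k] = (x * input_size[0] / W, y * input_size[1] / H)
    return coords
```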
2. MaskPose-B
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256) + estimated instance segmentation
- Output: Keypoint coordinates (48x64 heatmap per keypoint, 21 keypoints)
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset + SAM-estimated instance masks
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 210
- Batch size: 64
- Learning rate: 5e-5
- Hardware: 4x NVIDIA A-100
What's new? Compared to ViTPose, MaskPose takes instance segmentation as an input and is better at distinguishing instances in multi-body scenes, with no computational overhead compared to ViTPose.
3. fine-tuned RTMDet-L
- Model type: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
- Input: RGB images
- Output: Detected instances -- bbox, instance mask and class for each
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMDetection
Training Details
- Training data: COCO Dataset with randomly masked-out instances
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 10
- Batch size: 16
- Learning rate: 2e-2
- Hardware: 4x NVIDIA A-100
What's new? RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection. It is especially effective in multi-body scenes where bodies in the background would otherwise not be detected.
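The masking-out step between detection rounds can be sketched as below. The constant fill value and function name are illustrative assumptions, not the repository's exact implementation.

```python
# Hedged sketch of painting out already-detected instances ("holes")
# before re-running the detector in the next iteration.
import numpy as np

def mask_out(image, masks, fill=114):
    """image: (H, W, 3) uint8; masks: list of (H, W) binary instance masks.
    Returns a copy with detected instances painted with a constant fill,
    mimicking the randomly masked-out instances seen during fine-tuning."""
    out = image.copy()
    for m in masks:
        out[m.astype(bool)] = fill
    return out
```

Because the detector was fine-tuned on images with such holes, it learns not to re-detect the painted regions, letting subsequent rounds surface bodies that were previously occluded.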
Citation
If you use our work, please cite:
@InProceedings{Purkrabek2025ICCV,
  author    = {Purkrabek, Miroslav and Matas, Jiri},
  title     = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2025},
  month     = {October},
}
Authors
- Miroslav Purkrabek (personal website)
- Jiri Matas (personal website)