Leveraging generic foundation models for multimodal surgical data analysis
Model repository for the paper "Leveraging generic foundation models for multimodal surgical data analysis" by Simon Pezold, Jérôme A. Kurylec, Jan S. Liechti, Beat P. Müller, and Joël L. Lavanchy. For more details, see the paper and its code repository.
Model weights and usage
Finetuned V-JEPA (ours): We finetuned V-JEPA ViT-L on unlabeled endoscopic surgical videos, as described in the paper. The resulting model weights can be downloaded here.
Pretrained V-JEPA (third-party): The model weights that formed the basis for finetuning can be downloaded via the links provided in Meta Research's V-JEPA v1 code repository – use the link under model zoo / pretrained models / ViT-L / checkpoint or this direct link. Please note that we are not affiliated with Meta Research in any way.
For details on how to use the provided model weights with the approach proposed in our paper, follow the instructions in our code repository.
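As a rough orientation only, the sketch below shows one way a downloaded checkpoint file could be opened with PyTorch and its parameter names cleaned up before being passed to a model. Everything specific in it is an assumption rather than something documented here: the top-level key `encoder`, the wrapper prefixes `module.` and `backbone.`, and the helper names `strip_wrapper_prefixes` and `load_encoder_weights` are hypothetical. The code repository remains the authoritative source for the actual loading procedure.

```python
from typing import Any, Dict, Mapping

# Prefixes that wrappers such as DistributedDataParallel commonly prepend to
# parameter names. ASSUMPTION: not confirmed for the checkpoints linked above.
WRAPPER_PREFIXES = ("module.", "backbone.")


def strip_wrapper_prefixes(state_dict: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of state_dict with leading wrapper prefixes removed."""
    cleaned: Dict[str, Any] = {}
    for key, value in state_dict.items():
        stripped = True
        # Repeat until no known prefix is left at the start of the name.
        while stripped:
            stripped = False
            for prefix in WRAPPER_PREFIXES:
                if key.startswith(prefix):
                    key = key[len(prefix):]
                    stripped = True
        cleaned[key] = value
    return cleaned


def load_encoder_weights(path: str, key: str = "encoder") -> Dict[str, Any]:
    """Load a checkpoint file and return the cleaned weights stored under key.

    ASSUMPTION: both path and key are hypothetical; inspect the checkpoint's
    top-level keys to find out how the weights are actually stored.
    """
    import torch  # imported lazily so the helper above works without PyTorch

    checkpoint = torch.load(path, map_location="cpu")
    if key not in checkpoint:
        raise KeyError(f"{key!r} not found; available keys: {list(checkpoint)}")
    return strip_wrapper_prefixes(checkpoint[key])
```

If the assumed key is missing, the raised `KeyError` lists the keys that are present, which is usually enough to see how a given checkpoint is organized.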
Data, distribution, and licensing
Because we used public datasets for finetuning, we distribute our model weights under the same license as those datasets: CC BY-NC-SA 4.0. For more details, please refer to the license file linked above. For the license of the code, see the code repository.
Videos from the following datasets were used for finetuning:
Heidelberg colorectal (HeiCo), as described in
- Maier-Hein, L., Wagner, M., Ross, T., Reinke, A., Bodenstedt, S., Full, P. M., ... & Müller-Stich, B. P. (2021). Heidelberg colorectal data set for surgical data science in the sensor operating room. Scientific data, 8(1), 1-11
- Roß, T., Reinke, A., Full, P. M., Wagner, M., Kenngott, H., Apitz, M., ... & Maier-Hein, L. (2021). Comparative validation of multi-instance instrument segmentation in endoscopy: results of the ROBUST-MIS 2019 challenge. Medical image analysis, 70, 101920
Cholec80, as described in
- A.P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. de Mathelin, N. Padoy, EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos, IEEE Transactions on Medical Imaging (TMI), arXiv preprint, 2017
MultiBypass140, as described in
- Lavanchy, J.L., Ramesh, S., Dall’Alba, D. et al. Challenges in multi-centric generalization: phase and step recognition in Roux-en-Y gastric bypass surgery. Int J CARS 19, 2249–2257 (2024)
We are grateful to all authors for creating and publicly releasing these datasets!
Citing
If you find our work useful, please consider citing:
@article{pezold2025leveraging,
  title = {Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis},
  author = {Pezold, Simon and Kurylec, Jérôme A. and Liechti, Jan S. and Müller, Beat P. and Lavanchy, Joël L.},
  journal = {arXiv preprint arXiv:2509.06831},
  year = {2025}
}