Leveraging generic foundation models for multimodal surgical data analysis

Model repository for the paper "Leveraging generic foundation models for multimodal surgical data analysis" by Simon Pezold, Jérôme A. Kurylec, Jan S. Liechti, Beat P. Müller, and Joël L. Lavanchy. For more details, see the paper and its code repository.

Model weights and usage

  • Finetuned V-JEPA (ours): We finetuned V-JEPA ViT-L on unlabeled endoscopic surgical videos, as described in the paper. The resulting model weights can be downloaded here.

  • Pretrained V-JEPA (third-party): The model weights that formed the basis for finetuning can be downloaded via the links provided in Meta Research's V-JEPA v1 code repository – use the link under model zoo / pretrained models / ViT-L / checkpoint or this direct link. Please note that we are not affiliated with Meta Research in any way.

For details on how to use the provided model weights with the approach proposed in our paper, follow the instructions in our code repository.
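Before following those instructions, it can help to check what a downloaded checkpoint file contains. The snippet below is a minimal sketch, not the loading code from our code repository. It assumes that the file is a PyTorch checkpoint, that the encoder weights may sit under a top-level key such as `target_encoder`, and that parameter names may carry a `module.` prefix left over from distributed training. The helper names `strip_prefix` and `load_encoder_state` are hypothetical and chosen for illustration only.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key.

    Keys that do not start with the prefix are kept unchanged.
    """
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


def load_encoder_state(checkpoint_path, key="target_encoder"):
    """Load a checkpoint and return a cleaned state dict (hypothetical helper).

    Assumption: the checkpoint is either a plain state dict or a dict that
    stores the encoder weights under `key`. Verify this against the actual
    file before relying on it.
    """
    import torch  # imported lazily so strip_prefix works without PyTorch

    # Depending on the PyTorch version and the file contents, torch.load may
    # need additional arguments; this call uses the library defaults.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    state_dict = checkpoint[key] if key in checkpoint else checkpoint
    return strip_prefix(state_dict)


if __name__ == "__main__":
    # Small self-contained demonstration with dummy values instead of tensors.
    dummy = {"module.blocks.0.weight": 1, "pos_embed": 2}
    print(list(strip_prefix(dummy).keys()))
```

If the printed keys of a real checkpoint do not match what the model class expects, the checkpoint layout differs from the assumptions above, and the instructions in the code repository take precedence.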

Data, distribution, and licensing

Because we used public datasets for finetuning, we distribute our model weights under the same license as those datasets: CC BY-NC-SA 4.0. For more details, please refer to the license file linked above. For the license of the code, see the code repository.

Videos from the following datasets were used for finetuning:

  • Heidelberg colorectal (HeiCo), as described in

    • Maier-Hein, L., Wagner, M., Ross, T., Reinke, A., Bodenstedt, S., Full, P. M., ... & Müller-Stich, B. P. (2021). Heidelberg colorectal data set for surgical data science in the sensor operating room. Scientific data, 8(1), 1-11
    • Roß, T., Reinke, A., Full, P. M., Wagner, M., Kenngott, H., Apitz, M., ... & Maier-Hein, L. (2021). Comparative validation of multi-instance instrument segmentation in endoscopy: results of the ROBUST-MIS 2019 challenge. Medical image analysis, 70, 101920
  • Cholec80, as described in

    • Twinanda, A. P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., & Padoy, N. (2017). EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. IEEE Transactions on Medical Imaging (TMI), arXiv preprint

  • MultiBypass140, as described in

    • Lavanchy, J. L., Ramesh, S., Dall’Alba, D. et al. (2024). Challenges in multi-centric generalization: phase and step recognition in Roux-en-Y gastric bypass surgery. Int J CARS, 19, 2249–2257

We are grateful to all authors for creating and publicly releasing these datasets!

Citing

If you find our work useful, please consider citing:

@article{pezold2025leveraging,
    title = {Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis}, 
    author = {Pezold, Simon and Kurylec, Jérôme A. and Liechti, Jan S. and Müller, Beat P. and Lavanchy, Joël L.},
    journal = {arXiv preprint arXiv:2509.06831},
    year = {2025}
}