Model Card for Phikon-v2

Phikon-v2 is a Vision Transformer Large pre-trained with the DINOv2 self-supervised method on PANCAN-XL, a dataset of 450M histology images at 20x magnification sampled from 60K whole slide images (WSI). PANCAN-XL only incorporates publicly available datasets, including CPTAC (6,193 WSI) and TCGA (29,502 WSI) for malignant tissue, and GTEx (13,302 WSI) for normal tissue.

Phikon-v2 improves upon Phikon, our previous foundation model pre-trained with iBOT on 40M histology images from TCGA (6k WSI), on a large variety of weakly-supervised tasks tailored for biomarker discovery. Phikon-v2 is evaluated on external cohorts to avoid any data contamination with the PANCAN-XL pre-training dataset, and is benchmarked against an exhaustive panel of representation learning and foundation models.

Model Description

  • Developed by: Owkin, Inc
  • Model type: Pretrained vision backbone (ViT-L/16 via DINOv2)
  • Pretraining dataset: PANCAN-XL, sourced from public histology collections (TCGA, CPTAC, GTEx, TCIA and others).
  • Paper: arXiv (https://arxiv.org/abs/2409.09173)
  • License: Owkin non-commercial licence

How To Use (Feature Extraction)

The following code snippet extracts features from histology images using Phikon-v2 (CLS token). These features can then be used for downstream applications such as ROI classification (via linear or k-NN probing), slide classification (via multiple instance learning), or segmentation (via ViT-Adapter, for instance).

from PIL import Image
import requests
import torch
from transformers import AutoImageProcessor, AutoModel


# Load an image
image = Image.open(
    requests.get(
        "https://github.com/owkin/HistoSSLscaling/blob/main/assets/example.tif?raw=true",
        stream=True
    ).raw
)

# Load phikon-v2
processor = AutoImageProcessor.from_pretrained("owkin/phikon-v2")
model = AutoModel.from_pretrained("owkin/phikon-v2")
model.eval()

# Process the image
inputs = processor(image, return_tensors="pt")

# Get the features
with torch.inference_mode():
    outputs = model(**inputs)
    features = outputs.last_hidden_state[:, 0, :]  # (1, 1024) shape

assert features.shape == (1, 1024)
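
As an illustration of the k-NN probing mentioned above, here is a minimal sketch of a cosine-similarity k-NN classifier operating on 1024-dimensional feature vectors. The function name `knn_predict`, the choice of `k`, and the synthetic features standing in for real Phikon-v2 CLS embeddings are all assumptions made for this example; they are not part of the model card or its evaluation protocol.

```python
import numpy as np


def knn_predict(train_feats, train_labels, query_feats, k=5):
    """Majority vote among the k most cosine-similar training features (hypothetical helper)."""
    # L2-normalise so that a dot product equals cosine similarity
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = query @ train.T                      # (n_query, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest neighbours
    votes = train_labels[nearest]               # (n_query, k)
    # Majority vote for each query row
    return np.array([np.bincount(row).argmax() for row in votes])


# Synthetic stand-in for pre-extracted CLS features: two well-separated classes
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 1024))
train_labels = np.repeat([0, 1], 50)
train_feats = centers[train_labels] + 0.1 * rng.normal(size=(100, 1024))
query_labels = np.array([0, 1, 1, 0])
query_feats = centers[query_labels] + 0.1 * rng.normal(size=(4, 1024))

preds = knn_predict(train_feats, train_labels, query_feats, k=5)
```

In practice, `train_feats` and `query_feats` would be stacks of features extracted from labelled tiles with the snippet above.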

Direct Use (with Pre-Extracted and Frozen Features)

Phikon-v2 can be used with or without fine-tuning on different downstream applications. With pre-extracted, frozen features, a typical application is slide classification using multiple instance learning algorithms (such as ABMIL).
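
To make the slide-classification setting concrete, the following is a minimal sketch of attention-based MIL pooling over a bag of frozen tile features. The class name, hidden dimension, number of classes, and the random bag of 500 tiles are assumptions chosen for illustration; this is the basic (non-gated) attention variant and is not claimed to match the configuration used in the Phikon-v2 benchmarks.

```python
import torch
from torch import nn


class ABMIL(nn.Module):
    """Attention-based MIL pooling over a bag of tile features (illustrative sketch)."""

    def __init__(self, in_dim=1024, hidden_dim=128, n_classes=2):
        super().__init__()
        # Small MLP that assigns one attention score to each tile
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):
        # bag: (n_tiles, in_dim) features of the tiles from one slide
        scores = self.attention(bag)                   # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)         # sums to 1 over tiles
        slide_embedding = (weights * bag).sum(dim=0)   # (in_dim,)
        logits = self.classifier(slide_embedding)      # (n_classes,)
        return logits, weights.squeeze(-1)


torch.manual_seed(0)
bag = torch.randn(500, 1024)  # stand-in for 500 pre-extracted tile features
model = ABMIL()
logits, weights = model(bag)
```

Because the backbone stays frozen, only this small head is trained, and the attention weights give a per-tile importance score for each slide.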

Downstream Use (Finetuning)

You can fine-tune the model on tile-level downstream tasks. This Colab notebook allows you to fine-tune Phikon and Phikon-v2 using LoRA through the Hugging Face API.
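
For readers unfamiliar with LoRA, the sketch below shows the underlying idea on a single linear layer: the pre-trained weight is frozen and a trainable low-rank update is added to it. This is a hand-written illustration in plain PyTorch, not the code from the notebook and not the API of any LoRA library; the class name `LoRALinear` and the values of `r` and `alpha` are assumptions.

```python
import math

import torch
from torch import nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        # Low-rank factors: A is randomly initialised, B starts at zero,
        # so the wrapped layer initially behaves exactly like the base layer
        self.lora_a = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x):
        update = x @ self.lora_a.T @ self.lora_b.T
        return self.base(x) + self.scale * update


torch.manual_seed(0)
base = nn.Linear(1024, 1024)  # same width as a ViT-L projection
layer = LoRALinear(base, r=8, alpha=16)
x = torch.randn(4, 1024)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

With these settings only 16,384 of the 1,065,984 parameters in the wrapped layer are trainable, which is why LoRA makes fine-tuning a ViT-L backbone affordable on modest hardware.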

Training Details

  • Training data: PANCAN-XL, a pretraining dataset composed of 456,060,584 [224×224] histology images at 20× resolution, sampled from 60k H&E WSIs.
  • Training regime: fp16 using PyTorch-FSDP mixed-precision.
  • Training objective: DINOv2 SSL recipe with the following losses:
    • DINO self-distillation loss with multi-crop
    • iBOT masked-image modeling loss
    • KoLeo regularization on [CLS] tokens
  • Training length: 100,000 iterations with a batch size of 4,096
  • Model architecture: ViT-Large (0.3B params): Patch size 16, embedding dimension 1024, 16 heads, MLP FFN
  • Hardware used: 32x4 Nvidia V100 32GB
  • Hours trained: Approx 4,300 GPU hours (33 hours total)
  • Platform: French supercluster Jean-Zay
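
The figures above can be cross-checked with some quick arithmetic. The sketch below assumes that "32x4" means 32 nodes with 4 GPUs each, that the batch size counts source tiles, and that the model uses the standard ViT-Large depth of 24 layers with an MLP ratio of 4; none of these three points is stated explicitly in this card.

```python
# Values taken from the training details above
iterations = 100_000
batch_size = 4_096
dataset_size = 456_060_584
wall_clock_hours = 33
embed_dim = 1024

# Assumptions (not stated in the card)
n_gpus = 32 * 4   # 32 nodes x 4 GPUs
n_layers = 24     # standard ViT-Large depth
mlp_ratio = 4     # standard ViT MLP expansion

tiles_seen = iterations * batch_size      # tiles processed during training
epochs = tiles_seen / dataset_size        # fraction of PANCAN-XL covered
gpu_hours = n_gpus * wall_clock_hours     # compare with "approx 4,300"

# Weights per transformer block, ignoring biases, norms and embeddings:
# 4*d^2 for attention (Q, K, V, output) + 2*mlp_ratio*d^2 for the MLP
params_per_block = (4 + 2 * mlp_ratio) * embed_dim**2
approx_params = n_layers * params_per_block

print(f"tiles seen: {tiles_seen:,} ({epochs:.2f} epochs)")
print(f"GPU hours: {gpu_hours:,}")
print(f"approx. parameters: {approx_params / 1e6:.0f}M")
```

Under these assumptions, training covers roughly 0.9 passes over PANCAN-XL, and the estimates of about 4,200 GPU hours and about 302M parameters are consistent with the reported 4,300 GPU hours and 0.3B parameters.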

Software Dependencies

Python Packages

Repositories

Contact

For any additional questions or comments, contact Alexandre Filiot ([email protected]).

How to cite

@misc{filiot2024phikonv2largepublicfeature,
      title={Phikon-v2, A large and public feature extractor for biomarker prediction}, 
      author={Alexandre Filiot and Paul Jacob and Alice Mac Kain and Charlie Saillard},
      year={2024},
      eprint={2409.09173},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2409.09173}, 
}

Acknowledgements

We thank the DINOv2 authors for their contribution [1].

Computing resources

This work was granted access to the HPC resources of IDRIS under the allocation 2023-A0141012519 made by GENCI.

Datasets

The results published here are partly based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from the GTEx Portal on 07/01/2023.

Third-party licenses

Vision Transformer architectures were derived from facebookresearch/dino (Apache License 2.0) and huggingface/pytorch-image-models (Apache License 2.0). This code is built upon the DINOv2 repository (Apache License 2.0).

The following table provides the license associated with each dataset used for pre-training Phikon-v2.

| Name of the dataset | License | Dataset home page |
|---|---|---|
| TCGA | Open Access | https://portal.gdc.cancer.gov/ |
| TCIA [2] | TCIA Restricted Licence | https://www.cancerimagingarchive.net/ |
| CPTAC [3-14] | CC-BY 3.0 License | https://proteomics.cancer.gov/programs/cptac |
| GTEX | Open Access | https://gtexportal.org/home/downloads/adult-gtex/overview |
| Biobank-CMB [15 - 19] | CC BY 4.0 License | https://moonshotbiobank.cancer.gov/ |
| UPENN-GBM [20] | CC BY 4.0 License | https://www.cancerimagingarchive.net/collection/upenn-gbm/ |
| Post-NAT-BRCA [21] | CC BY 3.0 License | https://www.cancerimagingarchive.net/collection/post-nat-brca/ |
| Breast Metastases (MSKCC) [22] | CC BY 3.0 License | https://www.cancerimagingarchive.net/collection/sln-breast/ |
| HER2 Tumor ROIs (v3) [23] | CC BY 4.0 License | https://www.cancerimagingarchive.net/collection/her2-tumor-rois/ |
| TUH DPath Breast | Free and Without Restriction | https://isip.piconepress.com/projects/nedc/html/tuh_dpath/ |
| Hungarian Colorectal Screening [24] | CC BY 4.0 License | https://www.cancerimagingarchive.net/collection/hungarian-colorectal-screening/ |
| PennyCuick [25] | CC BY 4.0 License | https://idr.openmicroscopy.org/webclient/?show=project-1251 |
| NLST-pathology-1225 [26] | CC BY 4.0 License | https://www.cancerimagingarchive.net/collection/nlst/ |
| Ovarian Bevacizumab Response [27] | CC BY 4.0 License | https://www.cancerimagingarchive.net/collection/ovarian-bevacizumab-response/ |
| PTRC-HGSOC [28] | CC BY 4.0 License | https://www.cancerimagingarchive.net/collection/ptrc-hgsoc/ |
| Hodis [29] | CC BY 4.0 License | https://idr.openmicroscopy.org/webclient/?show=project-2351 |

References

  1. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). Dinov2: Learning robust visual features without supervision. arXiv.

  2. Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. Journal of Digital Imaging, 26(6), 1045–1057. Springer Science and Business Media LLC. https://doi.org/10.1007/s10278-013-9622-7

  3. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2019). The Clinical Proteomic Tumor Analysis Consortium Acute Myeloid Leukemia Collection (CPTAC-AML) (Version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.B6FOE619

  4. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Glioblastoma Multiforme Collection (CPTAC-GBM) (Version 15) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.3RJE41Q1

  5. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Breast Invasive Carcinoma Collection (CPTAC-BRCA) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.CAEM-YS80

  6. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Colon Adenocarcinoma Collection (CPTAC-COAD) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.YZWQ-ZZ63

  7. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Head and Neck Squamous Cell Carcinoma Collection (CPTAC-HNSCC) (Version 16) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.UW45NH81

  8. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC) (Version 13) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.OBLAMN27

  9. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Lung Squamous Cell Carcinoma Collection (CPTAC-LSCC) (Version 15) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.6EMUB5L2

  10. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2019). The Clinical Proteomic Tumor Analysis Consortium Sarcomas Collection (CPTAC-SAR) (Version 10) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.9BT23R95

  11. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Ovarian Serous Cystadenocarcinoma Collection (CPTAC-OV) (Version 3) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.ZS4A-JD58

  12. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma Collection (CPTAC-PDA) (Version 14) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.SC20FO18

  13. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Cutaneous Melanoma Collection (CPTAC-CM) (Version 11) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.ODU24GZE

  14. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2019). The Clinical Proteomic Tumor Analysis Consortium Uterine Corpus Endometrial Carcinoma Collection (CPTAC-UCEC) (Version 12) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.3R3JUISW

  15. Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Colorectal Cancer Collection (CMB-CRC) (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/DJG7-GZ87

  16. Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Melanoma Collection (CMB-MEL) (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/GWSP-WH72

  17. Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Gastroesophageal Cancer Collection (CMB-GEC) (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/E7KH-R486

  18. Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Lung Cancer Collection (CMB-LCA) (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/3CX3-S132

  19. Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Multiple Myeloma Collection (CMB-MML) (Version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/SZKB-SW39

  20. Bakas, S., Sako, C., Akbari, H., Bilello, M., Sotiras, A., Shukla, G., Rudie, J. D., Flores Santamaria, N., Fathi Kazerooni, A., Pati, S., Rathore, S., Mamourian, E., Ha, S. M., Parker, W., Doshi, J., Baid, U., Bergman, M., Binder, Z. A., Verma, R., … Davatzikos, C. (2021). Multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System (UPENN-GBM) (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.709X-DN49

  21. Martel, A. L., Nofech-Mozes, S., Salama, S., Akbar, S., & Peikari, M. (2019). Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.4YIBTJNO

  22. Campanella, G., Hanna, M. G., Brogi, E., & Fuchs, T. J. (2019). Breast metastases to axillary lymph nodes [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.3XBN2JCC

  23. Farahmand, S., Fernandez, A. I., Ahmed, F. S., Rimm, D. L., Chuang, J. H., Reisenbichler, E., & Zarringhalam, K. (2022). HER2 and trastuzumab treatment response H&E slides with tumor ROI annotations (Version 3) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/E65C-AM96

  24. Pataki, B. A., Olar, A., Ribli, D., Pesti, A., Kontsek, E., Gyongyosi, B., Bilecz, A., Kovács, T., Kovács, K. A., Kiss, Z., Szócska, M., Pollner, P., & Csabai, I. (2021). Digital pathological slides from Hungarian (Europe) colorectal cancer screening (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.9CJF-0127

  25. Pennycuick, A., Teixeira, V. H., AbdulJabbar, K., Raza, S. E. A., Lund, T., Akarca, A. U., Rosenthal, R., Kalinke, L., Chandrasekharan, D. P., Pipinikas, C. P., Lee-Six, H., Hynds, R. E., Gowers, K. H. C., Henry, J. Y., Millar, F. R., Hagos, Y. B., Denais, C., Falzon, M., Moore, D. A., Antoniou, S., Durrenberger, P. F., Furness, A. J., Carroll, B., Marceaux, C., Asselin-Labat, M. L., Larson, W., Betts, C., Coussens, L. M., Thakrar, R. M., George, J., Swanton, C., Thirlwell, C., Campbell, P. J., Marafioti, T., Yuan, Y., Quezada, S. A., McGranahan, N., & Janes, S. M. (2020). Immune surveillance in clinical regression of preinvasive squamous cell lung cancer. Cancer Discovery, 10(10), 1489-1499. https://doi.org/10.1158/2159-8290.CD-19-1366

  26. National Lung Screening Trial Research Team. (2013). Data from the National Lung Screening Trial (NLST) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.HMQ8-J677

  27. Wang, C.-W., Chang, C.-C., Lo, S.-C., Lin, Y.-J., Liou, Y.-A., Hsu, P.-C., Lee, Y.-C., & Chao, T.-K. (2021). A dataset of histopathological whole slide images for classification of treatment effectiveness to ovarian cancer (Ovarian Bevacizumab Response) (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.985G-EY35

  28. Chowdhury, S., Kennedy, J. J., Ivey, R. G., Murillo, O., Hosseini, N., Song, X., Petralia, F., Calinawan, A., Voytovich, U. J., Savage, S. R., Berry, A., Reva, B., Ozbek, U., Krek, A., Ma, W., da Veiga Leprevost, F., Ji, J., Yoo, S., Lin, C., … Paulovich, A. G. (2023). Proteogenomic analysis of chemo-refractory high grade serous ovarian cancer (PTRC-HGSOC) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/6RDA-P940

  29. Hodis, E., Torlai Triglia, E., Kwon, J. Y. H., Biancalani, T., Zakka, L. R., Parkar, S., Hütter, J. C., Buffoni, L., Delorey, T. M., Phillips, D., Dionne, D., Nguyen, L. T., Schapiro, D., Maliga, Z., Jacobson, C. A., Hendel, A., Rozenblatt-Rosen, O., Mihm, M. C. Jr., Garraway, L. A., & Regev, A. (2022). Stepwise-edited, human melanoma models reveal mutations' effect on tumor and microenvironment. Science, 376(6592), eabi8175. https://doi.org/10.1126/science.abi8175
