Update pipeline tag and add library name
#1
by nielsr (HF Staff) - opened

README.md CHANGED
````diff
@@ -1,129 +1,26 @@
 ---
-license:
-
-datasets:
-- ILSVRC/imagenet-1k
-pipeline_tag: image-feature-extraction
+license: apache-2.0
+task_categories:
+- video-text-to-text
 ---
 
-## 
-
-## Model Performance
-
-MambaVision demonstrates strong performance, achieving a new SOTA Pareto front in terms of Top-1 accuracy and throughput.
-
-<p align="center">
-<img src="https://github.com/NVlabs/MambaVision/assets/26806394/79dcf841-3966-4b77-883d-76cd5e1d4320" width=70% height=70% class="center">
-</p>
-
-## Model Usage
-
-It is highly recommended to install the requirements for MambaVision by running the following:
-
-```bash
-pip install mambavision
-```
-
-<p align="center">
-<img src="https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/4duSnqLf4lrNiAHczSmAN.jpeg" width=70% height=70% class="center">
-</p>
-
-The following snippet can be used for image classification:
-
-```python
-from transformers import AutoModelForImageClassification
-from PIL import Image
-from timm.data.transforms_factory import create_transform
-import requests
-
-model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-B-1K", trust_remote_code=True)
-
-# eval mode for inference
-model.cuda().eval()
-
-# prepare image for the model
-url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
-image = Image.open(requests.get(url, stream=True).raw)
-input_resolution = (3, 224, 224)  # MambaVision supports any input resolution
-
-transform = create_transform(input_size=input_resolution,
-                             is_training=False,
-                             mean=model.config.mean,
-                             std=model.config.std,
-                             crop_mode=model.config.crop_mode,
-                             crop_pct=model.config.crop_pct)
-
-inputs = transform(image).unsqueeze(0).cuda()
-# model inference
-outputs = model(inputs)
-logits = outputs['logits']
-predicted_class_idx = logits.argmax(-1).item()
-print("Predicted class:", model.config.id2label[predicted_class_idx])
-```
-
-The predicted label is `brown bear, bruin, Ursus arctos`.
-
-### Feature Extraction
-
-MambaVision can also be used as a generic feature extractor.
-
-Specifically, we can extract the outputs of each of the model's four stages, as well as the final average-pooled and flattened features.
-
-The following snippet can be used for feature extraction:
-
-```python
-from transformers import AutoModel
-from PIL import Image
-from timm.data.transforms_factory import create_transform
-import requests
-
-model = AutoModel.from_pretrained("nvidia/MambaVision-B-1K", trust_remote_code=True)
-
-# eval mode for inference
-model.cuda().eval()
-
-# prepare image for the model
-url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
-image = Image.open(requests.get(url, stream=True).raw)
-input_resolution = (3, 224, 224)  # MambaVision supports any input resolution
-
-transform = create_transform(input_size=input_resolution,
-                             is_training=False,
-                             mean=model.config.mean,
-                             std=model.config.std,
-                             crop_mode=model.config.crop_mode,
-                             crop_pct=model.config.crop_pct)
-inputs = transform(image).unsqueeze(0).cuda()
-# model inference
-out_avg_pool, features = model(inputs)
-print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 640])
-print("Number of stages in extracted features:", len(features))  # 4 stages
-print("Size of extracted features in stage 1:", features[0].size())  # torch.Size([1, 80, 56, 56])
-print("Size of extracted features in stage 4:", features[3].size())  # torch.Size([1, 640, 7, 7])
-```
-
-### License
-
-[NVIDIA Source Code License-NC](https://huggingface.co/nvidia/MambaVision-T-1K/blob/main/LICENSE)
+This repository contains the data for the paper [PAVE: Patching and Adapting Video Large Language Models](https://arxiv.org/abs/2503.19794).
+
+Code: https://github.com/dragonlzm/PAVE
+
+## Citation [optional]
+arxiv.org/abs/2503.19794
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+**BibTeX:**
+```
+@misc{liu2025pavepatchingadaptingvideo,
+      title={PAVE: Patching and Adapting Video Large Language Models},
+      author={Zhuoming Liu and Yiquan Li and Khoi Duc Nguyen and Yiwu Zhong and Yin Li},
+      year={2025},
+      eprint={2503.19794},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2503.19794},
+}
+```
````
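For reference, the same front-matter change can also be applied programmatically with the `metadata_update` helper from `huggingface_hub`. A minimal sketch, assuming a placeholder repo id (the diff above does not name the target repository):

```python
from huggingface_hub import metadata_update

# Hypothetical repo id -- substitute the dataset repository this PR targets.
repo_id = "username/pave-data"

# Set the license and task category in README.md's YAML front matter,
# opening a pull request instead of committing to main directly.
metadata_update(
    repo_id,
    {"license": "apache-2.0", "task_categories": ["video-text-to-text"]},
    repo_type="dataset",
    overwrite=True,  # allow replacing existing front-matter values
    create_pr=True,
)
```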
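Once the updated card is merged, the PAVE data itself can be pulled locally via `huggingface_hub`; again a minimal sketch, with the same placeholder repo id:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id -- substitute the actual PAVE dataset repository.
repo_id = "username/pave-data"

# Download every file in the dataset repo to the local HF cache and
# return the path of the snapshot directory.
local_dir = snapshot_download(repo_id=repo_id, repo_type="dataset")
print("Dataset snapshot at:", local_dir)
```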
