HuggingFaceTB/SmolVLM2-2.2B-Instruct

Image-Text-to-Text

video-text-to-text

mfarre HF staff committed on Feb 20

Commit 06ac4f8 · verified · 1 Parent(s): 5719134

Update README.md

Files changed (1)

README.md +1 -1

@@ -46,7 +46,7 @@ SmolVLM2-500M-Video is a lightweight multimodal model designed to analyze video
 SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input consists of text queries along with video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
-To fine-tune SmolVLM2 on a specific task, you can follow [the fine-tuning tutorial](UPDATE).
+To fine-tune SmolVLM2 on a specific task, you can follow [the fine-tuning tutorial](https://github.com/huggingface/smollm/blob/main/vision/finetuning/Smol_VLM_FT.ipynb).
 ## Evaluation