shreydan's picture
Update README.md
3965d42 verified
metadata
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
language:
  - en
base_model:
  - HuggingFaceTB/SmolVLM-256M-Instruct

SmolVLM-256M-Detection

experimental and for learning purposes; wouldn't recommend using unless

check out github/shreydan/VLM-OD for results and details.

Usage

  • load the model same as HuggingFaceTB/SmolVLM-256M-Instruct
  • inputs: detect car / detect person;car etc. Apply chat template with add_generation_prompt=True
  • parse the output tokens <loc000> to <loc255> (code in eval.ipynb of my github repo)
  • to reiterate I have not added any <locXXX> special tokens (that needs wayyy more training than this method), the model itself is generating them.

outputs