---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
---
# SmolVLM-256M-Detection
> Experimental and for learning purposes only; not recommended for anything beyond that.

Check out [github/shreydan/VLM-OD](https://github.com/shreydan/VLM-OD/) for results and details.
## Usage
- Load the model the same way as `HuggingFaceTB/SmolVLM-256M-Instruct` (see the sketch after this list).
- Inputs: `detect car`, `detect person;car`, etc. Apply the chat template with `add_generation_prompt=True`.
- Parse the output tokens `<loc000>` to `<loc255>` into box coordinates (code in `eval.ipynb` of my GitHub repo; a rough parsing sketch is also included below).
- To reiterate: no `<locXXX>` special tokens were added to the tokenizer (that would need way more training than this method); the model generates them as plain text.
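A minimal loading-and-inference sketch, assuming this repo is hosted as `shreydan/SmolVLM-256M-Detection` and follows the standard SmolVLM processor/model API (adjust the repo id, image path, and prompt as needed):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# repo id assumed from the model name; change if your checkpoint lives elsewhere
model_id = "shreydan/SmolVLM-256M-Detection"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("street.jpg").convert("RGB")  # any RGB image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "detect person;car"},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=128)
# <locXXX> are plain text (not special tokens), so they survive decoding
output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```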
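For turning the generated text into pixel-space boxes, here is a rough parsing sketch. The grouping (four consecutive loc values per box) and the coordinate order are assumptions; `eval.ipynb` in the GitHub repo has the actual parsing code.

```python
import re

def parse_loc_boxes(text, image_width, image_height):
    """Extract <locXXX> values from generated text and group them into boxes.

    Assumes each box is four consecutive values normalized to 0-255; the
    (xmin, ymin, xmax, ymax) order used here is a guess -- check eval.ipynb
    in the repo for the exact convention.
    """
    vals = [int(v) for v in re.findall(r"<loc(\d{3})>", text)]
    boxes = []
    for i in range(0, len(vals) - 3, 4):
        x1, y1, x2, y2 = vals[i : i + 4]
        boxes.append((
            x1 / 255 * image_width,
            y1 / 255 * image_height,
            x2 / 255 * image_width,
            y2 / 255 * image_height,
        ))
    return boxes
```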
![outputs](https://raw.githubusercontent.com/shreydan/VLM-OD/refs/heads/master/outputs.png)