cnmoro/nano-image-captioning

vision-encoder-decoder

image-text-to-text

Inference Endpoints


cnmoro committed on Jan 28

Commit 5162a2a · verified · 1 Parent(s): 28b8643

Update README.md

Files changed (1)

README.md +59 -3


@@ -1,3 +1,59 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+base_model:
+- WinKawaks/vit-tiny-patch16-224
+- google/bert_uncased_L-2_H-128_A-2
+pipeline_tag: image-to-text
+library_name: transformers
+tags:
+- vit
+- bert
+- vision
+- caption
+- captioning
+- image
+---
+An image captioning model based on bert-tiny and vit-tiny, weighing only 40 MB!
+Works very fast on CPU.
+```python
+from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
+import requests, time
+from PIL import Image
+model_path = "cnmoro/nano-image-captioning"
+# load the image captioning model and corresponding tokenizer and image processor
+model = VisionEncoderDecoderModel.from_pretrained(model_path)
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+image_processor = AutoImageProcessor.from_pretrained(model_path)
+# preprocess an image
+url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+pixel_values = image_processor(image, return_tensors="pt").pixel_values
+start = time.time()
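+# generate caption - suggested settings (temperature, top_p and top_k only take effect when sampling is enabled, e.g. do_sample=True here or in the model's generation config)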
+generated_ids = model.generate(
+    pixel_values,
+    temperature=0.7,
+    top_p=0.8,
+    top_k=50,
+    num_beams=3 # you can use 1 for even faster inference with a small drop in quality
+)
+generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+end = time.time()
+print(generated_text)
+# a group of people are in the middle of a city.
+print(f"Time taken: {end - start} seconds")
+# Time taken: 0.07550048828125 seconds
+# on CPU!
+```
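
The README example in this diff times a single `model.generate` call with `time.time()`, so the reported figure can vary with warm-up effects and system noise. Below is a minimal sketch of a more repeatable measurement, assuming a warm-up run followed by several timed repeats and a median. The helper name `time_call` and its `warmup` and `repeats` parameters are hypothetical and not part of the commit.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=5):
    """Return the median wall-clock time, in seconds, of calling fn().

    Hypothetical helper for illustration; not part of the README.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are discarded (lazy initialisation, caches)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without downloading the model.
median_seconds = time_call(lambda: sum(range(100_000)))
print(f"Median time: {median_seconds:.6f} seconds")
```

With the objects from the README example loaded, the stand-in workload could be swapped for something like `lambda: model.generate(pixel_values, num_beams=3)`.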