🎯 This is the DPO-optimized version of the base model TraVisionLM-base.

When compared to the base model, the DPO version answers questions more accurately, truthfully, and with greater detail.

🤖 What is Direct Preference Optimization (DPO)?

Direct Preference Optimization is a technique for aligning a model's behavior with human preferences. During training, the model is shown competing answers to the same question, one preferred by humans and one rejected, and it learns to assign higher probability to the preferred answer. Because the model learns from preference feedback as well as raw data, its responses tend to be more reliable and truthful. In practice, DPO helps reduce hallucinations and improves the quality and accuracy of the model's answers.
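
To make the idea concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It assumes you already have the summed log-probabilities of the preferred ("chosen") and the rejected answer under both the model being trained and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the training notebook for this model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) answer pair.

    All arguments except beta are log-probabilities of a full answer.
    beta scales how strongly the policy is pushed away from the reference.
    """
    # How much more likely each answer is under the policy than under the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin means the policy favors the chosen answer more than the reference does
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical numbers: the policy prefers the chosen answer, so the loss drops below log(2)
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log(2); training lowers the loss by widening the margin in favor of the preferred answer.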

🚀 Model demo: TRaVisionLM-DPO-Demo

📚 Visual Language Model DPO Training Notebook: Colab Notebook

Model Description

Developed by: ucsahin
Model type: Image-Text-to-Text
Language(s) (NLP): Turkish
License: Apache License 2.0

English

🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is a very fast and small (only 875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟

✨ Built to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and use for lightning-fast inference, all without needing any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖

Türkçe

🎉 TraVisionLM: Türünün İlk Örneği! 🚀

🌟 Çok hızlı ve küçük boyutlu (sadece 875M parametre) Türkçe görsel dil modeli! Bir görüntü ve Türkçe talimat verildiğinde Türkçe yanıt üretir! 🌟

✨ Transformers kütüphanesi ile uyumlu olarak geliştirilen TraVisionLM modeli ile, yükleme, eğitme ve dış kütüphaneler kullanmadan hızlı sonuçlar almak çok kolay! ⚡️

Türkçe görsel dil modelini deneyimlemeye hazır mısınız? Hadi başlayalım! 🇹🇷🖼️🤖

How to Get Started with the Model

In Transformers, you can load the model and run inference as follows:

IMPORTANT NOTE: The TraVisionLM model is not yet natively integrated into the Transformers library, so you need to set trust_remote_code=True when loading it. This downloads the configuration_travisionlm.py, modeling_travisionlm.py and processing_travisionlm.py files from the repo. You can inspect the content of these files under the Files and Versions tab and pin a specific version if you have any concerns about malicious code.

from transformers import AutoModelForCausalLM, AutoProcessor
import torch  # only needed if you load the model in bfloat16 or float16
from PIL import Image

model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-DPO', trust_remote_code=True, device_map="cuda")
# you can also load the model in bfloat16 or float16
# model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-DPO', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
processor = AutoProcessor.from_pretrained('ucsahin/TraVisionLM-DPO', trust_remote_code=True)

image = Image.open("galata.jpg").convert("RGB")

prompt = "Resimde gösterilen yapı hangi şehirdedir?"  # visual question answering
# prompt = "Detaylı açıkla"  # detailed caption
# prompt = "Kısaca açıkla" # short caption

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.9, top_k=50, repetition_penalty=1.2)

output_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("Model response: ", output_text)
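
The note above mentions pinning a specific version of the remote code. One way to do this, sketched below, is to pass the `revision` argument of `from_pretrained` so that the weights and the custom code files are fetched from a fixed commit. The helper names `pinned_load_kwargs` and `load_pinned` are hypothetical, and the `REVISION` value is a placeholder: replace it with a full commit hash copied from the Files and Versions tab after you have reviewed the code at that commit.

```python
REPO_ID = "ucsahin/TraVisionLM-DPO"
# Placeholder: "main" always follows the latest commit and pins nothing.
# Replace it with a reviewed commit hash from the Files and Versions tab.
REVISION = "main"


def pinned_load_kwargs(revision):
    """Keyword arguments that load the repo at one fixed revision."""
    return {"trust_remote_code": True, "revision": revision}


def load_pinned(repo_id=REPO_ID, revision=REVISION):
    """Load model and processor from the same pinned revision."""
    # Imported here so the helper above can be used without transformers installed
    from transformers import AutoModelForCausalLM, AutoProcessor

    kwargs = pinned_load_kwargs(revision)
    model = AutoModelForCausalLM.from_pretrained(repo_id, **kwargs)
    processor = AutoProcessor.from_pretrained(repo_id, **kwargs)
    return model, processor
```

Calling `load_pinned()` downloads the model, so run it only once you have set `REVISION` to the commit you trust.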

You can also perform batch inference as follows (make sure that all images have a prompt text associated with them):

from transformers import AutoModelForCausalLM, AutoProcessor
import torch  # only needed if you load the model in bfloat16 or float16
from PIL import Image

model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-DPO', trust_remote_code=True, device_map="cuda")
# you can also load the model in bfloat16 or float16
# model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-DPO', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
processor = AutoProcessor.from_pretrained('ucsahin/TraVisionLM-DPO', trust_remote_code=True)

image = Image.open("galata.jpg").convert("RGB")

prompt_list = [
  'Kısaca açıkla',
  'Detaylı açıkla',
  'Resimde ne görünüyor?',
  'Gündüz vakti mi yoksa akşam vakti mi?',
  'Resimde ilgi çekici unsurlar nelerdir?',
]

inputs = processor(text=prompt_list, images=len(prompt_list)*[image], padding="longest", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.9, top_k=50, repetition_penalty=1.2)

output_text_list = processor.batch_decode(outputs, skip_special_tokens=True)

for output_text in output_text_list:
  print(f"Model response: {output_text}\n\n\n")
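
In the sample output on this card, each decoded response begins with the prompt text, followed by the model's answer. If you only want the answer, a small post-processing step can remove the prompt. The sketch below assumes the decoded string starts with the exact prompt that was passed to the processor; the helper names `strip_prompt` and `split_responses` are hypothetical and not part of the model's code.

```python
def strip_prompt(decoded_text, prompt):
    """Return the answer part of a decoded generation that starts with its prompt."""
    text = decoded_text.lstrip()
    # Only cut when the text really begins with the prompt; otherwise keep everything
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


def split_responses(decoded_texts, prompts):
    """Apply strip_prompt to a batch, pairing each decoded text with its prompt."""
    return [strip_prompt(text, prompt) for text, prompt in zip(decoded_texts, prompts)]


# Example with a shortened response in the same shape as the sample output
decoded = ["Kısaca açıkla\nGörsel, Galata Kulesi'ni göstermektedir."]
for answer in split_responses(decoded, ["Kısaca açıkla"]):
    print(answer)
```

With the batch example, you would call `split_responses(output_text_list, prompt_list)` after `batch_decode`.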

The image used for code examples above:

With this image, the batch example produces output like the following (exact text varies between runs because sampling is enabled):

"""
Model response: Kısaca açıkla
Görsel, İstanbul'un Beyoğlu ilçesinde yer alan Galata Kulesi'ni göstermektedir. Bu kule, tarihi boyunca çeşitli mimari tarzlar ve estetik anlayışlarla öne çıkmaktadır.


Model response: Detaylı açıkla
Görselde, İstanbul'un tarihi yarımadasında yer alan Galata Kulesi dikkat çekiyor. Bu kule, hem mimari yapısı hem de tarihî önemi ile bölgenin önemli bir simgesi olarak öne çıkıyor. Kule, dönemin karakteristik özelliklerini taşırken, özellikle Osmanlı dönemine ait süslemeleri ve zarif hatlarıyla göz alıcı bir görünüme sahip.Kulenin üst kısmında, altın rengi bir kubbe yükseliyor; bu kubbenin ortasında belirgin bir şekilde yer alan detaylar, hem estetik açıdan hoş görünmekte hem de şehrin kültürel zenginliğini artırmaktadır.Kale etrafında ise çeşitli yapılar sıralanmış durumda. Bu yapıların mimarisi, farklı dönemlerden izler taşıyor gibi görünüyor; bazıları taşlardan inşa edilmişken, diğerleri daha modern tasarımlarla bezeli. Ayrıca, su yolu boyunca uzanan şehir manzarası, kentin dinamik yapısını ve denizle olan bağlantısını vurguluyor.Gece saatlerinde çekilmiş olan bu görüntüde, sıcak tonlardaki ışıkların ve gölgelerin oluşturduğu büyüleyici bir atmosfer hâkim. Bu durum, şehrin doğal güzelliklerinin yanı sıra, tarihsel dokusu ve kültürel zenginlikleri üzerinde düşünmeye davet ediyor. Tüm bu unsurlar, İstanbul’un kozmopolit yapısını ve kültürel çeşitliliğini gözler önüne seriyor.


Model response: Resimde ne görünüyor?
Görselde, İstanbul'un Beyoğlu ilçesindeki Galata Kulesi yer alıyor. Bu kule, tarihi ve kültürel zenginlikleri ile bilinir. Güneşin batışıyla birlikte sıcak tonlarda ışıkların yayıldığı bu alan, şehrin kozmopolit yapısını gözler önüne seriyor. Ayrıca, su kenarında inşa edilmiş binalar, hem tarihi dokunun hem de modern hayatın birleşimini sergiliyor.


Model response: Gündüz vakti mi yoksa akşam vakti mi?
Görseldeki ışıklandırma ve renk tonları, günün ilerleyen saatlerini işaret ediyor gibi görünüyor. Bu durum, şehir manzarasını daha da çekici kılıyor.


Model response: Resimde ilgi çekici unsurlar nelerdir?
Görselde dikkat çeken unsurlardan biri, tarihî ve kültürel önemi olan Galata Kulesi'dir. Kule, hem mimari yapısı hem de çevresindeki tarihi yapılarla birlikte, İstanbul'un önemli bir simgesi olarak öne çıkmaktadır. Ayrıca, gün batımı esnasında oluşan atmosferin ışığı, şehrin canlılığını artırmakta ve bu özel anın görsel zenginliğini artırmaktadır. Bu unsurlar, İstanbul'un sosyal ve ekonomik yapısını gözler önüne sermektedir.
"""