VLMs consist of an image encoder block, a projection layer that maps image embeddings into the text embedding space, and a text decoder, connected sequentially 📖 This paper explores using the intermediate states of the image encoder instead of only its final output 🤩 The authors explore three ways of instantiating the dense connector: sparse token integration, sparse channel integration, and dense channel integration. (see the paper, Dense Connector for MLLMs (2405.13800), for how they do it)
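To picture one of the variants, here is a minimal PyTorch sketch of channel-wise integration of intermediate encoder states: my own illustration, not the authors' code, and the layer indices, dimensions, and class name are made-up assumptions.

```python
import torch
import torch.nn as nn

class DenseChannelConnector(nn.Module):
    """Illustrative sketch: fuse several intermediate ViT layers by concatenating
    them along the channel dimension, then project into the LLM embedding space.
    All dimensions and layer choices below are hypothetical."""

    def __init__(self, vision_dim=1024, llm_dim=4096, num_layers_used=3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * num_layers_used, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states, layer_ids=(8, 16, 24)):
        # hidden_states: list of [batch, num_patches, vision_dim] tensors,
        # one per image-encoder layer
        feats = torch.cat([hidden_states[i] for i in layer_ids], dim=-1)
        # returns visual tokens of shape [batch, num_patches, llm_dim]
        # that are fed to the text decoder
        return self.proj(feats)
```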
They integrate all three into LLaVA 1.5 and find that each of the new models outperforms the original LLaVA 1.5 🥹 I tried the model and it seems to work very well. As part of the release, the authors have published various checkpoints based on different decoders (Vicuna 7/13B and Llama 3-8B) that you can find in the collection 🤗
Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉 OWLv2 is a scaled-up version of a model called OWL-ViT, so let's take a look at that first. 📝 OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training. 👀 What's cool is that it can take both image and text queries! This is possible because the image and text features aren't fused together.
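Text-queried detection looks roughly like this with 🤗 Transformers (a minimal sketch: the checkpoint name, query strings, and threshold are just example choices):

```python
import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]  # free-form text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in the original image's coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)

for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(f"{texts[0][label]}: {score:.2f} {box.tolist()}")
```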
Taking a look at the architecture, the authors first do contrastive pre-training of a vision encoder and a text encoder (just like CLIP). They then take that model, remove the final pooling layer, attach a lightweight classification head and a box-prediction head, and fine-tune. During fine-tuning for object detection, the loss is computed over bipartite matches: predicted objects are compared against ground-truth objects, and the goal is to find an optimal one-to-one matching between the two sets, where each prediction is matched to at most one ground-truth object.
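A minimal sketch of such a bipartite match, using the Hungarian algorithm from SciPy; the cost here (class probability plus box L1 distance) is a simplification I chose for illustration, not the exact loss from the paper:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Toy bipartite matcher: each ground-truth box is assigned exactly one prediction.
    pred_logits: [num_queries, num_classes], pred_boxes: [num_queries, 4]
    gt_labels:   [num_gt],                   gt_boxes:   [num_gt, 4]"""
    probs = pred_logits.softmax(-1)                    # class probabilities
    cost_class = -probs[:, gt_labels]                  # [num_queries, num_gt]
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 distance between boxes
    cost = cost_class + cost_box                       # simplified matching cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))  # optimal one-to-one pairs
```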
OWL-ViT is very scalable. Most language models and vision-language models are easy to scale because they don't need labeled data, but this isn't the case for object detection: you still need at least weak supervision. Moreover, only scaling the encoders creates a bottleneck after a while.
The authors wanted to scale OWL-ViT with more data, so they use OWL-ViT itself to generate pseudo-box labels on a much larger set of images, "self-train" a new detector on those labels, and then fine-tune it on human-annotated data.
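Sketching that recipe as code (every function and argument below is a hypothetical placeholder I made up to illustrate the three steps, not a real API):

```python
def self_train_owlv2(owl_vit, image_text_pairs, human_annotated_data):
    """Rough sketch of the self-training recipe; all callables are placeholders."""
    # 1) Pseudo-label a large set of images with the existing OWL-ViT detector,
    #    using text associated with each image as the detection queries.
    pseudo_labels = [owl_vit.detect(img, queries_from(text))
                     for img, text in image_text_pairs]

    # 2) "Self-train" a fresh detector on the machine-generated labels.
    new_detector = train_detector(image_text_pairs, pseudo_labels)

    # 3) Fine-tune on a smaller set of human-annotated boxes.
    return finetune(new_detector, human_annotated_data)
```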
Thanks to this, OWLv2 scales very well and tops the leaderboards on open-vocabulary object detection 👑 If you'd like to try it out, I'll leave a couple of links to apps, notebooks and more in the comments! 🤗