
Basit mustafa

BasitMustafa

AI & ML interests

None yet

Recent Activity

Organizations

MLX Community; Procurement sciences, inc

BasitMustafa's activity

reacted to merve's post with 🔥 5 days ago
A ton of impactful models and datasets in open AI this past week, let's summarize the best 🤩 merve/releases-apr-21-and-may-2-6819dcc84da4190620f448a3

💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
🖼️ > ByteDance released UI-TARS-1.5, a native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation, conversation model
👩🏻‍💻 JetBrains released Mellum models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩
reacted to merve's post with 🔥 8 months ago
I have put together a notebook on Multimodal RAG, where we do not process the documents with hefty pipelines but natively use:
- vidore/colpali for retrieval 📖 it doesn't need indexing with image-text pairs but just images!
- Qwen/Qwen2-VL-2B-Instruct for generation 💬 directly feed images as is to a vision language model with no processing to text!
I used the ColPali implementation from the new 🐭 Byaldi library by @bclavie 🤗
https://github.com/answerdotai/byaldi
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
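
To make the retrieval step more concrete, here is a toy sketch of late-interaction (MaxSim) scoring, the style of scoring that ColPali-type retrievers apply between query token embeddings and page-patch embeddings. It is not the notebook's code and does not call Byaldi or any model: the helper names (`dot`, `maxsim_score`, `retrieve`) and all vectors are hypothetical, made-up values for illustration only.

```python
# Toy late-interaction (MaxSim) retrieval over page "patch" embeddings.
# Assumption: every vector here is an invented illustrative value,
# not the output of vidore/colpali or any other real model.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, page_vecs):
    """For each query vector take its best-matching patch, then sum."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def retrieve(query_vecs, pages, k=1):
    """Return the ids of the k highest-scoring pages.

    `pages` maps a page id to a list of patch vectors for that page image.
    """
    ranked = sorted(
        pages,
        key=lambda page_id: maxsim_score(query_vecs, pages[page_id]),
        reverse=True,
    )
    return ranked[:k]

# Two query "token" embeddings and two pages with two patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_1": [[0.9, 0.1], [0.2, 0.8]],
    "page_2": [[0.1, 0.2], [0.3, 0.1]],
}

top_pages = retrieve(query, pages, k=1)
print(top_pages)
```

In a pipeline like the one described in the post, the top-ranked page images would then be passed as-is to the vision language model for generation, with no intermediate conversion to text.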