unimelb-nlp (The University of Melbourne)

posted an update 2 days ago

Post

5877

Dropped the HeadshotX : a super-realistic headshot adapter for Qwen/Qwen-Image, an image generation model by Qwen. It is an advanced LoRA adaptation of the Qwen-Image model and an upgraded version of prithivMLmods/Qwen-Image-Studio-Realism, offering more precise portrait rendering with a strong focus on realism. The model was trained on diverse face types from across the world, labeled with florence2-en and caption-optimized using prithivMLmods/DeepCaption-VLA-7B. 11(types) × 5 different face types: Asian, Hispanic, Caucasian, Latina, Middle Eastern, etc.

⮞ Model🤗: prithivMLmods/Qwen-Image-HeadshotX

⮞ The Previous Adapter (LoRA): prithivMLmods/Qwen-Image-Studio-Realism

⮞ Collection: prithivMLmods/qwen-image-exp-lora-68a978fe11400bc3165b0c4d

.
.
.
To know more about it, visit the app page or the respective model page!!

2 replies

·

prithivMLmods

posted an update 3 days ago

Post

3259

Comparing: DeepCaption-VLA-7B, built on Qwen2.5-VL-7B-Instruct, is tailored for image captioning and vision-language attribution, focusing on precise, descriptive captions of visual properties, object attributes, and scene details. In contrast, Qwen2.5-VL-7B-Abliterated-Caption-it is fine-tuned for abliterated captioning, generating highly detailed descriptions across diverse visual categories.

Models🤗
✦ DeepCaption-VLA-7B : prithivMLmods/DeepCaption-VLA-7B
✦ Qwen2.5-VL-7B-Abliterated-Caption-it : prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it

Spaces⛵
➜ VisionScope-R2 : prithivMLmods/VisionScope-R2
➜ Qwen2.5-VL-Outpost : prithivMLmods/Qwen2.5-VL-Outpost

Collection🗞️
DeepCaption attr. : prithivMLmods/deepcaption-attr-68b041172ebcb867e45c556a
VL Abliterated-Caption : prithivMLmods/vl-abliterated-caption-68a0443b63182e97a15c47a3
Multimodal VLMs - Until July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

GitHub↗️
> DeepCaption-VLA-7B [4bit-notebook demo] : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb
> Qwen2.5-VL-3B-Abliterated-Caption-it(caption) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Qwen2.5-VL-3B-Abliterated-Caption-it(caption)/Qwen2_5_VL_3B_Abliterated_Caption_it.ipynb

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To know more about it, visit the app page or the respective model page!!

prithivMLmods

posted an update 7 days ago

Post

5386

FastVLMs by Apple are the talk of the week for edge device VLMs and also for consumer-grade VLMs on the Hub. They have some impressive demos available on the Hub for live captioning and inference tasks. Meanwhile, I’m still exploring one of the coolest edge-device multimodal releases—Liquid AI’s LFM2-VL (450M and 1.6B). I’ve also made a live camera video inference demo, which is capable of running on Colab’s free-tier T4 GPU.

🤗Live Captioning Notebooks:
➠ LiquidAI LFM2 VL 1.6B Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LiquidAI-LFM2-VL-Live-Cam/LiquidAI_LFM2_VL_1_6B_Live_Cam.ipynb

➠ LiquidAI LFM2 VL 450M Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LiquidAI-LFM2-VL-Live-Cam/LiquidAI_LFM2_VL_450M_Live_Cam.ipynb

✨I also made a demo for the FastVLM Live Captioning Notebook.
➠ FastVLM 0.5B Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Apple-FastVLM-0.5B-Live-Cam/apple_FastVLM_0_5B_live_cam.ipynb

↗️For more notebooks, kindly visit the following repositories.
➠ Multimodal Outpost Notebooks: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks

Feel free to fork, modify, and explore!

prithivMLmods

posted an update 11 days ago

Post

3401

Introducing prithivMLmods/DeepCaption-VLA-7B, a multimodal VLM designed for reasoning with long-shot captions (Captioning and Vision-Language Attribution). It focuses on defining visual properties, object attributes, and scene details across a wide spectrum of images and aspect ratios, generating attribute-rich image captions. The model supports creative, artistic, and technical applications that require detailed descriptions. 🤗🔥

✦︎ Models: prithivMLmods/DeepCaption-VLA-7B, also includes prithivMLmods/DeepAttriCap-VLA-3B, an experimental model for vision-language attribution.

✦︎ Try the demo here: prithivMLmods/VisionScope-R2

✦︎ Try it now on Google Colab, with support for T4 GPUs in 4-bit quant_type: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb

✦︎ Collection: prithivMLmods/deepcaption-attr-68b041172ebcb867e45c556a

.
.
.

To know more about it, visit the model card of the respective model. !!

4 replies

·

prithivMLmods

posted an update 13 days ago

Post

1232

no flash_attention

prithivMLmods

posted an update 14 days ago

Post

5158

OpenGVLab's InternVL3_5-2B-MPO [Mixed Preference Optimization (MPO)] is a compact vision-language model in the InternVL3.5 series. You can now experience it in the Tiny VLMs Lab, an app featuring 15+ multimodal VLMs ranging from 250M to 4B parameters. These models support tasks such as OCR, reasoning, single-shot answering with small models, and captioning (including ablated variants), across a broad range of visual categories. They are also capable of handling images with complex, sensitive, or nuanced content, while adapting to varying aspect ratios and resolutions.

✨ Space/App : prithivMLmods/Tiny-VLMs-Lab
🫙 Model : OpenGVLab/InternVL3_5-2B-MPO
↗️ Collection: OpenGVLab/internvl35-68ac87bd52ebe953485927fb
🗞️ Paper : https://arxiv.org/pdf/2508.18265
↗️ Multimodal Space Collection : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

To learn more, visit the relevant spaces, collections, and model cards.

2 replies

·

prithivMLmods

posted an update 15 days ago

Post

454

Dropping new adapters for Qwen-Image, including Qwen-Image-Studio-Realism, Qwen-Image-Anime-LoRA, Qwen-Image-Sketch-Smudge, Qwen-Image-Synthetic-Face, and Qwen-Image-Fragmented-Portraiture, with various style intermix compatibilities. For more details, visit the model card.

⤷ Studio Realism : prithivMLmods/Qwen-Image-Studio-Realism
⤷ Image Anime LoRA : prithivMLmods/Qwen-Image-Anime-LoRA
⤷ Sketch Smudge : prithivMLmods/Qwen-Image-Sketch-Smudge
⤷ Synthetic Face : prithivMLmods/Qwen-Image-Synthetic-Face
⤷ Fragmented Portraiture : prithivMLmods/Qwen-Image-Fragmented-Portraiture

Try it here at
✦︎ Qwen-Image-LoRA-DLC : prithivMLmods/Qwen-Image-LoRA-DLC
✦︎ Qwen-Image-Diffusion : prithivMLmods/Qwen-Image-Diffusion

Collection
✦︎ Qwen-Image-Exp-LoRA : prithivMLmods/qwen-image-exp-lora-68a978fe11400bc3165b0c4d
✦︎ Image Gen Apps (Diffusion) - LastUpdated 08/18 : prithivMLmods/image-gen-apps-diffusion-lastupdated-08-18-68a2f4c5ef3e5e394eacc20a

.
.
.

To know more, visit the following spaces, collections, and model cards.

prithivMLmods

posted an update 22 days ago

Post

3424

Added plug-and-play support for Qwen Image LoRA! 🤗⚡

Try it here:
✦︎ Qwen-Image (with LoRA): prithivMLmods/Qwen-Image-Diffusion
✦︎ Collection: prithivMLmods/image-gen-apps-diffusion-lastupdated-08-18-68a2f4c5ef3e5e394eacc20a

prithivMLmods

posted an update 23 days ago

Post

4689

Excited to introduce the Tiny VLMs Lab App for experiencing 15+ multimodal VLMs, ranging from a 250M parameter model to a 4B parameter model, for tasks like OCR, reasoning, small models for single-shot answering, and captioning (abliterated), across a broad range of visual categories including images with complex, sensitive, or nuanced content, while handling varying aspect ratios and resolutions.🧪

🤗 Space/App: prithivMLmods/Tiny-VLMs-Lab

✦︎ Also introducing prithivMLmods/Qwen2.5-VL-3B-Abliterated-Caption-it, tailored for Abliterated Captioning / Uncensored Image Captioning. This release comes as a lighter alternative to the existing Qwen2.5-VL-7B-Abliterated-Caption-it prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it model, making it usable on mid-range GPUs and even experimental on T4 GPUs.

✦︎ Collection: prithivMLmods/vl-abliterated-caption-68a0443b63182e97a15c47a3
✦︎ GitHub: https://github.com/PRITHIVSAKTHIUR/Tiny-VLMs-Lab
.
.
.
To know more about it, visit the app page or the respective model page!!

prithivMLmods

posted an update 27 days ago

Post

3192

Try Liquid AI's all-new multimodal models: LFM2-VL-1.6B & LFM2-VL-450M! Demo with the Gradio UI and ReportLab support and both models are runnable on T4 GPU!

↗ LFM2-VL-1.6B-LiquidAI : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LFM2-VL-1.6B-LiquidAI/LFM2-VL-1.6B_ReportLab.ipynb

↗ LFM2-VL-450M-LiquidAI : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LFM2-VL-450M-LiquidAI/LFM2-VL-450M_ReportLab.ipynb

.
.
.
To know more about it, visit the multimodal outpost notebooks !!

1 reply

·

prithivMLmods

posted an update about 1 month ago

Post

4389

On the verge of releasing Poseidon-Reasoning-5M, a dataset built to excel in general thought processes, mathematics, and science across a diverse mixture of domains, I’m also dropping the Gargantua-R1-Compact dataset, a collection of over six million high-quality reasoning QA pair traces. 🤗🚀

✦ Gargantua-R1-Compact : prithivMLmods/Gargantua-R1-Compact

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Gargantua-R1-Compact", split="train")

Additionally, I’m adding the mini version of Gargantua — the Gargantua-R1-Wee : prithivMLmods/Gargantua-R1-Wee

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Gargantua-R1-Wee", split="train")

The composition spans 73.93% core mathematical reasoning involving problems, proofs, and computational challenges, 12.11% across diverse scientific domains such as physics, chemistry, biology, and interdisciplinary topics, 11.35% in competitive coding covering algorithms and data structures, 1.37% in academic science focusing on research-level methodology, 0.95% in creative and analytical reasoning through logic puzzles and problem-solving tasks, 0.25% in specialized technical areas like MLOps, LLMs, diffusion models, and CUDA, and 0.06% involving data from graphs and charts converted into structured JSON formats. Designed with both rich contextual depth and formal structural clarity, Gargantua-R1-Compact is an optimal resource for advancing research in symbolic reasoning, interpretability, and high-precision question answering in mathematical domains.

✦ Collection : prithivMLmods/gargantua-r1-mod-6896bfd7834e82b89ad2b38b

To know more about it, visit the dataset card of the respective dataset. !!

Jinruiy

updated a dataset about 1 month ago

unimelb-nlp/MultiEup-v2

Preview • Updated Aug 8 • 109 • 1

Jinruiy

published a dataset about 1 month ago

unimelb-nlp/MultiEup-v2

Preview • Updated Aug 8 • 109 • 1

Jinruiy

updated a dataset about 1 month ago

unimelb-nlp/EuroParlVote

Viewer • Updated Aug 8 • 22.1k • 94 • 1

prithivMLmods

posted an update about 1 month ago

Post

2226

I've added the demo of the openbmb/MiniCPM-V-4 model to the Hugging Face Space:
prithivMLmods/Multimodal-VLM-Thinking

✨ MiniCPM-V 4.0 is the latest efficient model in the MiniCPM-V series. The model is built based on SigLIP2-400M and MiniCPM4-3B, with a total of 4.1B parameters. It inherits the strong single-image, multi-image, and video understanding performance of MiniCPM-V 2.6 with largely improved efficiency.

✨ With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. This performance surpasses GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B parameters, OpenCompass 65.2), and Qwen2.5-VL-3B-Instruct (3.8B parameters, OpenCompass 64.5). It also shows good performance in multi-image and video understanding.

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To know more about it, visit the model card of the respective model. !!

prithivMLmods

posted an update about 1 month ago

Post

4230

Qwen Image – The Latest Image Generation Model🔥

Below are some samples generated using the Qwen Image Diffusion Model. Qwen-Image, a 20B MMDiT model for next-generation text-to-image generation, preserves typographic details, layout coherence, and contextual harmony with stunning accuracy. It is especially strong at creating stunning graphic posters with native text. The model is now open-source. [ 𝚀𝚠𝚎𝚗-𝙸𝚖𝚊𝚐𝚎 : Qwen/Qwen-Image ]

⤷ Try the Qwen Image demo here: prithivMLmods/Qwen-Image-Diffusion

⤷ Qwen-Image Technical Report : Qwen-Image Technical Report (2508.02324)
⤷ Qwen Image [GitHub] : https://github.com/QwenLM/Qwen-Image

Even more impressively, it demonstrates a strong ability to understand images. The model supports a wide range of vision-related tasks such as object detection, semantic segmentation, depth and edge (Canny) estimation, novel view synthesis, and image super-resolution. While each task is technically distinct, they can all be viewed as advanced forms of intelligent image editing driven by deep visual understanding. Collectively, these capabilities position Qwen-Image as more than just a tool for generating appealing visuals, it serves as a versatile foundation model for intelligent visual creation and transformation, seamlessly blending language, layout, and imagery.

Qwen-Image uses a dual-stream MMDiT architecture with a frozen Qwen2.5-VL, VAE encoder, RMSNorm for QK-Norm, LayerNorm elsewhere, and a custom MSRoPE scheme for joint image-text positional encoding.

.
.
.
To know more about it, visit the model card of the respective model. !!

prithivMLmods

posted an update about 1 month ago

Post

3182

Introducing Camel-Doc-OCR-080125(v2), a document content-structure retrieval VLM designed for content extraction and summarization. This is the second model in the Camel Doc OCR VLM series, following Camel-Doc-OCR-062825(v1). The new version fixes formal table reconstruction issues in both en and zh language, achieving optimal performance for long-context inferences.🤗🐪

⤷ Camel-Doc-OCR(v2) : prithivMLmods/Camel-Doc-OCR-080125
⤷ Camel-Doc-OCR(v1) : prithivMLmods/Camel-Doc-OCR-062825
⤷ Demo : prithivMLmods/core-OCR

Multimodal Model Collections and Spaces:

➝ Camel-Doc-OCR : prithivMLmods/camel-doc-ocr-080125-688c0c61c5dba648756f31f8
➝ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
➝ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
➝ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

.
.
.
To know more about it, visit the model card of the respective model. !!

2 replies

·

Jinruiy

published a dataset about 1 month ago

unimelb-nlp/EuroParlVote

Viewer • Updated Aug 8 • 22.1k • 94 • 1

prithivMLmods

posted an update about 1 month ago

Post

1094

Exciting to bring the explicitly grounded experimental reasoning model, Lumian-VLR-7B-Thinking, built on top of Qwen2.5-VL, featuring reasoning-aware trajectories with enhanced spatial perception. Along with this, we’ve also added a demo for the model while bringing some of the latest and most interesting models available on the hub to make full use of the remaining resources.

✨ Multimodal-VLM-Thinking : prithivMLmods/Multimodal-VLM-Thinking
✨ Multimodal-VLM-OCR : prithivMLmods/Multimodal-VLM-v1.0

✦ Models used in these spaces:

✨ Lumian-VLR-7B-Thinking : prithivMLmods/Lumian-VLR-7B-Thinking
✨ Enesidaon-VLR-7B-no-Thinking : prithivMLmods/Enesidaon-VLR-7B-no-Thinking
✨ GLM-4.1V-9B-Thinking : zai-org/GLM-4.1V-9B-Thinking
✨ DREX-062225-exp : prithivMLmods/DREX-062225-exp & more ...

✦ Multimodal Model Collections and Spaces:

✨ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
✨ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

.
.
.
To know more about it, visit the model card of the respective model. !!

prithivMLmods

posted an update about 1 month ago

Post

4839

Explore OCR, Captioning, and Visual Understanding with Cutting-Edge Models on Hugging Face. 🤗🧪

I’ve put together a collection of Google Colab notebooks to experiment with some of the most exciting models available on the Hugging Face Hub focused on OCR, image captioning, and visual understanding tasks. [Image-to-Text] / [Image-Text-to-Text]

> 📖 OCR-ReportLab-Notebooks : prithivMLmods/OCR-ReportLab-Notebooks

These notebooks are built for quick prototyping and run on free T4 GPUs, making them perfect for experimentation, testing ideas, or just exploring what’s possible with modern vision-language models.

Note: The experimental notebooks are compiled with models that fit within the T4 GPU (free-tier) limits. More models along with their notebooks will be added over time.

The University of Melbourne

AI & ML interests

Recent Activity

unimelb-nlp/MultiEup-v2

unimelb-nlp/MultiEup-v2

unimelb-nlp/EuroParlVote

unimelb-nlp/EuroParlVote

AI & ML interests

Recent Activity

Team members 59

unimelb-nlp's activity