AI & ML interests

Small LMs for small computers

prithivMLmods 
posted an update about 10 hours ago
Dropped HeadshotX: a super-realistic headshot adapter for Qwen/Qwen-Image, an image generation model by Qwen. It is an advanced LoRA adaptation of the Qwen-Image model and an upgraded version of prithivMLmods/Qwen-Image-Studio-Realism, offering more precise portrait rendering with a strong focus on realism. The model was trained on diverse face types from across the world, labeled with florence2-en and caption-optimized using prithivMLmods/DeepCaption-VLA-7B. 11(types) × 5 different face types: Asian, Hispanic, Caucasian, Latina, Middle Eastern, etc.

⮞ Model🤗: prithivMLmods/Qwen-Image-HeadshotX

⮞ The Previous Adapter (LoRA): prithivMLmods/Qwen-Image-Studio-Realism

⮞ Collection: prithivMLmods/qwen-image-exp-lora-68a978fe11400bc3165b0c4d
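
For readers who want to try the adapter outside the demo app, here is a minimal sketch of attaching a LoRA to the base model with the diffusers library. It assumes a recent diffusers release with Qwen-Image support and a GPU with enough memory. The function names, the step count, and the idea that the adapter loads without an explicit weight file name are illustrative assumptions, not details taken from the model card.

```python
def load_headshot_pipeline(device: str = "cuda"):
    """Load Qwen-Image and attach the HeadshotX LoRA (downloads large weights)."""
    # Imported lazily so this file can be imported without torch/diffusers installed.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
    # Assumption: the repo exposes a single LoRA file. If it holds several,
    # a weight_name="..." argument would be needed here.
    pipe.load_lora_weights("prithivMLmods/Qwen-Image-HeadshotX")
    return pipe.to(device)


def generate_headshot(pipe, prompt: str, steps: int = 30):
    """Run one text-to-image call and return the first PIL image."""
    return pipe(prompt=prompt, num_inference_steps=steps).images[0]
```

Check the model card for any recommended prompt wording or sampler settings before relying on the defaults above.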

.
.
.
To know more about it, visit the app page or the respective model page!!
Nymbo 
posted an update about 16 hours ago
I have a few updates to my MCP server I wanna share: New Memory tool, improvements to web search & speech generation.

# Memory_Manager Tool

We now have a Memory_Manager tool. Ask ChatGPT to write all its memories verbatim, then tell gpt-oss-20b to save each one using the tool, then take them anywhere! It stores memories in a memories.json file in the repo, no external database required.
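
To illustrate the "no external database" idea, here is a minimal sketch of a JSON-file memory store. It is not the actual Memory_Manager code: the entry fields (`id`, `text`, `tags`, `created`) and the function names are hypothetical, and only the `memories.json` file name comes from the post.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def load_memories(path: Path) -> list:
    """Return all stored memories, or an empty list if the file does not exist yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_memory(path: Path, text: str, tags: Optional[list] = None) -> dict:
    """Append one memory to the JSON file and return the stored entry."""
    memories = load_memories(path)
    entry = {
        "id": len(memories) + 1,  # hypothetical schema: sequential ids
        "text": text,
        "tags": tags or [],
        "created": datetime.now(timezone.utc).isoformat(),
    }
    memories.append(entry)
    path.write_text(json.dumps(memories, indent=2, ensure_ascii=False), encoding="utf-8")
    return entry


def search_memories(path: Path, query: str) -> list:
    """Case-insensitive substring search over the memory text."""
    q = query.lower()
    return [m for m in load_memories(path) if q in m["text"].lower()]
```

Because the store is a single file, moving memories to another machine is just a matter of copying `memories.json`.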

The Memory_Manager tool is currently hidden from the HF space because it's intended for local use. It's enabled by providing a HF_READ_TOKEN in the env secrets, although it doesn't actually use the key for anything. There's probably a cleaner way of ensuring memory is only used locally, I'll come back to this.

# Fetch & Websearch

The Fetch_Webpage tool has been simplified a lot. It now converts the page to Markdown and returns the page with three length settings (Brief, Standard, Full). This is a lot more reliable than the old custom extraction method.
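
A rough sketch of how three length settings could be applied to already-converted Markdown. The character budgets and the `trim_markdown` helper are made up for illustration; the tool's real limits and truncation logic may differ.

```python
# Hypothetical character budgets; None means "return everything".
LENGTH_LIMITS = {"Brief": 1000, "Standard": 4000, "Full": None}


def trim_markdown(markdown: str, length: str = "Standard") -> str:
    """Trim Markdown to the budget for the chosen length setting."""
    limit = LENGTH_LIMITS[length]
    if limit is None or len(markdown) <= limit:
        return markdown
    # Prefer cutting at the last paragraph break inside the budget
    # so the output does not end mid-paragraph.
    cut = markdown.rfind("\n\n", 0, limit)
    if cut <= 0:
        cut = limit
    return markdown[:cut].rstrip() + "\n\n[truncated]"
```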

The Search_DuckDuckGo tool has a few small improvements. The input is easier for small models to get right, and the output is more readable.

# Speech Generation

I've added the remaining voices for Kokoro-82M; it now supports all 54 voices with all accents/languages.

I also removed the 30 second cap by making sure it computes all chunks in sequence, not just the first. I've tested it on outputs that are ~10 minutes long. Do note that when used as an MCP server, the tool will time out after 1 minute; there's nothing I can do about that right now.
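
The "all chunks in sequence" fix can be sketched as follows. This is an assumption-laden illustration, not the Space's code: `synthesize_chunk` stands in for whatever call produces audio samples for one piece of text, and the sentence-based splitting and the 300-character budget are hypothetical.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list:
    """Group sentences into chunks of at most max_chars (a single long sentence stays whole)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def synthesize_all(text: str, synthesize_chunk, max_chars: int = 300) -> list:
    """Run every chunk through synthesize_chunk in order and join the samples."""
    audio = []
    for chunk in split_into_chunks(text, max_chars):
        # Processing only the first chunk is what would cap the output length.
        audio.extend(synthesize_chunk(chunk))
    return audio
```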

# Other Thoughts

Lots of MCP use cases involve manipulating media (image editing, ASR, etc.). I've avoided adding tools like this so far for two reasons:

1. Most of these solutions would require assigning it a ZeroGPU slot.
2. The current process of uploading files like images to a Gradio space is still a bit rough. It's doable but requires additional tools.

Both of these points make it a bit painful for local usage. I'm open to suggestions for other tools that rely on text.
prithivMLmods 
posted an update 1 day ago
Comparing: DeepCaption-VLA-7B, built on Qwen2.5-VL-7B-Instruct, is tailored for image captioning and vision-language attribution, focusing on precise, descriptive captions of visual properties, object attributes, and scene details. In contrast, Qwen2.5-VL-7B-Abliterated-Caption-it is fine-tuned for abliterated captioning, generating highly detailed descriptions across diverse visual categories.

Models🤗
✦ DeepCaption-VLA-7B : prithivMLmods/DeepCaption-VLA-7B
✦ Qwen2.5-VL-7B-Abliterated-Caption-it : prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it

Spaces⛵
➜ VisionScope-R2 : prithivMLmods/VisionScope-R2
➜ Qwen2.5-VL-Outpost : prithivMLmods/Qwen2.5-VL-Outpost

Collection🗞️
DeepCaption attr. : prithivMLmods/deepcaption-attr-68b041172ebcb867e45c556a
VL Abliterated-Caption : prithivMLmods/vl-abliterated-caption-68a0443b63182e97a15c47a3
Multimodal VLMs - Until July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

GitHub↗️
> DeepCaption-VLA-7B [4bit-notebook demo] : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb
> Qwen2.5-VL-3B-Abliterated-Caption-it(caption) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Qwen2.5-VL-3B-Abliterated-Caption-it(caption)/Qwen2_5_VL_3B_Abliterated_Caption_it.ipynb

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To know more about it, visit the app page or the respective model page!!
Tonic 
posted an update 3 days ago
🙋🏻‍♂️ Hey there folks ,

Just wanted to announce 🏭SmolFactory: it's the quickest and best way to finetune SmolLM3 and GPT-OSS-20B on huggingface!

Basically it's an app you can run on huggingface by duplicating the space and running your training directly on huggingface GPUs.

It helps you select datasets and models, fine-tune your model, set up an experiment tracker you can use on your mobile phone, push your model card, and even automatically make a demo for you on huggingface so you can test it directly when it's done!

check out the blog to learn more : https://huggingface.co/blog/Tonic/smolfactory

or just try the app directly :
Tonic/SmolFactory

you can vibe check the cool models I made :
French SmolLM3 : Tonic/Petite-LLM-3
Medical GPT-OSS : Tonic/med-gpt-oss-20b-demo

check out the model cards :
multilingual reasoner (gpt-oss) - Tonic/gpt-oss-20b-multilingual-reasoner
med-gpt-oss : Tonic/med-gpt-oss-20b
petite-elle-l-aime : Tonic/petite-elle-L-aime-3-sft

github repo if you like command line more than gradio : https://github.com/josephrp/smolfactory

drop some likes on these links it's really much appreciated !

feedback and PRs are welcome !
KnutJaegersberg 
posted an update 4 days ago
What's missing for AGI

Current transformer-based, self-supervised systems have driven massive gains, but important gaps remain on the path to AGI. Key missing pieces are continual, curiosity-driven learning; grounded multimodal perception; reliable, contextual long-term memory with forgetting; motivated (hot) executive control and dynamic attention; metacognition and coherent causal world-models; and robust fluid reasoning, planning and decision-making. Progress will require hybrid architectures (neuromorphic/Hebbian + gradients + symbolic modules), active-inference and intrinsic-motivation objectives, and new lifelong, embodied benchmarks to evaluate safety and competence.


https://huggingface.co/blog/KnutJaegersberg/whats-missing-for-agi-in-todays-tech-trajectories
prithivMLmods 
posted an update 5 days ago
FastVLMs by Apple are the talk of the week for edge device VLMs and also for consumer-grade VLMs on the Hub. They have some impressive demos available on the Hub for live captioning and inference tasks. Meanwhile, I’m still exploring one of the coolest edge-device multimodal releases—Liquid AI’s LFM2-VL (450M and 1.6B). I’ve also made a live camera video inference demo, which is capable of running on Colab’s free-tier T4 GPU.

🤗Live Captioning Notebooks:
➠ LiquidAI LFM2 VL 1.6B Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LiquidAI-LFM2-VL-Live-Cam/LiquidAI_LFM2_VL_1_6B_Live_Cam.ipynb

➠ LiquidAI LFM2 VL 450M Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LiquidAI-LFM2-VL-Live-Cam/LiquidAI_LFM2_VL_450M_Live_Cam.ipynb

✨I also made a demo for the FastVLM Live Captioning Notebook.
➠ FastVLM 0.5B Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Apple-FastVLM-0.5B-Live-Cam/apple_FastVLM_0_5B_live_cam.ipynb

↗️For more notebooks, kindly visit the following repositories.
➠ Multimodal Outpost Notebooks: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks

Feel free to fork, modify, and explore!
Locutusque 
posted an update 7 days ago
🌲🍄 LLM Forest Orchestra: Turning Hidden States into Music

Hello everyone! I'm excited to introduce a new Space I've been developing called LLM Forest Orchestra. This project converts the hidden states and attention patterns of transformer models into layered MIDI compositions. The concept draws inspiration from mushrooms and mycelial networks in forests. Fungi create underground connections linking plants and trees, establishing what some call a "wood-wide web" where signals and nutrients travel. Researchers have discovered that these exchanges form patterns resembling rhythms and pulses. When translated appropriately, these patterns can become music.

Transformers operate through remarkably similar principles: tokens share signals via hidden states and attention heads. This Space transforms those invisible information flows into notes, chords, and rhythms, treating the model as a digital forest orchestra.

🎛 Features

* Two compute modes:
- Full model operates on a Hugging Face model (defaulting to unsloth/Qwen3-14B-Base).
- Mock latents provides a CPU-friendly option that simulates tensors for immediate experimentation.
* Musical controls: You can adjust scale selection, tempo grid, velocity range, instrument/role presets, and seed randomization.
* Output: The system generates .mid files compatible with DAWs and remixing workflows.
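
As a toy illustration of the idea (not the Space's actual mapping), the sketch below simulates latents on CPU and turns each time step into a pitch on a fixed scale plus a velocity. The pentatonic scale, the use of per-step mean and mean absolute value, and all function names are assumptions made for this example. It stops at note events and does not write a .mid file.

```python
import random

C_MAJOR_PENTATONIC = [0, 2, 4, 7, 9]  # semitone offsets within one octave


def mock_latents(steps: int, width: int, seed: int = 0) -> list:
    """Simulate a (steps x width) block of hidden-state values with a seeded RNG."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(steps)]


def latents_to_notes(latents, scale=C_MAJOR_PENTATONIC, base_note=48,
                     octaves=3, velocity_range=(40, 110)) -> list:
    """Map each time step to a (MIDI pitch, velocity) pair."""
    means = [sum(row) / len(row) for row in latents]
    energies = [sum(abs(v) for v in row) / len(row) for row in latents]
    lo_m, hi_m = min(means), max(means)
    lo_e, hi_e = min(energies), max(energies)
    degrees = len(scale) * octaves
    notes = []
    for mean, energy in zip(means, energies):
        # Normalise both signals to [0, 1] across the sequence.
        m = (mean - lo_m) / (hi_m - lo_m) if hi_m > lo_m else 0.5
        e = (energy - lo_e) / (hi_e - lo_e) if hi_e > lo_e else 0.5
        degree = min(int(m * degrees), degrees - 1)
        pitch = base_note + 12 * (degree // len(scale)) + scale[degree % len(scale)]
        velocity = round(velocity_range[0] + e * (velocity_range[1] - velocity_range[0]))
        notes.append((pitch, velocity))
    return notes
```

Snapping pitches to a scale is what keeps arbitrary activations sounding musical; the seed makes a given run reproducible.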

🌌 Why?

Neural networks already resemble unusual musical instruments: signals flow through them, patterns emerge organically, and careful observation reveals hidden melodies. This is analogous to the forest's secret orchestra of mushrooms and trees.

👉 Try it

Try the Space here: Locutusque/LLM-Forest-Orchestra. I'm excited to hear the sounds you can generate. Please share your created MIDIs or remixes in the comments. Let's explore how this hidden forest of transformers can sound together. 🌳🎶
prithivMLmods 
posted an update 9 days ago
Introducing prithivMLmods/DeepCaption-VLA-7B, a multimodal VLM designed for reasoning with long-shot captions (Captioning and Vision-Language Attribution). It focuses on defining visual properties, object attributes, and scene details across a wide spectrum of images and aspect ratios, generating attribute-rich image captions. The model supports creative, artistic, and technical applications that require detailed descriptions. 🤗🔥

✦︎ Models: prithivMLmods/DeepCaption-VLA-7B, also includes prithivMLmods/DeepAttriCap-VLA-3B, an experimental model for vision-language attribution.

✦︎ Try the demo here: prithivMLmods/VisionScope-R2

✦︎ Try it now on Google Colab, with support for T4 GPUs in 4-bit quant_type: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb

✦︎ Collection: prithivMLmods/deepcaption-attr-68b041172ebcb867e45c556a
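
For context on the 4-bit T4 setup mentioned above, here is a hedged sketch of loading the model with a bitsandbytes quantization config through transformers. It assumes the checkpoint loads with the Qwen2.5-VL model class (the model is described as built on Qwen2.5-VL-7B-Instruct) and that transformers, bitsandbytes, and accelerate are installed. The `nf4` quant type and the function name are illustrative choices, and the linked notebook may be configured differently.

```python
def load_captioner_4bit(model_id: str = "prithivMLmods/DeepCaption-VLA-7B"):
    """Load the captioning model in 4-bit precision (downloads several GB of weights)."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    import torch
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        Qwen2_5_VLForConditionalGeneration,
    )

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumed quant type
        bnb_4bit_compute_dtype=torch.float16,  # float16 is the usual choice on a T4
    )
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```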

.
.
.

To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update 11 days ago
OpenGVLab's InternVL3.5 is a new family of open-source multimodal models that have advanced versatility, reasoning, and efficiency. I have created 𝐝𝐞𝐦𝐨 𝐧𝐨𝐭𝐞𝐛𝐨𝐨𝐤𝐬 for models ranging from 1B to 4B parameters, available in multiple versions (MPO, Instruct, Pre-trained) and in both "thinking" and "non-thinking" settings, with experimental compatibility for 𝐓𝐞𝐬𝐥𝐚 𝐓𝟒 GPUs.

➠InternVL3_5_2B_MPO_Thinking: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3.5-Thinking/1_InternVL3_5_2B_MPO_Thinking/1_InternVL3_5_2B_MPO_Thinking.ipynb
➠InternVL3_5_1B_Instruct_Thinking: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3.5-Thinking/2_InternVL3_5_1B_Instruct_Thinking/2_InternVL3_5_1B_Instruct_Thinking.ipynb

➠InternVL3_5-1B-MPO: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-MPO/InternVL3_5-1B-MPO/InternVL3_5_1B_MPO.ipynb
➠InternVL3_5-2B-MPO: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/tree/main/InternVL-3.5-Notebook/InternVL3_5-MPO/InternVL3_5-2B-MPO

➠InternVL3_5-1B-Instruct: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Instruct/InternVL3_5-1B-Instruct/InternVL3_5_1B_Instruct.ipynb
➠InternVL3_5-2B-Instruct: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Instruct/InternVL3_5-2B-Instruct/InternVL3_5_2B_Instruct.ipynb

➠InternVL3_5-1B-Pretrained: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Pretrained/InternVL3_5-1B-Pretrained/InternVL3_5_1B_Pretrained.ipynb
➠InternVL3_5-2B-Pretrained: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Pretrained/InternVL3_5-2B-Pretrained/InternVL3_5_2B_Pretrained.ipynb

Note: the notebooks run without flash_attention.
prithivMLmods 
posted an update 12 days ago
OpenGVLab's InternVL3_5-2B-MPO [Mixed Preference Optimization (MPO)] is a compact vision-language model in the InternVL3.5 series. You can now experience it in the Tiny VLMs Lab, an app featuring 15+ multimodal VLMs ranging from 250M to 4B parameters. These models support tasks such as OCR, reasoning, single-shot answering with small models, and captioning (including abliterated variants), across a broad range of visual categories. They are also capable of handling images with complex, sensitive, or nuanced content, while adapting to varying aspect ratios and resolutions.

✨ Space/App : prithivMLmods/Tiny-VLMs-Lab
🫙 Model : OpenGVLab/InternVL3_5-2B-MPO
↗️ Collection: OpenGVLab/internvl35-68ac87bd52ebe953485927fb
🗞️ Paper : https://arxiv.org/pdf/2508.18265
↗️ Multimodal Space Collection : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

To learn more, visit the relevant spaces, collections, and model cards.
prithivMLmods 
posted an update 13 days ago
Dropping new adapters for Qwen-Image, including Qwen-Image-Studio-Realism, Qwen-Image-Anime-LoRA, Qwen-Image-Sketch-Smudge, Qwen-Image-Synthetic-Face, and Qwen-Image-Fragmented-Portraiture, with various style intermix compatibilities. For more details, visit the model card.

⤷ Studio Realism : prithivMLmods/Qwen-Image-Studio-Realism
⤷ Image Anime LoRA : prithivMLmods/Qwen-Image-Anime-LoRA
⤷ Sketch Smudge : prithivMLmods/Qwen-Image-Sketch-Smudge
⤷ Synthetic Face : prithivMLmods/Qwen-Image-Synthetic-Face
⤷ Fragmented Portraiture : prithivMLmods/Qwen-Image-Fragmented-Portraiture

Try it here at
✦︎ Qwen-Image-LoRA-DLC : prithivMLmods/Qwen-Image-LoRA-DLC
✦︎ Qwen-Image-Diffusion : prithivMLmods/Qwen-Image-Diffusion

Collection
✦︎ Qwen-Image-Exp-LoRA : prithivMLmods/qwen-image-exp-lora-68a978fe11400bc3165b0c4d
✦︎ Image Gen Apps (Diffusion) - LastUpdated 08/18 : prithivMLmods/image-gen-apps-diffusion-lastupdated-08-18-68a2f4c5ef3e5e394eacc20a

.
.
.

To know more, visit the following spaces, collections, and model cards.
Nymbo 
posted an update 13 days ago
I built a general use MCP space ~ Fetch webpages, DuckDuckGo search, Python code execution, Kokoro TTS, Image Gen, Video Gen.

# Tools

1. Fetch webpage
2. Web search via DuckDuckGo (very concise, low excess context)
3. Python code executor
4. Kokoro-82M speech generation
5. Image Generation (use any model from HF Inference Providers)
6. Video Generation (use any model from HF Inference Providers)

The first four tools can be used without any API keys whatsoever. DDG search is free, and the code execution and speech gen are done on CPU. Having a HF_READ_TOKEN in the env variables will show all tools. If there isn't a key present, the Image/Video Gen tools are hidden.
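
The token-based visibility rule can be sketched like this. The tool identifiers and the `visible_tools` helper are hypothetical; only the `HF_READ_TOKEN` variable name and the split between the first four tools and the Image/Video tools come from the post.

```python
import os

# Hypothetical identifiers for the six tools listed above.
OPEN_TOOLS = ["Fetch", "DuckDuckGo_Search", "Python_Executor", "Kokoro_TTS"]
TOKEN_TOOLS = ["Image_Generation", "Video_Generation"]


def visible_tools(env=None) -> list:
    """Return the tools to expose, hiding token-dependent ones when no token is set."""
    env = os.environ if env is None else env
    tools = list(OPEN_TOOLS)
    if env.get("HF_READ_TOKEN"):  # an empty string counts as "not set"
        tools += TOKEN_TOOLS
    return tools
```

Passing the environment as an argument keeps the check easy to exercise without touching real secrets.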

Nymbo/Tools
prithivMLmods 
posted an update 20 days ago
prithivMLmods 
posted an update 22 days ago
Excited to introduce the Tiny VLMs Lab App for experiencing 15+ multimodal VLMs, ranging from a 250M parameter model to a 4B parameter model, for tasks like OCR, reasoning, small models for single-shot answering, and captioning (abliterated), across a broad range of visual categories including images with complex, sensitive, or nuanced content, while handling varying aspect ratios and resolutions.🧪

🤗 Space/App: prithivMLmods/Tiny-VLMs-Lab

✦︎ Also introducing prithivMLmods/Qwen2.5-VL-3B-Abliterated-Caption-it, tailored for Abliterated Captioning / Uncensored Image Captioning. This release comes as a lighter alternative to the existing prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it model, making it usable on mid-range GPUs and, experimentally, on T4 GPUs.

✦︎ Collection: prithivMLmods/vl-abliterated-caption-68a0443b63182e97a15c47a3
✦︎ GitHub: https://github.com/PRITHIVSAKTHIUR/Tiny-VLMs-Lab
.
.
.
To know more about it, visit the app page or the respective model page!!
Nymbo 
posted an update 22 days ago
Anyone using Jan-v1-4B for local MCP-based web search, I highly recommend you try out Intelligent-Internet/II-Search-4B.

Very impressed with this lil guy and it deserves more downloads. It's based on the original version of Qwen3-4B but I find that it questions reality way less often. Jan-v1 seems to think that everything it sees is synthetic data and constantly gaslights me.
prithivMLmods 
posted an update 25 days ago
Try Liquid AI's all-new multimodal models: LFM2-VL-1.6B & LFM2-VL-450M! The demos come with a Gradio UI and ReportLab support, and both models run on a T4 GPU!

↗ LFM2-VL-1.6B-LiquidAI : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LFM2-VL-1.6B-LiquidAI/LFM2-VL-1.6B_ReportLab.ipynb

↗ LFM2-VL-450M-LiquidAI : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LFM2-VL-450M-LiquidAI/LFM2-VL-450M_ReportLab.ipynb

.
.
.
To know more about it, visit the multimodal outpost notebooks !!
prithivMLmods 
posted an update 29 days ago
On the verge of releasing Poseidon-Reasoning-5M, a dataset built to excel in general thought processes, mathematics, and science across a diverse mixture of domains, I’m also dropping the Gargantua-R1-Compact dataset, a collection of over six million high-quality reasoning QA pair traces. 🤗🚀

✦ Gargantua-R1-Compact : prithivMLmods/Gargantua-R1-Compact

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Gargantua-R1-Compact", split="train")

Additionally, I’m adding the mini version of Gargantua — the Gargantua-R1-Wee : prithivMLmods/Gargantua-R1-Wee

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Gargantua-R1-Wee", split="train")

The composition spans 73.93% core mathematical reasoning involving problems, proofs, and computational challenges, 12.11% across diverse scientific domains such as physics, chemistry, biology, and interdisciplinary topics, 11.35% in competitive coding covering algorithms and data structures, 1.37% in academic science focusing on research-level methodology, 0.95% in creative and analytical reasoning through logic puzzles and problem-solving tasks, 0.25% in specialized technical areas like MLOps, LLMs, diffusion models, and CUDA, and 0.06% involving data from graphs and charts converted into structured JSON formats. Designed with both rich contextual depth and formal structural clarity, Gargantua-R1-Compact is an optimal resource for advancing research in symbolic reasoning, interpretability, and high-precision question answering in mathematical domains.
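
As a quick arithmetic check on the breakdown above, the listed shares add up to 100.02%, which is presumably a rounding artifact. The sketch below sums them and converts a share into an approximate row count. The 6,000,000 total is an assumed lower bound taken from "over six million", not an exact dataset size, and `approx_rows` is a hypothetical helper.

```python
from decimal import Decimal

# Shares (in percent) as listed in the post.
SHARES = {
    "core mathematical reasoning": Decimal("73.93"),
    "diverse scientific domains": Decimal("12.11"),
    "competitive coding": Decimal("11.35"),
    "academic science": Decimal("1.37"),
    "creative and analytical reasoning": Decimal("0.95"),
    "specialized technical areas": Decimal("0.25"),
    "graphs and charts to JSON": Decimal("0.06"),
}

total = sum(SHARES.values())
print(total)  # 100.02


def approx_rows(category: str, total_rows: int = 6_000_000) -> int:
    """Estimate rows in a category, assuming total_rows examples overall."""
    return int(total_rows * SHARES[category] / 100)


print(approx_rows("core mathematical reasoning"))  # 4435800
```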

✦ Collection : prithivMLmods/gargantua-r1-mod-6896bfd7834e82b89ad2b38b


To know more about it, visit the dataset card of the respective dataset. !!
prithivMLmods 
posted an update 30 days ago
I've added the demo of the openbmb/MiniCPM-V-4 model to the Hugging Face Space:
prithivMLmods/Multimodal-VLM-Thinking

✨ MiniCPM-V 4.0 is the latest efficient model in the MiniCPM-V series. The model is built based on SigLIP2-400M and MiniCPM4-3B, with a total of 4.1B parameters. It inherits the strong single-image, multi-image, and video understanding performance of MiniCPM-V 2.6 with largely improved efficiency.

✨ With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. This performance surpasses GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B parameters, OpenCompass 65.2), and Qwen2.5-VL-3B-Instruct (3.8B parameters, OpenCompass 64.5). It also shows good performance in multi-image and video understanding.

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update about 1 month ago
Qwen Image – The Latest Image Generation Model🔥

Below are some samples generated using the Qwen Image Diffusion Model. Qwen-Image, a 20B MMDiT model for next-generation text-to-image generation, preserves typographic details, layout coherence, and contextual harmony with stunning accuracy. It is especially strong at creating stunning graphic posters with native text. The model is now open-source. [ 𝚀𝚠𝚎𝚗-𝙸𝚖𝚊𝚐𝚎 : Qwen/Qwen-Image ]

⤷ Try the Qwen Image demo here: prithivMLmods/Qwen-Image-Diffusion

⤷ Qwen-Image Technical Report : Qwen-Image Technical Report (2508.02324)
⤷ Qwen Image [GitHub] : https://github.com/QwenLM/Qwen-Image

Even more impressively, it demonstrates a strong ability to understand images. The model supports a wide range of vision-related tasks such as object detection, semantic segmentation, depth and edge (Canny) estimation, novel view synthesis, and image super-resolution. While each task is technically distinct, they can all be viewed as advanced forms of intelligent image editing driven by deep visual understanding. Collectively, these capabilities position Qwen-Image as more than just a tool for generating appealing visuals; it serves as a versatile foundation model for intelligent visual creation and transformation, seamlessly blending language, layout, and imagery.

Qwen-Image uses a dual-stream MMDiT architecture with a frozen Qwen2.5-VL, VAE encoder, RMSNorm for QK-Norm, LayerNorm elsewhere, and a custom MSRoPE scheme for joint image-text positional encoding.

.
.
.
To know more about it, visit the model card of the respective model. !!
Tonic 
posted an update about 1 month ago