M4-ai (M4-ai)

posted an update about 19 hours ago

Post

134

Do you think domain-specific embedding fine-tuners are needed?
I've been working with embeddings for marketing use cases and noticed something: most embeddings don't get marketing concepts very well. They're trained in general-purpose ways.
The Issue I'm Seeing
When I search marketing content with general embeddings:

"organic growth" returns farming articles
"conversion funnel" matches industrial equipment
"brand lift" doesn't connect to campaign effectiveness
Marketing jargon like CAC, ROAS, CTR aren't properly understood

My Question
Do you think domain-specific embeddings are needed for marketing?
Some thoughts:

Marketing has its own vocabulary and concept relationships
General models trained on Wikipedia/web crawl miss these nuances
But is fine-tuning worth the effort vs just using more retrieval tricks?

Quick Example
I fine-tuned all-mpnet-base-v2 on ~1000 marketing concept pairs and saw 15-20% better retrieval accuracy. But I'm curious:

Has anyone else tried this for marketing or other domains?
When do you think domain-specific embeddings are actually necessary vs overkill?
Are there better approaches I'm missing?

https://huggingface.co/blog/Sri-Vigneshwar-DJ/why-your-marketing-rag-system-needs-domain-specifi

Nymbo

posted an update 3 days ago

Post

270

I have a few Sora-2 invites - 15509N

Sri-Vigneshwar-DJ

posted an update 3 days ago

Post

4331

🚀 Exciting News! We've released a Performance Marketing Expert Dataset from Hawky.ai [www.hawky.ai]

Hawky-ai

This dataset empowers AI models with cutting-edge strategies for Meta, Google Ads, and TikTok campaigns. It includes:
1. Multi-platform strategies for e-commerce, DTC, B2B, and more
2. Creative optimization and audience targeting insights
3. ROI-driven recommendations based on 2025 best practices

Sri-Vigneshwar-DJ/Performance-Marketing-Data

prithivMLmods

posted an update 4 days ago

Post

4385

Try the Hugging Face Space demo for Logics-MLLM/Logics-Parsing, the latest multimodal VLM from the Logics Team at Alibaba Group. It enables end-to-end document parsing with precise content extraction in markdown format, and it also generates a clean HTML representation of the document while preserving its logical structure. 🤗🔥

Additionally, I’ve integrated one of my recent works — prithivMLmods/Gliese-OCR-7B-Post1.0 — which also excels at document comprehension.

⭐ Space / App : prithivMLmods/VLM-Parsing
📄 Technical Report by the Logics Team, Alibaba Group : Logics-Parsing Technical Report (2509.19760)
🖖 MM: VLM-Parsing: prithivMLmods/mm-vlm-parsing-68e33e52bfb9ae60b50602dc
⚡ Collections : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Other Pages:

➔ Multimodal VLMs - July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➔ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd
➔ VL caption — < Sep 15 ’25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391

.
.
.
To know more about it, visit the app page or the respective model page!!

Sri-Vigneshwar-DJ

posted an update 6 days ago

Post

3280

🚀 Qwen3-Omni for Marketing: A Game-Changer

Just wanted to share something exciting I've been exploring—Qwen3-Omni and how it's transforming marketing workflows.

What makes it special? At Hawky.ai we are started experimenting with Qwen3 recently for Analysis and Optimization.

Unlike traditional tools that look at text, images, or audio separately, Qwen3-Omni analyzes everything together. It handles 119 languages, processes 40-minute audio sequences, and understands both images and videos—all at once.

The cool part? It's 2-3x faster than similar models thanks to its MoE architecture.

Real applications I'm seeing:
Ad Analysis: It scores video ads by combining visual elements, audio tone, and text—giving 25% better CTR predictions than single-mode tools.
Campaign Localization: Drop in one ad, get 10 localized versions with native voiceovers in under a minute. Perfect for testing across markets.

Market Research: Feed it competitor content, podcasts, or UGC videos. It extracts actionable insights like "3-second hooks boost retention by 15%" and saves about 70% of analysis time.

Quality Checks: Automatically catches lip-sync errors and audio-visual mismatches.

Full technical breakdown: https://huggingface.co/blog/Sri-Vigneshwar-DJ/hawky-aiqwen3-omni-advanced-architecture-and-marke

Has anyone else been experimenting with multimodal models for marketing? Would love to hear what you're building!

#MultimodalAI #MarTech #OpenSource

prithivMLmods

posted an update 7 days ago

Post

1144

Try Banana Zoom an advanced image enhancement web app that lets users select regions of an image for AI-powered upscaling and detail refinement. Using Google’s (nano banana), it analyzes selections, generates context-aware enhancements, and produces high-resolution outputs. Simply drag-and-drop or upload images, make precise or fixed-size selections, and watch improvements in real-time with smooth zoom and pixel-dissolve effects.

Space / App: prithivMLmods/Banana-Zoom
Collection: https://huggingface.co/collections/prithivMLmods/image-gen-apps-diffusion-lastupdated-09-23-68a2f4c5ef3e5e394eacc20a
GitHub: https://github.com/prithivsakthiur/banana-zoom

Your API will be automatically destroyed once you refresh the app or exit it, so each user's API will be cycled in this way.

prithivMLmods

posted an update 13 days ago

Post

4368

Photo-Mate-i2i – a space for experimenting with adapters for image manipulation using Kontext adapters, including Photo-Restore-i2i, PhotoCleanser-i2i, Polaroid-Warm-i2i, Yarn-Photo-i2i, Monochrome-Pencil, and more. Try out the demo, and to learn more, visit the app page or the respective model pages!

⚡Demo: prithivMLmods/Photo-Mate-i2i
⚙️How to Use: prithivMLmods/Photo-Mate-i2i#2
👨‍🔧i2i-Kontext(Experimental LoRAs): prithivMLmods/i2i-kontext-exp-68ce573b5c0623476b636ec7

Tonic

posted an update 14 days ago

Post

481

the french ministry of culture releases their first conversation datasets on huggingface 👇🏻
ministere-culture/comparia-conversations

KnutJaegersberg

posted an update 14 days ago

Post

337

The Formative Mind: Theories of Consciousness as Practice

Instead of treating consciousness as a passive byproduct of a powerful unconscious engine, think of it as the engine itself: a process that builds rich representations (self-organizing), predicts and models its own processing (metarepresentation), and thereby brings an agent and its world into being (individuation). A brief synthesis.

https://huggingface.co/blog/KnutJaegersberg/formative-mind

prithivMLmods

posted an update 15 days ago

Post

5161

Dropping some experimental adapters for FLUX.1-Kontext-dev, including Photo-Restore-i2i, PhotoCleanser-i2i, Polaroid-Warm-i2i, Yarn-Photo-i2i, and Monochrome-Pencil. These were trained under various settings with minimal image pairs to achieve optimal results. The dataset result sets end pairs were synthesized using Gemini-2.5-Flash-Image-Preview and others.🤗✨

prithivMLmods/PhotoCleanser-i2i: Remove objects while preserving the rest of the image.
prithivMLmods/Photo-Restore-i2i: Restore old photos into moderately colorized, detailed images.
prithivMLmods/Polaroid-Warm-i2i: Seamless vintage Polaroid-style images with warm, faded tones.
prithivMLmods/Yarn-Photo-i2i: Convert images into yarn-stitched artwork while retaining key details.
prithivMLmods/Monochrome-Pencil: Turn images into monochrome pencil sketches while keeping original features.

✨Note: All the above models share the same auto-labeling multimodal VLM captioning model, prithivMLmods/DeepCaption-VLA-7B, which is used for refining edit instructions and accurately understanding attributions for the generations.

✨Collection: prithivMLmods/i2i-kontext-exp-68ce573b5c0623476b636ec7

.
.
.
To know more about it, visit the app page or the respective model page!!

Nymbo

posted an update 19 days ago

Post

880

There's now a custom Deep_Research tool in my Nymbo/Tools MCP server! TL;DR: The agent using the tools writes a summary of your requests and up to five DuckDuckGo searches (up to 50 results). Each of the webpages found in the searches are then fetched and given to our researcher (Qwen3-235B-A22B-Thinking-2507). The researcher sees the summary, searched queries, and fetched links, then writes a thorough research report. The agent using the tool provides the user with a summary of the report and a link to download research_report.txt. The researcher's instructions are similar to some leaked Perplexity sys prompts.

# Deep_Research Tool

It accomplishes everything in under a minute so it doesn't hit MCP's 60 second timeout, mostly thanks to Cerebras. The only thing required to make this work is a HF_READ_TOKEN for inference.

The Deep_Research tool could certainly be improved. It still needs some sort of mechanism for sorting URLs based on importance (I've got some ideas but I don't want it to be the responsibility of the agent using the tool). I'll probably add a second researcher to filter out the bad sources before inferencing the big researcher. I'm hellbent on keeping this all within the scope of one tool call.

# More Fetch/Web Search Improvements

The Search_DuckDuckGo tool has been further enhanced. It now allows the agent to browse through all pages of results. The results also now include published date (if detected). It also now supports every DDG search types! Default DDG search is called text, but it can also now search by news, images, videos, and books.

The Fetch_Webpage tool now specifies how much of the page has been truncated, and cursor index, allowing it to pickup where it left off without re-consuming tokens. The model can now also choose to strip CSS selectors to remove excess noise, and there's a new URL Scraper mode that only returns URLs found on the full page.

More to come soon ~

prithivMLmods

posted an update 19 days ago

Post

1571

Many of 'em pinged me asking to make the nano-banana-aio to available on hf.co/spaces, so I’ve transferred the app’s tech stack to make it compatible for deployment on Spaces. (Can be accessed with your own Gemini API) 🤗⭐️

✦ Yes, it is now available on Spaces: prithivMLmods/Nano-Banana-AIO

Nano Banana AIO (All-in-One) App, which offers seamless image manipulation features, including single/multiple image adaptation, a canvas for free-style drawing to creative image generation, and standard text-to-image generation.

All in One Banana for you! 😉

Tonic

posted an update 20 days ago

Post

597

COMPUTER CONTROL IS ON-DEVICE !

🏡🤖 78 % of EU smart-home owners DON’T trust cloud voice assistants.

So we killed the cloud.

Meet Exté: a palm-sized Android device that sees, hears & speaks your language - 100 % offline, 0 % data sent anywhere.

🔓 We submitted our technologies for consideration to the Liquid AI hackathon.

📊 Dataset: 79 k UI-action pairs on Hugging Face (largest Android-control corpus ever) Tonic/android-operator-episodes

⚡ Model: 98 % task accuracy, 678MB compressed , fits on existing android devices ! Tonic/l-android-control

🛤️ Experiment Tracker : check out the training on our TrackioApp Tonic/l-android-control

🎮 Live Model Demo: Upload an Android Screenshot and instructions to see the model in action ! Tonic/l-operator-demo

Built in a garage, funded by pre-orders, no VC. Now we’re scaling to 1 k installer units.

We’re giving 50 limited-edition prototypes to investors , installers & researchers who want to co-design the sovereign smart home.

👇 Drop “EUSKERA” in the comments if you want an invite, tag a friend who still thinks Alexa is “convenient,” and smash ♥️ if AI should belong to people - not servers.

prithivMLmods

posted an update 20 days ago

Post

3061

I'm a Hugging Face Fellow now, guys!🤗❤️

With the same passion, trust, and momentum to contribute to the community, I’m excited to do some amazing things to wrap up Q3 and Q4 of 2025. And importantly, I’ve been lucky enough to receive some knowledge and guidance from @merve to build open-source demos and stuff. Thank you for the belief.

Thank you — much love.
Long live open source!

— Prithiv

prithivMLmods

posted an update 23 days ago

Post

7159

Introducing Gliese-OCR-7B-Post1.0, a document content-structure retrieval VLM designed for content extraction(OCRs) and summarization. This is the third model in the Camel Doc OCR VLM series, following Camel-Doc-OCR-062825. The new version fixes formal table reconstruction issues in both En and Zh, achieving optimal performance for long-context inferences. This model also shows significant improvements in LaTeX and Markdown rendering for OCR tasks.

🤗 Gliese-OCR-7B-Post1.0 : prithivMLmods/Gliese-OCR-7B-Post1.0
📌 Gliese-Post1.0 Collection : prithivMLmods/gliese-post10-68c52c4a6ca4935f5259a6d7
⬅️ Previous Versions : prithivMLmods/Camel-Doc-OCR-062825
🧨 Gliese-OCR-7B-Post1.0 (4-bit) Notebook Demo on T4 : prithivMLmods/Gliese-OCR-7B-Post1.0
📖 GitHub [Gliese-OCR-7B-Post1.0(4-bit)-reportlab] : https://tinyurl.com/ys7zuerc

Other Collections:

➔ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
➔ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd
➔ Multimodal VLMs - July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

.
.
.
To know more about it, visit the app page or the respective model page!!

2 replies

·

prithivMLmods

posted an update 26 days ago

Post

3069

The POINTS-Reader, a vision-language model for end-to-end document conversion, is a powerful, distillation-free Vision-Language Model that sets new SoTA benchmarks. The demo is now available on HF (Extraction, Preview, Documentation). The input consists of a fixed prompt and a document image, while the output contains only a string (the text extracted from the document image). 🔥🤗

✦ Space/App: prithivMLmods/POINTS-Reader-OCR
✦ Model: tencent/POINTS-Reader
✦ Paper: https://arxiv.org/pdf/2509.01215

🤗 The app is done and ready to go brrrr with zero GPU. Thankyou @merve

.
.
.
To know more about it, visit the app page or the respective model page!!

4 replies

·

prithivMLmods

posted an update 27 days ago

Post

3857

Build something cool with Nano Banana aka Gemini 2.5 Flash Image AIO [All-in-One]. Draw and transform on canvas, edit images, and generate images—all in one place!🍌

✦︎ Constructed with the Gemini API (GCP). Try it here: prithivMLmods/Nano-Banana-AIO (Added the Space recently! - Sep 18 '25)

4 replies

·

Abhaykoul

posted an update 27 days ago

Post

2782

🚀 Ever dreamed of training your own Large Language Model from scratch? What if I told you it doesn't require a supercomputer or PhD in ML? 🤯

Introducing LLM Trainer - the educational framework that makes LLM training accessible to EVERYONE! Whether you're on a CPU-only laptop or scaling to distributed GPUs, we've got you covered. 💻➡️🖥️

Why LLM Trainer? Because existing tools are either too simplistic (hiding the magic) or too complex (requiring expert knowledge). We bridge the gap with:

🎓 Educational transparency - every component built from scratch with clear code
💻 CPU-first approach - start training immediately, no GPU needed
🔧 Full customization - modify anything you want
📈 Seamless scaling - from laptop to cluster without code changes
🤝 HuggingFace integration - works with existing models & tokenizers

Key highlights:
✅ Built-in tokenizers (BPE, WordPiece, HF wrappers)
✅ Complete Transformer implementation from scratch
✅ Optimized for CPU training
✅ Advanced features: mixed precision, gradient checkpointing, multiple generation strategies
✅ Comprehensive monitoring & metrics

Perfect for:
- Students learning transformers
- Researchers prototyping new ideas
- Developers building domain-specific models

Ready to train your first LLM? It's easier than you think!

🔗 Check it out: https://github.com/HelpingAI/llm-trainer
📚 Docs: Getting Started Guide
💬 Join the community: GitHub Discussions

#AI #MachineLearning #LLM #DeepLearning #OpenSource #Python #HuggingFace #NLP

Special thanks to HuggingFace and PyTorch teams for the amazing ecosystem! 🙏

1 reply

·

prithivMLmods

posted an update 30 days ago

Post

6444

Dropped the HeadshotX : a super-realistic headshot adapter for Qwen/Qwen-Image, an image generation model by Qwen. It is an advanced LoRA adaptation of the Qwen-Image model and an upgraded version of prithivMLmods/Qwen-Image-Studio-Realism, offering more precise portrait rendering with a strong focus on realism. The model was trained on diverse face types from across the world, labeled with florence2-en and caption-optimized using prithivMLmods/DeepCaption-VLA-7B. 11(types) × 5 different face types: Asian, Hispanic, Caucasian, Latina, Middle Eastern, etc.

⮞ Model🤗: prithivMLmods/Qwen-Image-HeadshotX

⮞ The Previous Adapter (LoRA): prithivMLmods/Qwen-Image-Studio-Realism

⮞ Collection: prithivMLmods/qwen-image-exp-lora-68a978fe11400bc3165b0c4d

.
.
.
To know more about it, visit the app page or the respective model page!!

2 replies

·

Nymbo

posted an update 30 days ago

Post

983

I have a few updates to my MCP server I wanna share: New Memory tool, improvements to web search & speech generation.

# Memory_Manager Tool

We now have a Memory_Manager tool. Ask ChatGPT to write all its memories verbatim, then tell gpt-oss-20b to save each one using the tool, then take them anywhere! It stores memories in a memories.json file in the repo, no external database required.

The Memory_Manager tool is currently hidden from the HF space because it's intended for local use. It's enabled by providing a HF_READ_TOKEN in the env secrets, although it doesn't actually use the key for anything. There's probably a cleaner way of ensuring memory is only used locally, I'll come back to this.

# Fetch & Websearch

The Fetch_Webpage tool has been simplified a lot. It now converts the page to Markdown and returns the page with three length settings (Brief, Standard, Full). This is a lot more reliable than the old custom extraction method.

The Search_DuckDuckGo tool has a few small improvements. The input is easier for small models to get right, and the output is more readable.

# Speech Generation

I've added the remaining voices for Kokoro-82M, it now supports all 54 voices with all accents/languages.

I also removed the 30 second cap by making sure it computes all chunks in sequence, not just the first. I've tested it on outputs that are ~10 minutes long. Do note that when used as an MCP server, the tool will timeout after 1 minute, nothing I can do about that for right now.

# Other Thoughts

Lots of MCP use cases involve manipulating media (image editing, ASR, etc.). I've avoided adding tools like this so far for two reasons:

1. Most of these solutions would require assigning it a ZeroGPU slot.
2. The current process of uploading files like images to a Gradio space is still a bit rough. It's doable but requires additional tools.

Both of these points make it a bit painful for local usage. I'm open to suggestions for other tools that rely on text.

AI & ML interests

Team members 37

M4-ai's activity