AI & ML interests

SLPL stands for Speech and Language Processing Lab, located in the Computer Engineering Department of Sharif University of Technology. The group's major fields of activity are speech recognition and natural language processing.

Recent Activity

sadrasabouri updated a dataset 9 months ago
SLPL/syntran-fa
z-rahimi-r updated a model about 1 year ago
SLPL/Hubert-base-ShEMO

prithivMLmods posted an update about 1 hour ago
Explore OCR, Captioning, and Visual Understanding with Cutting-Edge Models on Hugging Face. 🤗🧪

I've put together a collection of Google Colab notebooks for experimenting with some of the most exciting models on the Hugging Face Hub, focused on OCR, image captioning, and visual understanding tasks. [Image-to-Text] / [Image-Text-to-Text]

> 📖 OCR-ReportLab-Notebooks : prithivMLmods/OCR-ReportLab-Notebooks

These notebooks are built for quick prototyping and run on free T4 GPUs, making them perfect for experimentation, testing ideas, or just exploring what's possible with modern vision-language models.

Note: The experimental notebooks are compiled with models that fit within the T4 GPU (free-tier) limits. More models along with their notebooks will be added over time.
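For a rough sense of what running one of these models in a notebook looks like, here is a minimal captioning sketch (not taken from the notebooks themselves; the checkpoint and image path are illustrative placeholders, and any Hub image-to-text model that fits in T4 memory should behave similarly):

from transformers import pipeline

# Minimal image-to-text sketch; the checkpoint is illustrative, not one of the
# notebook models. device=0 targets the Colab T4 GPU; omit it to run on CPU.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base", device=0)
print(captioner("document_page.png")[0]["generated_text"])  # path/URL is a placeholder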
prithivMLmods posted an update 3 days ago
Excited to introduce the new experimental model "Qwen2.5-VL-7B-Abliterated-Caption-it", which is performing exceptionally well on image captioning tasks. This variant is specifically tailored for abliterated and uncensored image captioning. It is designed to generate highly detailed and descriptive captions across a broad range of visual categories, including images with complex, sensitive, or nuanced content, while handling varying aspect ratios and resolutions. 🧪🤗

✨ Try the demo here : prithivMLmods/Qwen2.5-VL
✨ Qwen2.5-VL-7B-Abliterated-Caption-it : prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
✨ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
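A minimal sketch of how a Qwen2.5-VL-based captioner like this is typically run, assuming a recent transformers release with Qwen2.5-VL support; the prompt and image path are placeholders, and the exact snippet on the model card may differ:

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("example.jpg")  # placeholder image path
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image in detail."}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))  # caption only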

To know more, visit the respective model card.
prithivMLmods posted an update 4 days ago
olmOCR [Allen AI] just got an upgrade! 📈🧑‍🍳

allenai/olmOCR-7B-0725 is fine-tuned with allenai/olmOCR-mix-0225 on top of Qwen/Qwen2.5-VL-7B-Instruct, pushing the boundaries of OCR technology. It takes a single document image as input, with the longest side resized to 1288 pixels, and offers a high-quality, openly available approach to optical character recognition for PDFs and other complex documents.
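A small sketch of the preprocessing described above (resizing so the longest side is 1288 px), assuming PIL; the file name is a placeholder:

from PIL import Image

def resize_longest_side(path, target=1288):
    # Scale the page so its longest side equals `target`, keeping the aspect ratio,
    # matching the input convention mentioned above for olmOCR-7B-0725.
    img = Image.open(path)
    scale = target / max(img.size)
    return img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)

page = resize_longest_side("document_page.png")  # placeholder path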

Try the demo here: prithivMLmods/Multimodal-OCR

✨ Model: allenai/olmOCR-7B-0725
✨ Model [fp8]: allenai/olmOCR-7B-0725-FP8
✨ Multimodal Implementations Space Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

To know more, visit the respective model card.
prithivMLmods posted an update 7 days ago
Upgraded the step-by-step notebook for fine-tuning SigLIP2 on domain-specific image classification tasks. The notebook supports both datasets with predefined train/test splits and those with only a train split, making it suitable for low-resource, custom, and real-world classification scenarios. 📒👉

➺ FineTuning-SigLIP2-Notebook : prithivMLmods/FineTuning-SigLIP2-Notebook

➺ GitHub : https://github.com/PRITHIVSAKTHIUR/FineTuning-SigLIP-2

➺ In the first scenario, the dataset includes predefined train and test splits, enabling conventional supervised learning and generalization evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

➺ In the second scenario, only a training split is available; in such cases, the training set is either partially reserved for validation or reused entirely for evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

This flexibility supports experimentation in constrained or domain-specific settings, where standard test annotations may not exist.
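A minimal sketch of the train-only scenario using the datasets library (the dataset name and hold-out ratio are placeholders, not fixed choices from the notebook):

from datasets import load_dataset

# Scenario 2: only a "train" split exists, so hold part of it out for evaluation.
ds = load_dataset("your-username/your-image-dataset", split="train")  # placeholder dataset
splits = ds.train_test_split(test_size=0.1, seed=42)  # 10% held out; the ratio is arbitrary
train_ds, eval_ds = splits["train"], splits["test"]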
prithivMLmods posted an update 9 days ago
Dropping the general-purpose reasoning dataset Poseidon-Reasoning-5M, which supports general thought processes, math, and science, featuring a diverse mixture of domains 🌊 : prithivMLmods/Poseidon-Reasoning-5M

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-5M", split="data")
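If the full 5M examples are more than you need at once, streaming mode is an optional way to iterate without downloading everything up front (a workflow suggestion, not a requirement of the dataset):

streamed = load_dataset("prithivMLmods/Poseidon-Reasoning-5M", split="data", streaming=True)
print(next(iter(streamed)))  # lazily inspect a single example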

The compact version is Poseidon-Reasoning-Mini-300K : prithivMLmods/Poseidon-Reasoning-Mini-300K


from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-Mini-300K", split="train")


Collection : prithivMLmods/poseidon-reasoning-6879ca98e118b307c781a9ba
prithivMLmods posted an update 12 days ago
Open Omega Ω (Forge, Atom, Explora):
A Fusion of Math, Science, and Coding 🧪🤗

Datasets :
⌯⌲ Open-Omega-Forge-1M [Mathematics, Coding, and Science]: prithivMLmods/Open-Omega-Forge-1M
⌯⌲ Open-Omega-Atom-1.5M [Mathematics and Science]: prithivMLmods/Open-Omega-Atom-1.5M
⌯⌲ Open-Omega-Explora-2.5M [Forge + Atom]: prithivMLmods/Open-Omega-Explora-2.5M
⌯⌲ Others [subordinate portion] : a curated and blended modular dataset.

Models :
> Omega-Qwen3-Atom-8B : prithivMLmods/Omega-Qwen3-Atom-8B
> Omega-Qwen2.5-Coder-3B : prithivMLmods/Omega-Qwen2.5-Coder-3B

Dataset Collection: prithivMLmods/open-omega-a-fusion-of-math-science-and-coding-68756c37769fa39c4055cc0e
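Loading any of these follows the usual datasets pattern; a minimal sketch (the split name "train" is an assumption, so check the dataset card if it differs):

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Open-Omega-Forge-1M", split="train")  # split name assumed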

For more information, refer to the dataset card(s).

prithivMLmods posted an update 14 days ago
Excited to introduce new models that perform exceptionally well in document OCR, image captioning, and visual understanding tasks. Megalodon-OCR and Perseus-Doc-VL have both demonstrated significant improvements across key areas. You can explore live demos on Hugging Face Spaces to compare their performance with other top-tier models available on the Hub. 🤗📄

Models & Spaces :
> Megalodon-OCR (3B) : prithivMLmods/Megalodon-OCR-Sync-0713
> Perseus-Doc-vl (7B): prithivMLmods/Perseus-Doc-vl-0712
> Doc-VLMs-OCR : prithivMLmods/Doc-VLMs-OCR
> core-OCR : prithivMLmods/core-OCR


Datasets Caption Mix :
> Corvus-OCR-Caption-Mix : prithivMLmods/Corvus-OCR-Caption-Mix
> Corvus-OCR-Caption-Mini-Mix : prithivMLmods/Corvus-OCR-Caption-Mini-Mix

Collections :
> Corvus OCR Caption Mix: prithivMLmods/corvus-ocr-caption-mix-687349bfaceffbd10976f0cc
> Captioning / OCR / DocTable : prithivMLmods/captioning-ocr-doctable-687382e1da822008bb5c06f2

GitHub :
> OCR-ReportLab : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab/blob/main/Megalodon-OCR-Sync-0713-ColabNotebook/Megalodon_OCR_Sync_0713_ReportLab.ipynb

Other Spaces :
> Multimodal-OCR : prithivMLmods/Multimodal-OCR
> Multimodal-VLMs : prithivMLmods/Multimodal-VLMs
> Multimodal-OCR2 : prithivMLmods/Multimodal-OCR2
> Florence-2-Image-Caption : prithivMLmods/Florence-2-Image-Caption
> VisionScope-R2 : prithivMLmods/VisionScope-R2
> DocScope-R1 : prithivMLmods/DocScope-R1

To know more, visit the respective model card.
prithivMLmods posted an update 18 days ago
Demo of OCR & Math QA using multi-capable VLMs like MonkeyOCR-pro-1.2B, R1-One-Vision, Visionary-R1, Vision-Matters-7B, and ViGaL-7B, all running together with support for both image and video inference.
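Video inference with these VLMs generally means sampling a handful of frames and passing them to the model as images; a rough sketch with OpenCV (the frame count is an arbitrary choice, not necessarily what the Space uses):

import cv2

def sample_frames(video_path, num_frames=8):
    # Evenly sample frames so a VLM can treat the video as a small set of images.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB for PIL/transformers
    cap.release()
    return frames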

✦ Demo Spaces :
‷ Multimodal VLMs : prithivMLmods/Multimodal-VLMs

✦ Models :
‷ Visionary R1 : maifoundations/Visionary-R1
‷ MonkeyOCR [1.2B] : echo840/MonkeyOCR-pro-1.2B
‷ ViGaL 7B : yunfeixie/ViGaL-7B
‷ Lh41-1042-Magellanic-7B-0711 : prithivMLmods/Lh41-1042-Magellanic-7B-0711
‷ Vision Matters 7B : Yuting6/Vision-Matters-7B
‷ WR30a-Deep-7B-0711 : prithivMLmods/WR30a-Deep-7B-0711

✦ MonkeyOCR-pro-1.2B Colab T4 Demo [ notebook ]
‷ MonkeyOCR-pro-1.2B-ReportLab : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab/blob/main/MonkeyOCR-0709/MonkeyOCR-pro-1.2B-ReportLab.ipynb

✦ GitHub : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab

The community GPU grant was given by Hugging Face; special thanks to them. 🤗🚀

To know more, visit the respective model card.
prithivMLmods posted an update 24 days ago
Multimodal OCR with ReportLab? On Colab T4? (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B?) ... Yeah, it's possible. I've made dedicated Colab notebooks to experiment with these models (all built on top of Qwen2.5 VL). 🤗🚀

Download notebooks here :

✦ NanonetsOCR : https://colab.research.google.com/drive/1VvA-amvSVxGdWgIsh4_by6KWOtEs_Iqp
✦ MonkeyOCR : https://colab.research.google.com/drive/1vPCojbmlXjDFUt06FJ1tjgnj_zWK4mUo
✦ OCRFluxOCR : https://colab.research.google.com/drive/1TDoCXzWdF2hxVLbISqW6DjXAzOyI7pzf
✦ TyphoonOCR : https://colab.research.google.com/drive/1_59zvLNnn1kvbiSFxzA1WiqhpbW8RKbz

🜲 GitHub : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab-Notebooks

What does it do?

1. Performs OCR on the input image
2. Generates a DOCX or PDF file with the input image and the extracted text
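A rough sketch of step 2 with ReportLab (the extracted text is assumed to come from one of the OCR models above; sizes and file names are placeholders):

from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Image, Paragraph, SimpleDocTemplate, Spacer

def build_report(image_path, extracted_text, out_path="ocr_report.pdf"):
    # Assemble a PDF containing the input image followed by the extracted text.
    styles = getSampleStyleSheet()
    story = [
        Image(image_path, width=400, height=300),  # fixed size for simplicity
        Spacer(1, 12),
        Paragraph(extracted_text, styles["Normal"]),
    ]
    SimpleDocTemplate(out_path, pagesize=A4).build(story)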

To know more, visit the respective model card.
prithivMLmods posted an update 26 days ago
A bunch of comparable demos for multimodal VLMs (excelling in OCR, cinematography understanding, spatial reasoning, etc.) are now up on the Hub 🤗, covering the most recent releases through June 2025.

✦ Demo Spaces :

> [Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B, SmolDocling] : prithivMLmods/Multimodal-OCR2
> [GLM-4.1v, docscopeOCR-7B, MonkeyOCR, coreOCR-7B] : prithivMLmods/core-OCR
> [Camel-Doc-OCR, ViLaSR-7B, OCRFlux-3B, ShotVL-7B] : prithivMLmods/Doc-VLMs-v2-Localization
> [SkyCaptioner-V1, SpaceThinker-3B, coreOCR-7B, SpaceOm-3B] : prithivMLmods/VisionScope-R2
> [RolmOCR-7B, Qwen2-VL-OCR-2B, Aya-Vision-8B, Nanonets-OCR-s] : prithivMLmods/Multimodal-OCR
> [DREX-062225-7B, Typhoon-OCR-3B, olmOCR-7B-0225, VIREX-062225-7B] : prithivMLmods/Doc-VLMs-OCR
> [Cosmos-Reason1-7B, docscopeOCR-7B, Captioner-7B, visionOCR-3B] : prithivMLmods/DocScope-R1

✦ Space Collection : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

To know more, visit the respective model card.
prithivMLmods posted an update 27 days ago
The demo for Camel-Doc-OCR-062825 (exp) is optimized for document retrieval and direct Markdown (.md) generation from images and PDFs. Additional demos include OCRFlux-3B (document OCR), ViLaSR (spatial reasoning with visual drawing), and ShotVL (cinematic language understanding).
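For PDFs, pages typically need to be rendered to images before an image-based OCR model can read them; a minimal sketch using PyMuPDF (my library choice for illustration, not necessarily what the Space does):

import fitz  # PyMuPDF

def pdf_to_page_images(pdf_path, dpi=150):
    # Render each PDF page to a PNG that an image-based OCR/VLM model can process.
    doc = fitz.open(pdf_path)
    paths = []
    for i, page in enumerate(doc):
        out = f"page_{i:03d}.png"
        page.get_pixmap(dpi=dpi).save(out)
        paths.append(out)
    return paths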

✦ Space : prithivMLmods/Doc-VLMs-v2-Localization

Models :
‷ camel-doc-ocr-062825 : prithivMLmods/Camel-Doc-OCR-062825
‷ ocrflux-3b : ChatDOC/OCRFlux-3B
‷ vilasr : AntResearchNLP/ViLaSR
‷ shotvl : Vchitect/ShotVL-7B

‷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

The community GPU grant was given by Hugging Face; special thanks to them. This space supports image inference and video inference, with a result markdown canvas and object detection/localization. 🤗🚀

To know more, visit the respective model card.
prithivMLmods posted an update about 1 month ago
A demo covering DREX-062225-exp (Document Retrieval and Extraction eXpert, experimental), typhoon-ocr-3b (a bilingual document parsing model built specifically for real-world documents), VIREX-062225-exp (Video Information Retrieval and Extraction eXpert, experimental), and olmOCR-7B-0225-preview (a document parsing model based on Qwen2-VL). 🤗

✦ Demo : prithivMLmods/Doc-VLMs-OCR ~ (with .md canvas)

‷ DREX-062225-exp : prithivMLmods/DREX-062225-exp
‷ typhoon-ocr-3b : scb10x/typhoon-ocr-3b
‷ VIREX-062225-exp : prithivMLmods/VIREX-062225-exp
‷ olmOCR-7B-0225-preview : allenai/olmOCR-7B-0225-preview

‷ Collection : prithivMLmods/doc-vl-685839064a863e1cd23be3f1
‷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
To know more, visit the respective model card.
Β·
prithivMLmodsΒ 
posted an update about 1 month ago
view post
Post
2707
Updated docscopeOCR-7B-050425-exp to DREX-062225-exp, with improved precision in table structure and line spacing in the Markdown rendering of the document page. Although this is still experimental, it is expected to perform well in the defined DREX use cases [Document Retrieval and Extraction eXpert, experimental OCR]. 💻

‷ Model : prithivMLmods/DREX-062225-exp
‷ Demo : prithivMLmods/Doc-VLMs-OCR

‷ Collection : prithivMLmods/doc-vl-685839064a863e1cd23be3f1
‷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
‷ Git : https://github.com/PRITHIVSAKTHIUR/DREX.git
To know more, visit the respective model card.
prithivMLmods posted an update about 1 month ago
The demo for SmolDocling / Nanonets OCR / Typhoon OCR / Monkey OCR explores the document OCR capabilities of various newly released multimodal VLMs in a single space. If you're experimenting with or demoing long document image OCR, kindly use the SmolDocling-256M preview [SmolDocling is back in the demo here]. 🤗

✦ Try the demo here : prithivMLmods/Multimodal-OCR2

‷ MonkeyOCR Recognition : echo840/MonkeyOCR
‷ Nanonets-OCR-s : nanonets/Nanonets-OCR-s
‷ SmolDocling-256M-preview : ds4sd/SmolDocling-256M-preview
‷ typhoon-ocr-7b : scb10x/typhoon-ocr-7b

‷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

‷ GitHub : https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR2


The community GPU grant was given by Hugging Face; special thanks to them. 🤗🚀



To know more, visit the respective model card.
prithivMLmods posted an update about 1 month ago
The demos for the MonkeyOCR Recognition model, which adopts a Structure-Recognition-Relation (SRR) triplet paradigm, and Nanonets-OCR-s, a powerful, state-of-the-art image-to-markdown OCR model that goes far beyond traditional text extraction, along with other experimental document OCR models, are combined into a single space.

✦ Try the demo here : prithivMLmods/core-OCR
✦ Try Nanonets-OCR-s demo here : prithivMLmods/Multimodal-OCR

‷ MonkeyOCR Recognition : echo840/MonkeyOCR
‷ docscopeOCR-7B-050425-exp : prithivMLmods/docscopeOCR-7B-050425-exp
‷ coreOCR-7B-050325-preview : prithivMLmods/coreOCR-7B-050325-preview
‷ Nanonets-OCR-s : nanonets/Nanonets-OCR-s

‷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Also included is a sample OCR test using the VisionOCR-3B-061125 model and the Qwen2-VL-OCR-2B-Instruct model.
‷ Blog : https://huggingface.co/blog/prithivMLmods/visionocr-3b-061125-vs-qwen2-vl-ocr-2b-instruct

To know more, visit the respective model card.
prithivMLmods posted an update about 2 months ago
OpenAI, Google, Hugging Face, and Anthropic have released guides and courses on building agents, prompting techniques, scaling AI use cases, and more. Below are 10+ minimalistic guides and courses that may help you make progress. 📖

‷ Agents Companion : https://www.kaggle.com/whitepaper-agent-companion
‷ Building Effective Agents : https://www.anthropic.com/engineering/building-effective-agents
‷ Guide to building agents by OpenAI : https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
‷ Prompt engineering by Google : https://www.kaggle.com/whitepaper-prompt-engineering
‷ Google: 601 real-world gen AI use cases : https://cloud.google.com/transform/101-real-world-generative-ai-use-cases-from-industry-leaders
‷ Prompt engineering by IBM : https://www.ibm.com/think/topics/prompt-engineering-guide
‷ Prompt Engineering by Anthropic : https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
‷ Scaling AI use cases : https://cdn.openai.com/business-guides-and-resources/identifying-and-scaling-ai-use-cases.pdf
‷ Prompting Guide 101 : https://services.google.com/fh/files/misc/gemini-for-google-workspace-prompting-guide-101.pdf
‷ AI in the Enterprise by OpenAI : https://cdn.openai.com/business-guides-and-resources/ai-in-the-enterprise.pdf

by HF 🤗 :
‷ AI Agents Course by Hugging Face : https://huggingface.co/learn/agents-course/unit0/introduction
‷ Smol-agents Docs : https://huggingface.co/docs/smolagents/en/tutorials/building_good_agents
‷ MCP Course by Hugging Face : https://huggingface.co/learn/mcp-course/unit0/introduction
‷ Other Courses (LLM, Computer Vision, Deep RL, Audio, Diffusion, Cookbooks, etc.) : https://huggingface.co/learn
prithivMLmods posted an update about 2 months ago
Just made a demo for Cosmos-Reason1, a physical AI model that understands physical common sense and generates appropriate embodied decisions in natural language through long chain-of-thought reasoning. Also added video understanding support to it. 🤗🚀

✦ Try the demo here : prithivMLmods/DocScope-R1

‷ Cosmos-Reason1-7B : nvidia/Cosmos-Reason1-7B
‷ docscopeOCR-7B-050425-exp : prithivMLmods/docscopeOCR-7B-050425-exp
‷ Captioner-Relaxed : Ertugrul/Qwen2.5-VL-7B-Captioner-Relaxed

‷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

‷ GitHub :
• https://github.com/PRITHIVSAKTHIUR/Cosmos-x-DocScope
• https://github.com/PRITHIVSAKTHIUR/Nvidia-Cosmos-Reason1-Demo

To know more, visit the respective model card.
prithivMLmods posted an update 2 months ago
Got access to Google's all-new Gemini Diffusion, a state-of-the-art text diffusion model. It delivers the performance of Gemini 2.0 Flash-Lite at 5x the speed, generating over 1,000 tokens in a fraction of a second and producing impressive results. Below are some initial outputs generated using the model. ♊🔥

Gemini Diffusion Playground ✦ : https://deepmind.google.com/frontiers/gemini-diffusion

Get Access Here : https://docs.google.com/forms/d/1aLm6J13tAkq4v4qwGR3z35W2qWy7mHiiA0wGEpecooo/viewform?edit_requested=true

🔗 To know more, visit: https://deepmind.google/models/gemini-diffusion/
prithivMLmods posted an update 2 months ago
More optimized explicit-content filters: lightweight guard models trained on SigLIP2 (patch16, 512) and ViT (patch16, 224) for illustration and explicit-content classification, for content moderation in social media, forums, and parental controls for safer browsing environments. This version fixes the issues in the previous release, which lacked sufficient resources. 🚀

‷ Models :
→ siglip2 mini explicit content : prithivMLmods/siglip2-mini-explicit-content [recommended]
→ vit mini explicit content : prithivMLmods/vit-mini-explicit-content
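A minimal sketch of running one of these classifiers with the transformers image-classification pipeline (the image path is a placeholder):

from transformers import pipeline

classifier = pipeline("image-classification", model="prithivMLmods/siglip2-mini-explicit-content")
print(classifier("some_image.jpg"))  # returns a list of {label, score} predictions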

‷ Building image safety-guard models : strangerguardhf

‷ Datasets :
→ nsfw multidomain classification : strangerguardhf/NSFW-MultiDomain-Classification
→ nsfw multidomain classification v2.0 : strangerguardhf/NSFW-MultiDomain-Classification-v2.0

‷ Collection :
→ Updated Versions [05192025] : prithivMLmods/explicit-content-filters-682aaa4733e378561925ca2b
→ Previous Versions : prithivMLmods/siglip2-content-filters-042025-final-680fe4aa1a9d589bf2c915ff

Find more collections inside the collection. 👆

To know more, visit the respective model card.