
AV LLMs
A collection of Audio, Video and Visual LLMs.
- Text-to-Speech • Updated • 451
- 1.05k
OpenVoice
🤗
dataautogpt3/ProteusV0.3
Text-to-Image • Updated • 124k • 93
ByteDance/SDXL-Lightning
Text-to-Image • Updated • 90.8k • 2.01k
openai/whisper-large-v3
Automatic Speech Recognition • Updated • 4.77M • 4.27k
stabilityai/TripoSR
Image-to-3D • Updated • 24.8k • 538
Efficient-Large-Model/VILA-7b
Text Generation • Updated • 166 • 26
google/paligemma-3b-pt-896
Image-Text-to-Text • Updated • 294 • 117
microsoft/Phi-3-vision-128k-instruct
Text Generation • Updated • 26.9k • 960
stabilityai/stable-audio-open-1.0
Text-to-Audio • Updated • 38.2k • 1.15k
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 39
aiola/whisper-medusa-v1
Updated • 14 • 179
merve/idefics3llama-vqav2
Updated • 8
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 1.43M • 3.65k
- 114
Llama3.1 S V0.2 Checkpoint 2024 08 20
😻 Convert text to audio and vice versa
gpt-omni/mini-omni
Text-to-Speech • Updated • 2 • 426
fishaudio/fish-speech-1.4
Text-to-Speech • Updated • 791 • 451
- 175
Tonic's GOT OCR
📲 GOT - OCR (from: UCAS, Beijing)
stepfun-ai/GOT-OCR2_0
Image-Text-to-Text • Updated • 110k • 1.46k
apple/coreml-sam2-large
Mask Generation • Updated • 25 • 25
coreml-projects/sam-2-studio
Updated • 24
mistralai/Pixtral-12B-2409
Image-Text-to-Text • Updated • 628
allenai/Molmo-72B-0924
Image-Text-to-Text • Updated • 1.77k • 284
openai/whisper-large-v3-turbo
Automatic Speech Recognition • Updated • 3.84M • 2.28k
Revai/reverb-asr
Automatic Speech Recognition • Updated • 11 • 84
- 357
GOT Online
💬 Extract text from images using various OCR modes
facebook/vfusion3d
Image-to-3D • Updated • 40 • 66
facebook/cotracker
Updated • 749 • 35
rhymes-ai/Aria
Image-Text-to-Text • Updated • 14.8k • 626
SWivid/F5-TTS
Text-to-Speech • Updated • 1.05M • 982
- 64
Ichigo Llama3.1 S Instruct
🏢 Generate text from audio recordings
kyutai/moshiko-mlx-q4
Updated • 869 • 28
kyutai/moshiko-mlx-q8
Updated • 659 • 5
- 106
Open VLM Video Leaderboard
🌎 VLMEvalKit Eval Results in video understanding benchmark
jimmycarter/LibreFLUX
Text-to-Image • Updated • 86 • 163
microsoft/OmniParser
Image-Text-to-Text • Updated • 919 • 1.66k
- 309
Aya Models
🌍 Interact with the Aya family of models.
CohereLabs/aya-expanse-32b
Text Generation • Updated • 10.4k • 246
stabilityai/stable-diffusion-3.5-medium
Text-to-Image • Updated • 372k • 673
OuteAI/OuteTTS-0.1-350M
Text-to-Speech • Updated • 1.07k • 301
vidore/colpali
Visual Document Retrieval • Updated • 11.6k • 434
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 69.8k • 106
si-pbc/hertz-dev
Audio-to-Audio • Updated • 211
- 38
Talk To Ultravox
⚡ Talk to Fixie.ai's Ultravox with WebRTC ⚡️
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 124
Xkev/Llama-3.2V-11B-cot
Image-Text-to-Text • Updated • 4.23k • 150
google/paligemma-3b-pt-224
Image-Text-to-Text • Updated • 48.8k • 316
apple/coreml-mobileclip
Updated • 214 • 41
InstantX/InstantIR
Image-to-Image • Updated • 4 • 170
- 86
InstantIR
🖼 diffusion-based Image Restoration model
- 151
Flux IP Adapter
🖼 Prompt with Images in flux[dev]
- 38
Image Preferences - Argilla annotation space
🖼 A community project to create an image preferences dataset.
fishaudio/fish-speech-1.5
Text-to-Speech • Updated • 17.2k • 536
meta-llama/Llama-3.3-70B-Instruct
Text Generation • Updated • 1.09M • 2.26k
- 48
Paligemma2 Vqav2
🐨 PaliGemma2 LoRA finetuned on VQAv2
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 111
fancyfeast/llama-joycaption-alpha-two-hf-llava
Updated • 14.7k • 171
taohu/mask
Updated • 5
[MASK] is All You Need
Paper • 2412.06787 • Published • 2
- 706
Open VLM Leaderboard
🌎 VLMEvalKit Evaluation Results Collection
microsoft/LLM2CLIP-Llama3.2-1B-EVA02-L-14-336
Zero-Shot Image Classification • Updated • 10
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 40
Generative Powers of Ten
Paper • 2312.02149 • Published • 8
- 25
StoryStar
💬 Fantasy story generator
GoodiesHere/Apollo-LMMs-Apollo-7B-t32
Video-Text-to-Text • Updated • 106 • 55
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • Updated • 1.08M • 1.18k
XiaoduoAILab/Xmodel_VLM
Text Generation • Updated • 96 • 12
nvidia/Cosmos-1.0-Diffusion-14B-Text2World
Updated • 16.2k • 55
nvidia/Cosmos-1.0-Autoregressive-12B
Updated • 69 • 30
nvidia/Cosmos-1.0-Autoregressive-13B-Video2World
Updated • 93 • 31
nvidia/Cosmos-1.0-Diffusion-7B-Text2World
Updated • 65.7k • 217
nvidia/Cosmos-1.0-Diffusion-14B-Video2World
Updated • 19k • 55
- 405
Stable Point-Aware 3D
⚡ Create 3D models from images
hexgrad/Kokoro-82M
Text-to-Speech • Updated • 2.02M • 4.03k
- 2.46k
Kokoro TTS
❤ Upgraded to v1.0!
openbmb/MiniCPM-o-2_6
Any-to-Any • Updated • 879k • 1.1k
- 346
TTS Spaces Arena
🤗 Blind vote on HF TTS models!
google/paligemma2-10b-pt-896
Image-Text-to-Text • Updated • 437 • 31
NovaSky-AI/Sky-T1-32B-Preview
Text Generation • Updated • 37.3k • 543
MiniMaxAI/MiniMax-VL-01
Image-Text-to-Text • Updated • 138 • 252
- 58
SmolVLM
📊 Generate descriptions from images and text prompts
HKUSTAudio/Llasa-3B
Text-to-Speech • Updated • 2.04k • 486
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text • Updated • 30.9k • 118
deepseek-ai/Janus-Pro-7B
Any-to-Any • Updated • 242k • 3.33k
- 263
Kokoro TTS Zero
🎴 ✨ [With v1.0.0] Accelerated TTS on Kokoro-82M
kyutai/hibiki-2b-mlx-bf16
Translation • Updated • 207 • 17
kyutai/hibiki-2b-pytorch-bf16
Translation • Updated • 116 • 50
ARTPARK-IISc/Vaani
Viewer • Updated • 9.72M • 2.38k • 49
Zyphra/Zonos-v0.1-hybrid
Text-to-Speech • Updated • 11.2k • 1.06k
Zyphra/Zonos-v0.1-transformer
Text-to-Speech • Updated • 73.4k • 391
microsoft/OmniParser-v2.0
Updated • 2.57k • 1.22k
- 88
Paligemma2 Mix
🌖 Generate text or segment objects from an image
google/paligemma2-3b-mix-448
Image-Text-to-Text • Updated • 20.7k • 44
google/paligemma2-3b-mix-224
Image-Text-to-Text • Updated • 33.9k • 26
google/paligemma2-28b-mix-224
Image-Text-to-Text • Updated • 1.78k • 4
google/paligemma2-28b-mix-448
Image-Text-to-Text • Updated • 305 • 26
google/paligemma2-10b-mix-224
Image-Text-to-Text • Updated • 527 • 7
google/paligemma2-10b-mix-448
Image-Text-to-Text • Updated • 31.9k • 25
stepfun-ai/stepvideo-t2v
Text-to-Video • Updated • 295 • 426
stepfun-ai/stepvideo-t2v-turbo
Updated • 89
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Paper • 2502.10248 • Published • 55
HuggingFaceTB/SmolVLM2-2.2B-Instruct
Image-Text-to-Text • Updated • 40.7k • 158
nvidia/canary-1b
Automatic Speech Recognition • Updated • 17.3k • 421
Wan-AI/Wan2.1-I2V-14B-720P
Image-to-Video • Updated • 27.8k • 415
fastrtc/kokoro-onnx
Updated • 10
- 2
Fastphone
🐠 Download and run an app from a Hugging Face repository
microsoft/Phi-4-multimodal-instruct
Automatic Speech Recognition • Updated • 800k • 1.3k
microsoft/Magma-8B
Image-Text-to-Text • Updated • 5.29k • 354
- 27
Magma UI
📚 Magma-8B model for UI Agents
- 545
Di♪♪Rhythm
🎶 Blazingly Fast and Embarrassingly Simple Song Generation
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
Paper • 2503.01183 • Published • 26
ASLP-lab/DiffRhythm-vae
Updated • 38
ASLP-lab/DiffRhythm-base
Updated • 353 • 156
Large Language Diffusion Models
Paper • 2502.09992 • Published • 112
GSAI-ML/LLaDA-8B-Instruct
Text Generation • Updated • 110k • 244
unsloth/gemma-3-12b-pt
Image-Text-to-Text • Updated • 1.35k • 3
google/gemma-3-27b-it
Image-Text-to-Text • Updated • 925k • 1.18k
sesame/csm-1b
Text-to-Speech • Updated • 103k • 1.88k
unsloth/gemma-3-27b-it-GGUF
Image-Text-to-Text • Updated • 147k • 92
ds4sd/SmolDocling-256M-preview
Image-Text-to-Text • Updated • 86.5k • 1.23k
starvector/starvector-8b-im2svg
Text Generation • Updated • 51.5k • 445
starvector/starvector-1b-im2svg
Text Generation • Updated • 11.6k • 155
Tokenize Image as a Set
Paper • 2503.16425 • Published • 15
kyutai/moshika-vis-pytorch-bf16
Updated • 56
kyutai/Babillage
Viewer • Updated • 465k • 781 • 9
ByteDance/InfiniteYou
Text-to-Image • Updated • 10.7k • 569
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper • 2503.16418 • Published • 34
openfree/flux-chatgpt-ghibli-lora
Text-to-Image • Updated • 13k • 244
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
Paper • 2504.00595 • Published • 34
weizhiwang/Open-Qwen2VL
Image-Text-to-Text • Updated • 386 • 13
ostris/Flex.1-alpha-Redux
Updated • 62
unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit
Image-Text-to-Text • Updated • 54.4k • 74
unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-8bit
Image-Text-to-Text • Updated • 1.73k • 9
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 158
canopylabs/3b-hi-ft-research_release
Text-to-Speech • Updated • 840 • 13
canopylabs/3b-es_it-ft-research_release
Text-to-Speech • Updated • 458 • 8
nvidia/C-RADIOv2-g
Updated • 295 • 10