How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks Paper • 2507.01955 • Published 8 days ago • 29
EmoNet Collection The full collection of our EmoNet effort. More info available at: https://huggingface.co/blog/felfri/emonet • 8 items • Updated 18 days ago • 4
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning Paper • 2506.21355 • Published 14 days ago • 9
DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement Paper • 2305.08227 • Published May 14, 2023 • 1
Article How to generate text: using different decoding methods for language generation with Transformers By patrickvonplaten • Mar 1, 2020 • 222
Scaling Laws for Native Multimodal Models Paper • 2504.07951 • Published Apr 10 • 29
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published Jun 2 • 113
Article SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data By danaaubakirova and 8 others • Jun 3 • 188
Article LTX-Video LoRA training study (Single image/style training) By neph1 • Jan 14 • 3
PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers Paper • 2506.05573 • Published Jun 5 • 71
FlexPainter: Flexible and Multi-View Consistent Texture Generation Paper • 2506.02620 • Published Jun 3 • 14
SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers Paper • 2506.00830 • Published Jun 1 • 7
Article Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints By sergeipetrov and 3 others • May 1, 2024 • 77
MedGemma Release Collection Collection of Gemma 3 variants trained for performance on medical text and image comprehension, to accelerate building healthcare-based AI applications. • 6 items • Updated about 12 hours ago • 187