Humans often solve visual problems by sketching ideas in their minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?
That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.
These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.
🔧 Mirage is trained in two phases:
1) Grounding: the model learns to produce latent tokens anchored in real images.
2) Refinement: the model drops the images and learns to generate the visual tokens on its own.
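To make the two phases concrete, here's a minimal PyTorch-style sketch of what the objective could look like, assuming the latent visual tokens are continuous embeddings interleaved with the text; the attribute names (`logits`, `states`) and the simple MSE anchor are my own illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def mirage_loss(model, text_ids, latent_slots, image_feats=None, phase="grounding"):
    """Illustrative two-phase objective, not the official implementation.

    text_ids:     token ids of the interleaved reasoning trace, shape [B, T]
    latent_slots: positions reserved for latent visual tokens
    image_feats:  vision-encoder features of a real image (grounding phase only)
    """
    out = model(text_ids)  # assumed to return .logits [B, T, V] and .states [B, T, D]

    # Standard next-token loss on the reasoning trace.
    text_loss = F.cross_entropy(
        out.logits[:, :-1].flatten(0, 1), text_ids[:, 1:].flatten()
    )

    if phase == "grounding":
        # Phase 1: anchor the predicted latent tokens to real image features
        # (a plain MSE here; the paper's exact anchoring loss may differ).
        latent_pred = out.states[:, latent_slots]
        return text_loss + F.mse_loss(latent_pred, image_feats)

    # Phase 2 (refinement): drop the image supervision entirely; the latent
    # tokens are now shaped only by how well they help predict the text.
    return text_loss
```

The point is the asymmetry: phase 1 ties the latents to real image features, phase 2 lets them drift into whatever abstract "sketch" best supports the reasoning.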
📈 And yes, it works! On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines. Smart sketches > empty words.
The Kimi K2 tech report is full of gems, as always. Here are my notes on it:
> MuonClip: Pretty crazy how the training stabilizes after 70k steps and QK-clip becomes basically inactive (rough sketch after these notes). There is also no loss in performance from QK-clip, which is not trivial at all (shown at small scale, but with an aggressive threshold). Appendix E also has a cool explanation of why Muon makes the logits explode (tl;dr: Muon pushes the singular values of the update matrix higher).
> Sparsity scaling laws to justify their expert ratio. They have a very solid training infra that lets the model be trained at this sparsity level; they could have pushed sparsity even higher, but as it increases, training becomes less efficient.
> They reduce the number of attention heads to make long context more efficient, since attention heads are a big bottleneck for long context. They also remove 2 of the 3 "first dense" layers from the DSv3 arch.
With the sparsity and the attention heads divided by 2, they achieve 83% increased FLOPs compared to the DeepSeek-V3 arch at 128k.
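For intuition, here's a rough sketch of how I understand QK-clip: monitor the max attention logit each step, and if it crosses a threshold, shrink the query/key projection weights so the logits come back under control. The attribute names and threshold value are illustrative, and the report's exact per-head bookkeeping may differ:

```python
import torch

@torch.no_grad()
def qk_clip(attn_layer, max_logit: float, tau: float = 100.0):
    """Rough sketch of QK-clip as I read it: if the largest attention logit
    seen this step exceeds tau, rescale the query/key projection weights so
    future logits shrink back under the threshold. `w_q` / `w_k` are
    illustrative names; the report applies this per head."""
    if max_logit <= tau:
        return  # inactive -- per the report, this is the steady state after ~70k steps
    gamma = tau / max_logit  # < 1
    # Split the correction between Q and K so their product scales the logits by gamma.
    attn_layer.w_q.mul_(gamma ** 0.5)
    attn_layer.w_k.mul_(gamma ** 0.5)
```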
> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus into different styles; for longer documents they do it chunk by chunk (toy sketch below). I'm (half) surprised that ONLY 1 epoch of data rephrased 10 times (at the same number of training tokens, I think) gives better accuracy than 10 epochs of the same data rephrased once.
> They do rewriting for Math and Knowledge; for Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/evals to test it; as always, it's still a bit unclear to me how to properly measure that.
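Here's a toy sketch of the chunk-wise rephrasing idea for long documents; `rephrase` stands in for an LLM call, and the chunk size and style are made up, not their recipe:

```python
def rephrase_document(doc: str, rephrase, style: str = "learning note", chunk_chars: int = 4000):
    """Toy chunk-wise rephrasing: split a long document into chunks, rephrase
    each chunk in the target style with an LLM, and stitch the results back
    together. `rephrase` is a placeholder for that LLM call; the real recipe
    (chunking strategy, prompts, styles) isn't specified here."""
    chunks = [doc[i:i + chunk_chars] for i in range(0, len(doc), chunk_chars)]
    return "\n".join(rephrase(chunk, style=style) for chunk in chunks)
```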
The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit customized), EP=16, ZeRO-1
> No FP8 computation, but FP8 storage for specific layers; selective recomputation of inexpensive blocks; activation offloading to CPU
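Purely as a reading aid, the layout above condensed into an illustrative config dict; the field names are mine, not from their actual training stack:

```python
# Field names are mine, purely to summarize the notes above -- not their stack's config.
K2_TRAINING_LAYOUT = {
    "pipeline_parallel": 16,            # PP=16, slightly customized 1F1B schedule
    "expert_parallel": 16,              # EP=16 for the MoE layers
    "zero_stage": 1,                    # ZeRO-1 optimizer-state sharding
    "fp8_compute": False,               # no FP8 matmuls...
    "fp8_storage_layers": True,         # ...but FP8 storage for specific layers
    "selective_recompute": "cheap_blocks",  # recompute only the inexpensive blocks
    "activation_offload": "cpu",        # offload activations to CPU memory
}
```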
Many VLMs claim to process hours of video. But can they follow the story?🤔 Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳
We test three skills that matter for real-world use:
🔎 Localized Retrieval: find a specific action.
🧩 Information Synthesis: piece together scattered clues.
🏃 Fine-Grained Perception: analyze detailed motion (e.g., count how many times a person swings an axe).
The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos. Performance drops sharply with duration, showing that long video understanding is still challenging. We've found the breaking points; now the community can start fixing them. 📈
Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.
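If you want to poke at it, here's a minimal loading sketch with the `datasets` library; the repo id and field names are placeholders, so check the actual TimeScope release for the real dataset path and schema:

```python
from datasets import load_dataset

# "your-org/TimeScope" and the field names below are placeholders --
# check the TimeScope release for the actual dataset path and schema.
bench = load_dataset("your-org/TimeScope", split="test")

for sample in bench.select(range(3)):
    print(sample.get("question"), sample.get("video_duration"))
```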
📣 Looking for labeled, high-quality synthetic audio/TTS data 📣 Have you been, or are you currently, calling API endpoints from OpenAI, ElevenLabs, etc.? Do you have labeled audio data sitting around gathering dust? Let's talk! Join https://discord.gg/QuGxSWBfQy or comment down below.
If your data exceeds quantity & quality thresholds and is approved into the next hexgrad/Kokoro-82M training mix, and you permissively DM me the data under an effective Apache license, then I will DM back the corresponding voicepacks for YOUR data if/when the next Apache-licensed Kokoro base model drops.
What does this mean? If you've been calling closed-source TTS or audio API endpoints to:
- Build voice agents
- Make long-form audio, like audiobooks or podcasts
- Handle customer support, etc.
then YOU can contribute to the training mix and get useful artifacts in return. ❤️