Jen Wei

bird-of-paradise

AI & ML interests

My research interests focus on transformer architecture and RL for reasoning tasks. I'm passionate about ML and committed to contributing to advancements in the field through both technical work and community engagement.

Recent Activity

commented on an article 5 days ago

SmolLM3: smol, multilingual, long-context reasoner

updated a Space 6 days ago

bird-of-paradise/post-training-techniques-guide

published a Space 6 days ago

bird-of-paradise/post-training-techniques-guide

Organizations

None yet

commented on SmolLM3: smol, multilingual, long-context reasoner 5 days ago

Hi HF team, this SmolLM3 post got me curious about why you chose APO over GRPO, so I dove into comparing approaches across SmolLM3, Tulu3, and DeepSeek-R1. I ended up creating a visual guide, published as a 🤗 Space, to help navigate the post-training landscape.

Really interesting to see how different teams are solving similar problems with different technique combinations!

updated a Space 6 days ago

Post Training Techniques Guide

🚀

A visual guide to post-training techniques for LLMs

published a Space 6 days ago

Post Training Techniques Guide

🚀

A visual guide to post-training techniques for LLMs

commented on SmolLM3: smol, multilingual, long-context reasoner 6 days ago

"Distillation" means distilling knowledge from another model. "On-policy" means the training samples come from the model you are training. So yes, if you "distill", it cannot be on-policy.

updated a Space 12 days ago

ReTool Implementation

🔧

Enhance LLM reasoning with code execution

commented on SmolLM3: smol, multilingual, long-context reasoner 16 days ago

I gave my own question some more thought, and I wonder if the main reason is that you wanted to train off-policy so that you could use bigger models' outputs as training data (distillation). That would explain why on-policy algorithms like GRPO are not suitable for this situation.

commented on SmolLM3: smol, multilingual, long-context reasoner 17 days ago

I've been studying SmolLM3's dual-mode training approach and have a technical question about the choice of Anchored Preference Optimization (APO) over Group Relative Policy Optimization (GRPO) for handling reasoning capabilities.

Based on my understanding of both approaches:

APO (like DPO) works well for general instruction following and can handle reasoning tasks given appropriate preference data, which you generated using Qwen models
GRPO was specifically designed for mathematical reasoning with process supervision and eliminates the need for a value model, potentially offering computational efficiency advantages
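To make the "no value model" point concrete, here is a minimal sketch of the group-relative advantage that GRPO uses in place of a learned value baseline: each sampled response is scored against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for illustration, not details taken from the SmolLM3 post.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    The group mean acts as the baseline, so no separate value model is needed.
    `eps` (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and the rest get a negative one, which is why the responses have to be sampled from the policy being trained.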

I'm hypothesizing that APO was chosen because:

It provided a unified alignment approach for both reasoning and non-reasoning modes
It worked well with your synthetic preference data generation pipeline
You're treating reasoning as a specialized mode of instruction following rather than a fundamentally different task
The computational benefits of GRPO might not have outweighed the implementation complexity for your specific training setup

Could you clarify if I'm on the right track with this understanding? I'm particularly interested in whether you considered GRPO for the reasoning optimization and what factors ultimately led to choosing APO for both modes.

Thank you for sharing these details about SmolLM3's training recipe - the dual-mode approach and training pipeline are fascinating!

New activity in bird-of-paradise/deepseek-mla 23 days ago

Why do you have UV, UK, and UQ?

#2 opened 23 days ago by hudsongouge

commented on SmolLM3: smol, multilingual, long-context reasoner 23 days ago

Congratulations to the team on the release of SmolLM3! This is a really impressive piece of work, and the detailed ablations on GQA, NoPE, and the other architectural tweaks are super valuable for the community.

Your focus on pushing the boundaries of long-context performance is fascinating. It reminded me of a recent paper that tackles the same challenge from a completely different architectural angle.

I was just reading about ATLAS from Google Research (arXiv:2505.23735), which proposes replacing the standard attention mechanism with a modern recurrent "long-term memory module".

The core idea is to overcome the limitations of both Transformers and older RNNs by creating a memory that explicitly "learns to memorize the context" instead of just individual tokens. They introduce a concept called the Omega Rule, which updates the memory based on a sliding window of the past, rather than the "online" token-by-token update that can lead to issues like the "lost in the middle" problem.

The results they show are compelling, especially its ability to scale effectively to a 10M context length on the BABILong benchmark.

It's exciting to see two powerful—and very different—approaches for scaling context length. One path is perfecting the Transformer architecture (like SmolLM3), and the other is designing new memory-centric recurrent models (like ATLAS).

I'm curious to hear your thoughts on the future of these alternative architectures and if you envision a future where hybrid models might combine the best of both worlds!

Here's the paper for anyone interested: arXiv:2505.23735
