simpx · 0 followers · 11 following
AI & ML interests: None yet
Recent Activity
- liked a model about 1 month ago: Qwen/Qwen3-0.6B
- reacted to Kseniase's post with ❤️ about 1 month ago:
11 Alignment and Optimization Algorithms for LLMs

When we need to align a model's behavior with desired objectives, we rely on specialized algorithms that support helpfulness, accuracy, reasoning, safety, and alignment with user preferences. Much of a model's usefulness comes from post-training optimization. Here are the main optimization algorithms (both classic and new) in one place:

1. PPO (Proximal Policy Optimization) -> https://huggingface.co/papers/1707.06347
Clips the probability ratio to prevent the new policy from diverging too far from the old one, which keeps training stable.

2. DPO (Direct Preference Optimization) -> https://huggingface.co/papers/2305.18290
A non-RL method in which the language model acts as an implicit reward model. It uses a simple loss to boost the preferred answer's probability over the less preferred one.

3. GRPO (Group Relative Policy Optimization) -> https://huggingface.co/papers/2402.03300
An RL method that compares a group of model outputs for the same input and updates the policy based on their relative rankings, so no separate critic model is needed. Its latest application is Flow-GRPO, which adds online RL to flow-matching models -> https://huggingface.co/papers/2505.05470

4. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) -> https://huggingface.co/papers/2503.14476
Decouples the clipping bounds for flexibility and introduces four key techniques: clip-higher (to maintain exploration), dynamic sampling (to ensure useful gradient updates), token-level loss (to balance learning across long outputs), and overlong reward shaping (to handle long, truncated answers).

5. SFT (Supervised Fine-Tuning) -> https://huggingface.co/papers/2203.02155
Often the first post-pretraining step: the model is fine-tuned on a dataset of high-quality human-written input-output pairs to directly teach desired behaviors.

More in the comments 👇
If you liked it, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
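The PPO clipping from item 1 can be sketched in a few lines. This is a minimal pure-Python illustration over per-sample lists; the function name, the list-based interface, and the fixed `eps=0.2` are illustrative assumptions, not any library's API:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, negated into a loss to minimize.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and old policies; advantages: advantage estimates per sample.
    """
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # pi_new / pi_old
        unclipped = ratio * adv
        # Clamp the ratio into [1 - eps, 1 + eps] before weighting
        clipped = max(1 - eps, min(ratio, 1 + eps)) * adv
        # Pessimistic (minimum) objective, negated for gradient descent
        losses.append(-min(unclipped, clipped))
    return sum(losses) / len(losses)
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negative mean advantage; when the ratio drifts outside the clip band, the gradient through it is cut off, which is what keeps updates stable.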
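The "simple loss" behind DPO (item 2) is the negative log-sigmoid of a scaled margin between the policy's and a frozen reference model's log-probability ratios for the chosen vs. rejected response. A minimal sketch for a single preference pair, with illustrative names and the common default `beta=0.1` as assumptions:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred (winner) and
    less-preferred (loser) responses; ref_*: the same quantities under
    the frozen reference model; beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) favors the preferred answer over the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the preferred answer wins clearly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a margin of zero the loss is log 2; pushing the preferred answer's probability up (or the rejected one's down) relative to the reference drives the loss toward zero, with no explicit reward model or RL loop.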
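GRPO's critic-free trick (item 3) is to score each completion relative to the other completions sampled for the same prompt: rewards are standardized within the group, and the standardized score is used as the advantage. A minimal sketch; the function name and the `1e-8` stabilizer are illustrative:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within a group of
    completions for the same prompt, replacing a learned critic."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # Small constant avoids division by zero when all rewards are equal
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Completions better than the group average get positive advantages and are reinforced; worse-than-average ones get negative advantages, all without a separate value network.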
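DAPO's "clip-higher" (item 4) is a small but consequential change to the PPO clip: the lower and upper bounds are decoupled, and the upper bound is raised so that low-probability tokens can still gain probability mass, preserving exploration. A sketch of just that piece, with illustrative epsilon values:

```python
import math

def dapo_clip_loss(logp_new, logp_old, advantages,
                   eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate loss with decoupled clip bounds
    (DAPO's clip-higher). The epsilon defaults are illustrative."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        # Asymmetric clamp: a looser ceiling (1 + eps_high) lets
        # low-probability tokens grow more than symmetric PPO would allow
        clipped = max(1 - eps_low, min(ratio, 1 + eps_high)) * adv
        losses.append(-min(ratio * adv, clipped))
    return sum(losses) / len(losses)
```

With `eps_low == eps_high` this reduces exactly to the symmetric PPO clip; the other three DAPO techniques (dynamic sampling, token-level loss, overlong reward shaping) act on batching and reward shaping rather than on this objective.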
- liked a model about 1 month ago: ACE-Step/ACE-Step-v1-3.5B
Organizations: None yet
simpx's activity
liked 2 models about 1 month ago:
- Qwen/Qwen3-0.6B · Text Generation · Updated 29 days ago · 1,000k · 370
- ACE-Step/ACE-Step-v1-3.5B · Text-to-Audio · Updated 28 days ago · 516
liked 2 datasets about 1 month ago:
- bigcode/starcoderdata · Viewer · Updated May 16, 2023 · 207M · 2.21k · 442
- cerebras/SlimPajama-627B · Preview · Updated Jul 7, 2023 · 51.7k · 476
liked 2 models about 1 month ago:
- TinyLlama/TinyLlama-1.1B-Chat-v1.0 · Text Generation · Updated Mar 17, 2024 · 1.33M · 1.3k
- Qwen/Qwen3-235B-A22B · Text Generation · Updated 29 days ago · 138k · 946
liked 9 datasets about 1 month ago:
- transformersbook/codeparrot · Viewer · Updated Feb 5, 2022 · 18.7M · 396 · 58
- Pain-Killer/nier2b-dataset · Viewer · Updated Mar 7 · 10 · 22 · 1
- openai/gsm8k · Viewer · Updated Jan 4, 2024 · 17.6k · 503k · 770
- wikimedia/wikipedia · Viewer · Updated Jan 9, 2024 · 61.6M · 81.5k · 847
- databricks/databricks-dolly-15k · Viewer · Updated Jun 30, 2023 · 15k · 13.4k · 825
- tiiuae/falcon-refinedweb · Viewer · Updated Jun 20, 2023 · 968M · 13k · 857
- OpenAssistant/oasst1 · Viewer · Updated May 2, 2023 · 88.8k · 6.65k · 1.4k
- Open-Orca/OpenOrca · Viewer · Updated Feb 19 · 2.94M · 9.55k · 1.41k
- HuggingFaceFW/fineweb · Viewer · Updated Jan 31 · 25B · 278k · 2.2k
liked a model about 1 month ago:
- open-r1/OpenR1-Qwen-7B · Text Generation · Updated 22 days ago · 2.63k · 52
liked 4 datasets about 1 month ago:
- open-r1/codeforces-cots · Viewer · Updated Mar 28 · 254k · 3.71k · 176
- open-thoughts/OpenThoughts-114k · Viewer · Updated 14 days ago · 228k · 39.8k · 714
- Anthropic/hh-rlhf · Viewer · Updated May 26, 2023 · 169k · 10.8k · 1.36k
- fka/awesome-chatgpt-prompts · Viewer · Updated Jan 6 · 203 · 22.8k · 7.97k