On Vacation 🏝️


Allanatrix PRO

Allanatrix

https://allanatrix.dev/

DarkStarStrix

AI & ML interests

ML engineering/Research

Recent Activity

liked a model 5 days ago

unsloth/Qwen3.6-35B-A3B-GGUF

liked a dataset 6 days ago

proxima-fusion/constellaration

liked a Space 6 days ago

proxima-fusion/constellaration-bench

liked a model 5 days ago

unsloth/Qwen3.6-35B-A3B-GGUF

Image-Text-to-Text • 35B • Updated 7 days ago • 1.57M • 800

liked a dataset 6 days ago

proxima-fusion/constellaration

Viewer • Updated Oct 28, 2025 • 930k • 1.93k • 20

liked a Space 6 days ago

ConStellaration Design Leaderboard

🔋

Explore the ConStellaration boundary leaderboard

reacted to anakin87's post with ❤️ 6 days ago

How does LLM training with RL environments work?

It all starts with Reinforcement Learning with Verifiable Rewards (RLVR):
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, and reward(s).
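A minimal Python sketch of that simple environment, assuming the model is prompted to wrap its final answer in `<answer>...</answer>` tags. The tag format, the `Example` dataclass, the 0.2 weight on the format reward, and the `generate` callable (a stand-in for the model) are illustrative assumptions, not taken from the linked course.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    question: str
    answer: str  # ground truth


# Assumed output format: the final answer is wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion contains exactly one <answer> block, else 0.0."""
    return 1.0 if len(ANSWER_RE.findall(completion)) == 1 else 0.0


def correctness_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the first <answer> block matches the ground truth (whitespace-trimmed)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


def rollout(example: Example, generate: Callable[[str], str]) -> float:
    """One rollout: ask the question, let the model answer, score it deterministically."""
    completion = generate(example.question)  # reasoning + answer
    # Combined reward; the 0.2 weight on format is an arbitrary assumption.
    return correctness_reward(completion, example.answer) + 0.2 * format_reward(completion)
```

For example, `rollout(Example("What is 2 + 3?", "5"), lambda q: "2 + 3 = 5. <answer>5</answer>")` returns 1.2: full credit for correctness plus the format bonus. Because both checks are plain string comparisons, the reward is verifiable and needs no learned judge.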

Consider a more complex tic-tac-toe env ❌⭕
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions

(envs can also include tools)
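A rough sketch of how those features could fit together, assuming the model plays `X` against a scripted `O` opponent. `TicTacToeEnv`, the `opponent_skill` knob (probability of a win-or-block move instead of a random one), the random opening used as "dynamic game generation", and the +1/0/-1 win/draw/loss rewards (illegal moves also score -1) are hypothetical choices for illustration, not the course's implementation. Each `step` call is one turn of the multi-turn interaction.

```python
import random
from typing import List, Optional, Tuple

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board: List[str]) -> Optional[str]:
    """Return 'X' or 'O' if that mark has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


class TicTacToeEnv:
    """Hypothetical env: the model plays 'X', a scripted opponent plays 'O'."""

    def __init__(self, opponent_skill: float = 0.5, seed: Optional[int] = None):
        self.opponent_skill = opponent_skill  # in [0, 1]
        self.rng = random.Random(seed)
        self.reset()

    def reset(self) -> List[str]:
        self.board = [" "] * 9
        self.done = False
        # Assumed "dynamic game generation": half the time the opponent opens randomly.
        if self.rng.random() < 0.5:
            self.board[self.rng.choice(range(9))] = "O"
        return list(self.board)

    def _empty(self) -> List[int]:
        return [i for i, c in enumerate(self.board) if c == " "]

    def _completing_move(self, mark: str) -> Optional[int]:
        """Find a square that would give `mark` three in a row, if any."""
        for i in self._empty():
            self.board[i] = mark
            wins = winner(self.board) == mark
            self.board[i] = " "
            if wins:
                return i
        return None

    def _opponent_move(self) -> int:
        if self.rng.random() < self.opponent_skill:
            for mark in ("O", "X"):  # try to win first, then try to block
                move = self._completing_move(mark)
                if move is not None:
                    return move
        return self.rng.choice(self._empty())

    def _finish(self, reward: float) -> Tuple[List[str], float, bool]:
        self.done = True
        return list(self.board), reward, True

    def step(self, move: int) -> Tuple[List[str], float, bool]:
        """Apply the model's move, then the opponent's reply; return (board, reward, done)."""
        if self.done or move not in self._empty():
            return self._finish(-1.0)  # illegal move ends the game
        self.board[move] = "X"
        if winner(self.board) == "X":
            return self._finish(1.0)
        if not self._empty():
            return self._finish(0.0)  # draw
        self.board[self._opponent_move()] = "O"
        if winner(self.board) == "O":
            return self._finish(-1.0)
        if not self._empty():
            return self._finish(0.0)
        return list(self.board), 0.0, False
```

Raising `opponent_skill` toward 1.0 makes the opponent harder to beat, which is one way to build a curriculum as the model improves.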

---

What happens during training?

We use Group Relative Policy Optimization (GRPO) with a tic-tac-toe env.

No critic model is needed (the group is the baseline), which makes it simpler than PPO.

1️⃣ Rollout generation: from the same board, model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline

🔁 Repeat
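Steps 3 and 4 are small enough to show directly. This sketch follows the post (advantage = score minus group mean); the `normalize_std` option reflects the original GRPO formulation from DeepSeekMath, which also divides by the group's standard deviation (population std here; implementations differ). The loss is a deliberately simplified REINFORCE-style surrogate for step 5: actual GRPO uses a PPO-style clipped probability ratio and, in its original form, a KL penalty toward a reference model, both omitted here. Function names are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], normalize_std: bool = False,
                     eps: float = 1e-6) -> List[float]:
    """Advantage of each rollout relative to its group (the group is the baseline)."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize_std:
        # Variant from the original GRPO paper: also scale by the group's spread.
        std = pstdev(rewards)
        advantages = [a / (std + eps) for a in advantages]
    return advantages


def policy_gradient_loss(advantages: List[float], logprobs: List[float]) -> float:
    """Simplified surrogate: minimizing it raises the log-probability of
    above-baseline rollouts and lowers it for below-baseline ones."""
    return -mean(a * lp for a, lp in zip(advantages, logprobs))
```

With rewards `[1, 0, 0, 1]` for four games played from the same board, the two wins get advantage +0.5 and the two losses -0.5, so the update pushes toward the winning trajectories. If every game in a group scores the same, all advantages are zero and that group contributes no learning signal.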

For a deep dive, check out
🌱 https://github.com/anakin87/llm-rl-environments-lil-course
a free hands-on course on RL environments for LLMs

updated a Space 6 days ago

Aetrhon Labs

🧪

published a Space 6 days ago

Aetrhon Labs

🧪

published an article 6 days ago

Article

How I pre-trained a MS/MS model from scratch

6 days ago

liked a model 6 days ago

unsloth/Kimi-K2.6-GGUF

Image-Text-to-Text • 1T • Updated 5 days ago • 16.6k • 112

New activity in AethronPhantom/NexaMass-V3-Struct 7 days ago

Add MassSpecGym evaluation adapter and safetensors runtime loader

#1 opened 7 days ago by Allanatrix

updated a model 7 days ago

AethronPhantom/NexaMass-V3-Struct

Feature Extraction • Updated 7 days ago • 67 • 1

upvoted a paper 9 days ago

MassSpecGym: A benchmark for the discovery and identification of molecules

Paper • 2410.23326 • Published Oct 30, 2024 • 1

published a model 12 days ago

AethronPhantom/NexaMass-V3-Struct

Feature Extraction • Updated 7 days ago • 67 • 1

liked a model 18 days ago

zai-org/GLM-5.1

Text Generation • 754B • Updated 11 days ago • 231k • 1.52k

upvoted a changelog 26 days ago

Hugging Face Changelog

Storage Buckets for Spaces

27 days ago

• 129

updated a dataset 27 days ago

Allanatrix/qwen14b-opencodeinstruct-artifacts-20260307

Updated 27 days ago • 17

published a dataset 27 days ago

Allanatrix/qwen14b-opencodeinstruct-artifacts-20260307

Updated 27 days ago • 17

liked a Space 27 days ago

The Synthetic Data Playbook: Generating Trillions of the Finest Tokens

📝

229

Explore synthetic data experiments on a virtual bookshelf

upvoted a changelog about 1 month ago

Hugging Face Changelog

Introducing hf-mount

Mar 24

• 223

reacted to MaziyarPanahi's post with 🔥 about 1 month ago

We annotated 119K medical images with two frontier VLMs (Qwen 3.5, Kimi K2.5), cross-validated them at 93% agreement, and produced 110K training records, all for under $500. Fine-tuning three small models (2-3B params) improved results on every benchmark; the best model reaches +15.0% average exact match.

Everything is open-sourced: datasets, adapters, and code.

https://huggingface.co/blog/OpenMed/synthvision