Thomas Liang PRO

thliang01

AI & ML interests

Efficient ML, diffusion model, LLM, post-training

Recent Activity

published a Space 1 day ago
thliang01/streamlit-twinkle-gallery
reacted to lianghsun's post with 🔥 1 day ago
With the arrival of Twinkle April (Twinkle AI's annual open-source celebration, held every April), our community is excited to unveil its very first project: 📊 Twinkle Eval (https://github.com/ai-twinkle/Eval), a next-generation evaluation tool led by our contributor @tedslin.

Unlike traditional evaluation tools like iKala's ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time increases with more complex models, traditional tools become increasingly inefficient 😲. For example, evaluating LRMs on the https://huggingface.co/datasets/ikala/tmmluplus benchmark could take half a day without finishing.

One question we were especially curious about: Does shuffling multiple-choice answer order impact model accuracy? 🤔
→ See: "Change Answer Order Can Decrease MMLU Accuracy", arXiv:2406.19470v1

To address these challenges, Twinkle Eval brings three key innovations to the table:

1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness

After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under the 2️⃣3️⃣ test settings compared to their claimed performance, suggesting further benchmarking is needed.

This framework also comes with additional tunable parameters and detailed logging of LM behavior per question, perfect for those who want to dive deeper. 😆

If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗
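To make the three ideas in the post concrete, here is a minimal Python sketch of parallel, multi-round evaluation with shuffled answer order. It is not Twinkle Eval's actual code or API: the function names, the question dictionary layout, and the `ask_model(prompt, options)` callable are all assumptions made for illustration, and the toy "oracle" model stands in for a real model call.

```python
import random
import statistics
import string
from concurrent.futures import ThreadPoolExecutor

# Hypothetical question format (an assumption, not Twinkle Eval's schema).
# Option texts are assumed to be distinct within a question.
QUESTIONS = [
    {"prompt": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer_index": 1},
    {"prompt": "Capital of France?", "options": ["Paris", "Rome", "Oslo", "Bern"], "answer_index": 0},
]


def shuffle_options(question, rng):
    """Shuffle one question's options and return (options, correct_label)."""
    options = list(question["options"])
    correct_text = options[question["answer_index"]]
    rng.shuffle(options)
    return options, string.ascii_uppercase[options.index(correct_text)]


def run_round(questions, ask_model, seed, max_workers=8):
    """Score every question once, in parallel, with a seeded answer order."""
    rng = random.Random(seed)
    # Shuffle in the calling thread so a given seed always yields the same order.
    prepared = [(q["prompt"], *shuffle_options(q, rng)) for q in questions]

    def is_correct(item):
        prompt, options, correct_label = item
        return ask_model(prompt, options) == correct_label

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(is_correct, prepared))
    return sum(results) / len(results)


def evaluate(questions, ask_model, rounds=3, max_workers=8):
    """Run several rounds and return (mean accuracy, population std dev)."""
    accuracies = [run_round(questions, ask_model, seed, max_workers) for seed in range(rounds)]
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


def oracle_model(prompt, options):
    """Toy stand-in for a model call: always finds the right option text."""
    truth = {q["prompt"]: q["options"][q["answer_index"]] for q in QUESTIONS}
    return string.ascii_uppercase[options.index(truth[prompt])]


if __name__ == "__main__":
    mean_acc, std_acc = evaluate(QUESTIONS, oracle_model, rounds=3)
    print(f"accuracy: {mean_acc:.2f} (std {std_acc:.2f})")
```

A thread pool is used here on the assumption that each model call is a network request, so the work is I/O-bound. A model that answers by position rather than by content would show a non-zero spread across rounds, which is the kind of instability the multi-round and shuffled settings are meant to surface.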

Organizations

lora concepts library
Stable Diffusion Dreambooth Concepts Library
ZeroGPU Explorers
Project Fluently
MLX Community
Social Post Explorers
Stable Diffusion Community (Unofficial, Non-profit)
Hugging Face Discord Community
Twinkle AI
Hugging Face MCP Course
Agents-MCP-Hackathon
OpenAI gpt-oss Grants
Scratch to Scale