gg-tt (gg-tt)

posted an update 5 days ago

Post

1612

Saying Claude 4 is "the best coding model in the world" based on its SWE-bench scores is super misleading, and here is why:

If you look at the announcement table, their model has the best scores, but... if you look at the very bottom, in size 4 font, you'll see that the metric they report is actually not the same metric as the one used for the other models!

Comparing "pass@1 averaged 10 times" to "normal pass@1" is like grading one student by letting them take the test 10 times and averaging their scores, while the other students only get a single attempt.

The first way to grade (avg@10) is actually quite good statistically, much better than what model creators usually report, because models tend to be quite inconsistent - sometimes good, sometimes bad...
But! You want to do it for all models then, and report with error bars.
The issue is that, if you do... well, it's going to be harder to say your model is the best, because the error bars will overlap between models, by a lot.
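
To make the error-bar point concrete, here is a minimal sketch of avg@10 with a rough 95% interval. The scores are made up for two hypothetical models (they are not from any announcement), and the interval uses a normal approximation, which is crude for only 10 runs.

```python
import statistics

def avg_at_k(run_scores):
    """Mean pass@1 over k independent runs, plus the standard error of that mean."""
    k = len(run_scores)
    mean = statistics.fmean(run_scores)
    # Sample standard deviation across runs, divided by sqrt(k).
    sem = statistics.stdev(run_scores) / k ** 0.5
    return mean, sem

def interval(run_scores, z=1.96):
    """Rough 95% interval around the mean (normal approximation)."""
    mean, sem = avg_at_k(run_scores)
    return mean - z * sem, mean + z * sem

def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up pass@1 scores for 10 runs of two hypothetical models.
model_a_runs = [0.72, 0.70, 0.74, 0.69, 0.73, 0.71, 0.75, 0.68, 0.72, 0.70]
model_b_runs = [0.70, 0.73, 0.68, 0.71, 0.69, 0.72, 0.67, 0.74, 0.70, 0.69]

ci_a = interval(model_a_runs)
ci_b = interval(model_b_runs)
print(f"model A avg@10: {avg_at_k(model_a_runs)[0]:.3f} ({ci_a[0]:.3f} to {ci_a[1]:.3f})")
print(f"model B avg@10: {avg_at_k(model_b_runs)[0]:.3f} ({ci_b[0]:.3f} to {ci_b[1]:.3f})")
print("intervals overlap:", intervals_overlap(ci_a, ci_b))
```

With these toy numbers, model A has the higher average, but the intervals overlap, and a single lucky run of model B would beat model A's average. That is why mixing avg@10 for one model with single-run pass@1 for the others is not a fair comparison.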

Also, you'll see that 2 numbers are reported: the first one is using avg@10 (what I explained above), and the second, highest one is using this plus many other tricks:
- test time compute (so having the model generate a tree of answers and selecting the best as you go, more or less)
- removing the times when the model breaks the tests
- and using another model to select the most promising solution!
You can't really say it's better than the rest, mostly because it's **way less efficient** to achieve a similar result.
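
As a rough illustration of the last two tricks, here is a sketch of "filter, then let a scorer pick" over several candidates. This is one reading of the description above, not the actual pipeline used in the announcement; `passes_visible_tests`, `score`, and the candidate names are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional

def select_best_of_n(
    candidates: List[str],
    passes_visible_tests: Callable[[str], bool],
    score: Callable[[str], float],
) -> Optional[str]:
    """Pick one candidate out of n generated attempts."""
    # Step 1: drop the attempts that break the tests.
    survivors = [c for c in candidates if passes_visible_tests(c)]
    if not survivors:
        return None
    # Step 2: a separate scoring model ranks what is left.
    return max(survivors, key=score)

# Toy stand-ins: three attempts for one task, with made-up scorer outputs.
attempts = ["patch_ok_small", "patch_breaks_tests", "patch_ok_larger_fix"]
toy_scores = {
    "patch_ok_small": 0.4,
    "patch_breaks_tests": 0.9,
    "patch_ok_larger_fix": 0.7,
}

best = select_best_of_n(
    attempts,
    passes_visible_tests=lambda c: "breaks" not in c,
    score=lambda c: toy_scores[c],
)
print("selected:", best)
print("generations spent on this task:", len(attempts))
```

The efficiency cost is visible even in the toy version: every task pays for n generations, plus a test run and a scoring call per candidate, where a plain pass@1 run pays for one generation.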

It's honestly a bit sad because, from user reports, the model sounds good. However, this announcement is overblown numbers-wise, and I'm quite sure it's more a problem of "too much marketing" than of "bad science".

Another thing that makes the comparison invalid is the complete absence of open source models from the report. Are they not aware of DeepSeek, Qwen, the new Mistral for code, and all the cool specialised models found on the hub?

1 reply

reach-vb

posted an update 9 days ago

Post

3272

hey hey @mradermacher - VB from Hugging Face here, we'd love to onboard you over to our optimised xet backend! 💥

as you know, we're in the process of upgrading our storage backend to xet (which helps us scale and offer blazingly fast upload/download speeds too): https://huggingface.co/blog/xet-on-the-hub
now that we are certain the backend can scale even with big models like Llama 4/Qwen 3, we're moving to the next phase: inviting impactful orgs and users on the hub over. as you are a big part of the open source ML community, we would love to onboard you next and create some excitement about it in the community too!

in terms of actual steps - it should be as simple as having one of the org admins join hf.co/join/xet - we'll take care of the rest.

p.s. you'd need the latest version of the huggingface_hub lib with hf_xet, but everything else should be the same: https://huggingface.co/docs/hub/storage-backends#using-xet-storage
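
a quick way to see what is installed locally, as a minimal sketch using only the Python standard library. it assumes the two packages are distributed under the names `huggingface_hub` and `hf_xet`; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Package names are assumptions based on the post above.
for name in ("huggingface_hub", "hf_xet"):
    print(name, installed_version(name) or "not installed")
```

if either line prints "not installed", or the version looks old, upgrading both packages is the first thing to try.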

p.p.s. this is fully backwards compatible so everything will work as it should! 🤗

11 replies

clefourrier

posted an update 10 days ago

Post

507

Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval actually tells you precisely, during training, how well and what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers prompt choice, metrics, and datasets in depth, across languages and capabilities, and my fave section is "which properties should evals have"👌
(so you know how to select the best evals for your use case)

Blog: HuggingFaceFW/blogpost-fine-tasks

2 replies

gneubig

authored a paper 13 days ago

The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Paper • 2505.10185 • Published 13 days ago • 25

paultimothymooney

authored 2 papers 15 days ago

CORD-19: The COVID-19 Open Research Dataset

Paper • 2004.10706 • Published Apr 22, 2020

Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

Paper • 2505.00612 • Published 27 days ago • 9

PhilCulliton

authored a paper 15 days ago

Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

Paper • 2505.00612 • Published 27 days ago • 9

danielhanchen

posted an update 28 days ago

Post

1818

💜 Qwen3 128K Context Length: We've released Dynamic 2.0 GGUFs + 4-bit safetensors!
Fixed: Now works on any inference engine and fixed issues with the chat template.
Qwen3 GGUFs:
30B-A3B: unsloth/Qwen3-30B-A3B-GGUF
235B-A22B: unsloth/Qwen3-235B-A22B-GGUF
32B: unsloth/Qwen3-32B-GGUF

Read our guide on running Qwen3 here: https://docs.unsloth.ai/basics/qwen3-how-to-run-and-finetune

128K Context Length:
30B-A3B: unsloth/Qwen3-30B-A3B-128K-GGUF
235B-A22B: unsloth/Qwen3-235B-A22B-128K-GGUF
32B: unsloth/Qwen3-32B-128K-GGUF

All Qwen3 uploads: unsloth/qwen3-680edabfb790c8c34a242f95

sanmikoyejo

authored a paper 28 days ago

The Leaderboard Illusion

Paper • 2504.20879 • Published 29 days ago • 69

vukosi

authored 8 papers 29 days ago

Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting

Paper • 2310.00272 • Published Sep 30, 2023 • 1

Masakhane -- Machine Translation For Africa

Paper • 2003.11529 • Published Mar 13, 2020

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Paper • 2010.02353 • Published Oct 5, 2020

BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

Paper • 2502.11926 • Published Feb 17 • 2

The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Paper • 2502.15916 • Published Feb 21

Xenova

posted an update about 1 month ago

Post

7423

Introducing the ONNX model explorer: Browse, search, and visualize neural networks directly in your browser. 🤯 A great tool for anyone studying Machine Learning! We're also releasing the entire dataset of graphs so you can use them in your own projects! 🤗

Check it out! 👇
Demo: onnx-community/model-explorer
Dataset: onnx-community/model-explorer
Source code: https://github.com/xenova/model-explorer

danielhanchen

posted an update about 1 month ago

Post

5806

🦥 Introducing Unsloth Dynamic v2.0 GGUFs!
Our v2.0 quants set new benchmarks on 5-shot MMLU and KL Divergence, meaning you can now run & fine-tune quantized LLMs while preserving as much accuracy as possible.

Llama 4: unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF
DeepSeek-R1: unsloth/DeepSeek-R1-GGUF-UD
Gemma 3: unsloth/gemma-3-27b-it-GGUF

We made selective layer quantization much smarter. Instead of modifying only a subset of layers, we now dynamically quantize all layers, so each layer can get its own bit width. Our dynamic method can now be applied to all LLM architectures, not just MoEs.

Blog with Details: https://docs.unsloth.ai/basics/dynamic-v2.0

All our future GGUF uploads will leverage Dynamic 2.0 and our hand-curated 300K–1.5M token calibration dataset to improve conversational chat performance.

For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, QAT and standard iMatrix quants.

Dynamic v2.0 aims to minimize the performance gap between full-precision models and their quantized counterparts.
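
For readers unfamiliar with KL Divergence as a quantization metric, here is a minimal sketch of the idea: compare the next-token distribution of the full-precision model with that of the quantized model. The logits below are made up, and this is not Unsloth's evaluation framework, just the textbook formula over a toy 4-token vocabulary.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference (full-precision) distribution."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Made-up logits for one token position over a 4-token vocabulary.
full_precision_logits = [2.0, 1.0, 0.1, -1.0]
quantized_logits = [1.9, 1.1, 0.0, -0.8]

p = softmax(full_precision_logits)
q = softmax(quantized_logits)
print(f"KL(full || quantized) = {kl_divergence(p, q):.6f}")
```

A value close to 0 means the quantized model predicts almost the same distribution as the full-precision one. In practice this would be averaged over many token positions from an evaluation text.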

philschmid

posted an update about 1 month ago

Post

2819

Gemini 2.5 Flash is here! We're excited to launch our first hybrid reasoning Gemini model. In 2.5 Flash, developers can turn thinking off.

**TL;DR:**
- 🧠 Controllable "Thinking" with a thinking budget of up to 24k tokens
- 🌌 1 million token multimodal input context for text, image, video, audio, and pdf
- 🛠️ Function calling, structured output, google search & code execution.
- 🏦 $0.15 per 1M input tokens; $0.6 or $3.5 (thinking on) per 1M output tokens (thinking tokens are billed as output tokens)
- 💡 Knowledge cutoff of January 2025
- 🚀 Rate limits - Free 10 RPM 500 req/day
- 🏅Outperforms 2.0 Flash on every benchmark
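
To see what the pricing means per request, here is a small sketch that applies the rates listed above. The function name and the example token counts are made up, and prices may have changed since this post.

```python
# Rates from the list above, in USD per 1M tokens.
INPUT_PER_M = 0.15
OUTPUT_PER_M = 0.60
OUTPUT_THINKING_PER_M = 3.50

def estimate_cost(input_tokens, output_tokens, thinking=False):
    """Estimate the cost of one request in USD.

    With thinking on, output_tokens should include the thinking tokens,
    since they are billed as output tokens.
    """
    output_rate = OUTPUT_THINKING_PER_M if thinking else OUTPUT_PER_M
    return (input_tokens * INPUT_PER_M + output_tokens * output_rate) / 1_000_000

# Hypothetical request: 10k input tokens, 2k output tokens.
print(f"thinking off: ${estimate_cost(10_000, 2_000):.4f}")
print(f"thinking on:  ${estimate_cost(10_000, 2_000, thinking=True):.4f}")
```

Note that with thinking on, the real output count will usually be higher than with thinking off, because the thinking tokens are added on top of the visible answer.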

Try it ⬇️
https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-flash-preview-04-17

1 reply

gg-tt

AI & ML interests

Recent Activity

gg-tt's activity

The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

CORD-19: The COVID-19 Open Research Dataset

Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

The Leaderboard Illusion

Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting

Masakhane -- Machine Translation For Africa

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

AI4D -- African Language Program

Cross-lingual transfer of multilingual models on low resource African Languages

From N-grams to Pre-trained Multilingual Models For Language Identification

BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Team members 97
