All HF Hub posts

SeaWolf-AI 
posted an update 1 day ago
🚀 Introducing FINAL-Bench Quantum — an open, neutral benchmark that finally puts quantum-computing methods on one fair yardstick.

Quantum results are notoriously hard to compare. The same "logical error rate" or "query fidelity" means very different things depending on the code, noise model, hardware, and shot count. FINAL-Bench Quantum fixes that: five events judged under identical, published protocols, where every number is labeled as either measured here or quoted from a source.

Five events: ① QEC Decoder ② Optimization (Max-Cut) ③ VQE ④ QRAM ⑤ Quantum Simulation

The rules are simple and strict:
✅ Track A (measured here, with 95% confidence intervals) is kept separate from Track B (quoted from papers, not directly comparable).
🔬 Simulation and real hardware are clearly distinguished, and no quantum-advantage claims are made.
🌍 Methods from Google, IBM, NVIDIA, USTC, Riverlane and more sit side by side, with origin flags and author credits.
📤 Anyone can submit their own method via the Submit tab for review and listing.
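For readers wondering what a Track A interval could look like: the post does not say which interval method the benchmark uses, so the sketch below assumes a Wilson score interval on a binomial error rate estimated from shot counts. The function name `wilson_interval` and the example counts are made up for illustration.

```python
import math


def wilson_interval(failures: int, shots: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial rate (z = 1.96 gives roughly 95%)."""
    if shots <= 0:
        raise ValueError("shots must be positive")
    if not 0 <= failures <= shots:
        raise ValueError("failures must lie in [0, shots]")
    p = failures / shots
    denom = 1.0 + z * z / shots
    center = (p + z * z / (2.0 * shots)) / denom
    half = z * math.sqrt(p * (1.0 - p) / shots + z * z / (4.0 * shots * shots)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts, not taken from the leaderboard.
low, high = wilson_interval(failures=12, shots=10_000)
print(f"error rate 95% CI: [{low:.5f}, {high:.5f}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and remains usable when very few failures are observed, which is the common case for low logical error rates.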

Already on the board: real IBM Heron r2 measurements (repetition-code distance boundary, 29–175× error reduction from d3 to d5), a real-chip QRAM query fidelity of 0.92, and H₂ VQE at chemical accuracy — always labeled honestly as simulation vs hardware.

A leaderboard is only useful if you can trust it, so neutrality is the whole point: strong competitors stay in even when they beat the host, sources are quoted faithfully, and a simulation is never rounded up into a hardware claim.

Leaderboard: FINAL-Bench/quantum-bench-leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/quantum-leaderboard

#quantum #QEC #QuantumComputing #benchmark
SeaWolf-AI 
posted an update about 14 hours ago
Darwin V9 — GPQA Diamond 90.9%, #1 on the leaderboard, with pure greedy decoding
Darwin-398B-JGOS reaches 90.9% (180/198) on GPQA Diamond, the PhD-level scientific reasoning benchmark, ranking #1 on the Hugging Face GPQA Diamond leaderboard. No self-consistency, no test-time compute scaling — this was achieved with a single greedy decode (temperature 0, single sample, max 16,384 tokens). The full eval config is published in the model card, so anyone can reproduce it. Raw reasoning, no score inflation.
The result comes from Darwin V9, a patented evolutionary model-development platform. Its core idea: it never trains a model from scratch.
Why Darwin V9 beats training from scratch

Cost & speed: no trillion-token pretraining run, no months of compute — a purpose-built, high-performance model is produced in a fraction of the time.
Reuse of proven intelligence: instead of re-learning every capability from a blank slate, it selects and combines only the strengths of already-trained, already-validated models, so results are stable and predictable.
Surgical transplantation: it identifies which neural region of which model holds which capability — at the FFN (Feed Forward Network) layer level — and grafts in only the segments that contribute to the target skill.

How it works: a large model (Qwen 3.5 397B) serves as the mother model (the substrate); several father models specialized in reasoning, coding, and language are analyzed layer-by-layer across their FFN regions; the segments that contribute to the target performance are extracted and transplanted into the mother model to produce a new child model. The result is a ~400B MoE that activates only ~17B parameters per token at inference — large-model capacity with efficient inference.
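As a rough illustration of layer-level grafting only: the post does not publish how Darwin V9 selects or merges segments, so everything in this sketch is an assumption, including the `layers.<i>.ffn.` key naming, the `graft_ffn` helper, and the choice to copy whole FFN tensors for a hand-picked set of layers. Plain Python lists stand in for weight tensors.

```python
def graft_ffn(mother: dict, father: dict, layers: list) -> dict:
    """Return a child state dict: the mother's weights, with the FFN tensors of
    the chosen layers replaced by the father's. Hypothetical key layout."""
    child = dict(mother)  # shallow copy, so the mother dict is left untouched
    for layer in layers:
        prefix = f"layers.{layer}.ffn."
        for name, tensor in father.items():
            if name.startswith(prefix):
                if name not in mother:
                    raise KeyError(f"{name} has no counterpart in the mother model")
                child[name] = tensor
    return child


# Toy weights (assumption: both models share one layout).
mother = {"layers.0.ffn.w": [1], "layers.0.attn.w": [2], "layers.1.ffn.w": [3]}
father = {"layers.0.ffn.w": [9], "layers.0.attn.w": [8], "layers.1.ffn.w": [7]}
child = graft_ffn(mother, father, layers=[1])
```

Only the FFN entries of layer 1 change in `child`; attention weights and every other layer still come from the mother.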
If training from scratch means rebuilding everything from a blank page, Darwin V9 means precisely recombining intelligence that has already been proven. GPQA Diamond #1 is the proof.
Model: FINAL-Bench/Darwin-398B-JGOS
OzTianlu 
posted an update 2 days ago
ResNet is Explicit Euler. GPT is Implicit Euler. What Else is Hiding in Plain Sight?

Read online: https://datawhalechina.github.io/learning-terrain/

I wrote an open-source monograph on learning dynamics — The Terrain of Learning. Bilingual (Chinese/English), 4 volumes, 12 chapters, 30+ print-grade figures. Completely free (CC BY-NC-SA 4.0).

The core argument: gradient descent is not optimization. It's terrain motion. The loss function is a landscape. The gradient is the direction of slope. The optimizer is how you choose each step. Once you see it this way, everything clicks:

ResNet = explicit Euler integration on a vector field. The residual branch is the vector field. Each layer takes one Euler step.
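A minimal sketch of that correspondence, assuming a toy residual branch f(x) = -x in place of a learned one (the names `vector_field`, `residual_block`, and `forward` are illustrative). Each block computes x + h·f(x), which is one explicit Euler step of dx/dt = f(x).

```python
import math


def vector_field(x: list) -> list:
    """Toy residual branch f(x) = -x (a real ResNet would learn this)."""
    return [-xi for xi in x]


def residual_block(x: list, h: float) -> list:
    """One residual connection: x + h * f(x), i.e. one explicit Euler step."""
    return [xi + h * fi for xi, fi in zip(x, vector_field(x))]


def forward(x: list, depth: int, h: float) -> list:
    """Stack `depth` blocks, integrating dx/dt = f(x) up to time depth * h."""
    for _ in range(depth):
        x = residual_block(x, h)
    return x


# 100 blocks with step 0.01 integrate to t = 1; the exact solution is exp(-1).
approx = forward([1.0], depth=100, h=0.01)[0]
print(approx, math.exp(-1.0))
```

With this toy field the stacked blocks land close to exp(-1), and the gap shrinks as the step gets smaller and the stack gets deeper.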

GPT autoregression = implicit-state Euler iteration. Stable where explicit Euler explodes. That's why transformers handle long-range dependencies.
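The stability gap between the two schemes can be seen on a stiff linear ODE, dx/dt = -λx. This sketch illustrates only that standard numerical-analysis fact; it does not model a transformer, and the values of λ and h are chosen for illustration.

```python
def explicit_euler(x0: float, lam: float, h: float, steps: int) -> float:
    """x_next = x + h * f(x) with f(x) = -lam * x."""
    x = x0
    for _ in range(steps):
        x = x + h * (-lam * x)
    return x


def implicit_euler(x0: float, lam: float, h: float, steps: int) -> float:
    """x_next = x + h * f(x_next); for this linear f it solves in closed form."""
    x = x0
    for _ in range(steps):
        x = x / (1.0 + h * lam)
    return x


# h * lam = 5 is far outside the explicit stability region (h * lam < 2).
print(explicit_euler(1.0, lam=50.0, h=0.1, steps=20))  # magnitude grows each step
print(implicit_euler(1.0, lam=50.0, h=0.1, steps=20))  # decays toward 0
```

The true solution decays to zero. Explicit Euler multiplies by (1 - hλ) = -4 at every step and diverges, while implicit Euler multiplies by 1/(1 + hλ) and decays for any positive step size.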

DEQ = the Banach fixed-point theorem in production. The forward pass is root-finding. There are no layers to backprop through.
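A scalar sketch of the fixed-point view, assuming a toy layer z ↦ tanh(w·z + x) with |w| < 1 so that it is a contraction and the Banach theorem applies. Real DEQs typically use faster root-finders and implicit differentiation for gradients; plain iteration is used here for clarity, and all names are illustrative.

```python
import math


def layer(z: float, x: float, w: float = 0.5) -> float:
    """Toy layer. tanh is 1-Lipschitz, so the map contracts with factor |w|."""
    return math.tanh(w * z + x)


def deq_forward(x: float, w: float = 0.5, tol: float = 1e-12, max_iter: int = 1000) -> float:
    """Iterate z <- layer(z, x) until it stops moving; the limit is the output."""
    z = 0.0
    for _ in range(max_iter):
        z_next = layer(z, x, w)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


z_star = deq_forward(1.0)
print(z_star, layer(z_star, 1.0))  # the two values agree at the fixed point
```

The output is defined by the equation z = layer(z, x) rather than by a fixed number of stacked layers, which is why the depth of the iteration does not appear in the result.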

KL divergence = a Bregman divergence on the entropy landscape. Your belief space is curved, not flat.
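The identity can be checked numerically. The Bregman divergence generated by a convex function φ is D_φ(p, q) = φ(p) - φ(q) - ⟨∇φ(q), p - q⟩; taking φ as negative entropy, φ(p) = Σ p_i log p_i, recovers KL(p‖q) when both vectors sum to 1. The sketch below assumes strictly positive q; the function names are illustrative.

```python
import math


def kl(p: list, q: list) -> float:
    """KL(p || q) for discrete distributions, with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def neg_entropy(p: list) -> float:
    """phi(p) = sum_i p_i log p_i, the convex generator."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)


def bregman_neg_entropy(p: list, q: list) -> float:
    """phi(p) - phi(q) - <grad phi(q), p - q>, with grad phi(q)_i = log q_i + 1."""
    grad_q = [math.log(qi) + 1.0 for qi in q]
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_q, p, q))
    return neg_entropy(p) - neg_entropy(q) - inner


p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
print(kl(p, q), bregman_neg_entropy(p, q))  # equal up to round-off
```

Off the simplex the two differ by Σq_i - Σp_i, which vanishes for normalized distributions.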

Chain-of-thought reasoning = hidden states flowing along a reasoning field toward an attractor basin. Correct answers have wide basins. The number of reasoning steps is determined by the terrain, not by the problem.

Diffusion models = systems flowing downhill along a score vector field, from noise to structure, from high energy to low energy.
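A one-dimensional toy of flowing along a score field, under strong assumptions: the target is a single Gaussian whose score, the gradient of log p(x), is known in closed form, and no noise is injected. Real diffusion samplers use a learned, noise-level-dependent score and usually add noise at each step. The names and constants are illustrative.

```python
def gaussian_score(x: float, mu: float = 2.0, sigma: float = 1.0) -> float:
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / (sigma * sigma)


def follow_score(x0: float, step: float = 0.1, n_steps: int = 200) -> float:
    """Deterministic ascent on log-density, i.e. descent on the energy -log p."""
    x = x0
    for _ in range(n_steps):
        x = x + step * gaussian_score(x)
    return x


print(follow_score(5.0))   # approaches mu = 2.0 from above
print(follow_score(-4.0))  # approaches mu = 2.0 from below
```

Without injected noise every starting point collapses onto the mode; adding Gaussian noise at each step turns this into Langevin dynamics, which samples from the distribution instead.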

The book traces one idea across 337 years — from F=ma (Newton, 1687) to H=T+V (Hamilton, 1833) to loss landscape + gradient field (2020s). Hamilton replaced a catalog of forces with one geometric object. This book does the same for deep learning.

GitHub: https://github.com/datawhalechina/learning-terrain
Discussion: https://github.com/datawhalechina/learning-terrain/discussions/2

Convergence is not hope. Convergence is geometry. You see.
kanaria007 
posted an update 1 day ago
✅ Article highlight: *Adversaries, Data Poisoning, and Incentive Governance for Training Worlds* (art-60-171, v0.1)

TL;DR:
This article argues that training worlds become adversarial markets.

If gameplay data trains agents, players, UGC authors, operators, and supply-chain actors will try to shape the data. If labels and rewards shape what gets learned, then labels and rewards are governance surfaces too. Article 171 turns data poisoning and incentive gaming into receipted lifecycles.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• makes “training set T is admissible for run R” a governed claim
• treats poisoning as a caseable process, not a vague abuse report
• fails closed when monitoring is unhealthy or detector drift is detected
• treats labels, rewards, collusion, and sybil pressure as governance problems
• connects data integrity to courts, appeals, and bounded publication

What’s inside:
• training substrate governance contracts
• adversary taxonomy for players, UGC, operators, and supply-chain actors
• quarantine → adjudication → inclusion / exclusion pipeline
• monitoring SLOs, monitor health receipts, and detector drift incidents
• label economy contracts and reward distribution receipts
• anti-sybil and collusion monitoring
• admissibility verdict receipts for deciding what may train the next run
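The quarantine → adjudication → inclusion / exclusion flow, together with the fail-closed rule, can be sketched as a small decision function. This is an illustrative reading of the bullets above, not the article's schema: the `Batch` fields, the `Verdict` names, and the `admit` function are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    INCLUDED = "included"
    EXCLUDED = "excluded"
    QUARANTINED = "quarantined"


@dataclass(frozen=True)
class Batch:
    batch_id: str
    flagged: bool                      # a detector raised a poisoning flag
    adjudicated_clean: Optional[bool]  # None means adjudication has not happened yet


def admit(batch: Batch, monitor_healthy: bool) -> Verdict:
    """Decide whether a batch may train the next run."""
    if not monitor_healthy:
        # Fail closed: without healthy monitoring, nothing is admitted.
        return Verdict.QUARANTINED
    if not batch.flagged:
        return Verdict.INCLUDED
    if batch.adjudicated_clean is None:
        return Verdict.QUARANTINED  # flagged data waits for adjudication
    return Verdict.INCLUDED if batch.adjudicated_clean else Verdict.EXCLUDED


print(admit(Batch("b1", flagged=False, adjudicated_clean=None), monitor_healthy=False))
```

An unflagged batch is still quarantined while the monitor is unhealthy, which is the point of failing closed: a missing alarm is not treated as evidence of clean data.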

Key idea:
Do not say:

*“we filtered poisoned data.”*

Say:

*“this substrate was admitted under this governance contract, adversary taxonomy, monitoring SLO, quarantine/adjudication trail, label economy, reward policy, and admissibility verdict.”*

Data and rewards are governance with receipts.
kasbsquall 
posted an update 4 days ago
🔎 UX Crime Scene — every interface hides a crime.

Drop a screenshot of ANY website or app, and THE INSPECTOR — a film-noir detective — works it as a crime scene: he circles each UX flaw on the real pixels, names the charge, and files a verdict with a letter grade. A UX audit that plays like a detective thriller.

But the verdict is just the opening statement. Now it goes further:

⚖️ THE TRIAL — put the interface on trial. The guilty UI elements take the stand and defend themselves while the Inspector rules from the evidence.
🖼️ THE RECONSTRUCTION — one click and FLUX.2 Klein rebuilds the worst element FIXED, live. Before/after, on the real pixels.
🔊 THE VOICE — hear the verdict read aloud (Kokoro, local, no keys).
🚨 MOST WANTED — a public rogues' gallery. Book your case onto a shared board where the city's worst interfaces are ranked by their crimes. Booked by the public.

Three small models, all on Modal (scale-to-zero), none over 32B:
👁️ Qwen2.5-VL-7B (vision agent) · 🖼️ FLUX.2 Klein (reconstruction) · 🔊 Kokoro-82M (voice)

📊 Human-graded: 84% grounding / 92% valid charges.

▶️ Trailer: https://youtu.be/6u58YIEPrkA
📹 Full walkthrough: https://youtu.be/WyQbY0XJ_9E
🕵️ Try it: build-small-hackathon/ux-crime-scene

Built solo for #BuildSmallHackathon (Gradio × Hugging Face). Open the case — the Inspector is waiting.
Jiaqi-hkust 
posted an update 4 days ago
🚀 Introducing Robust-U1: Teaching MLLMs to Self-Recover Corrupted Visual Content

Multimodal Large Language Models (MLLMs) have achieved impressive visual understanding, yet they remain highly brittle under real-world corruptions—noise, blur, compression artifacts, adverse weather.

Standard MLLMs suffer dramatic performance drops, and existing robustness solutions come with fundamental limits: black‑box feature alignment lacks interpretability, while white‑box text reasoning cannot restore the lost pixel‑level visual details. This raises a crucial question:

🧐 Can MLLMs recover corrupted visual content by themselves?

If the answer is yes, we can move beyond merely “compensating” for corruption and instead build a more intrinsic, generalizable form of resilience. Robust-U1 is our answer to that question.

💡 Paper: https://arxiv.org/abs/2606.08063
🔗 Code: github.com/jqtangust/Robust-U1
🌍 Demo: Jiaqi-hkust/Robust-U1

kingkw1 
posted an update about 14 hours ago
I built Read-Along AI for the Hugging Face Build Small Hackathon.

It is an offline-capable reading practice app for early readers: one short sentence at a time, tap-to-hear word help, record a read-aloud attempt, then get gentle feedback.

The goal is Backyard AI in the literal sense: a tool for real home reading practice, where feedback needs to be patient, developmentally fair, and private. A child’s voice should not need to leave the app just to practice “The dog ran fast.”

What makes it small-model native:

- Exact clean readings pass immediately.
- Close or ambiguous child-speech transcripts get a second look from a fine-tuned MiniCPM phonetic evaluator.
- Meaning-changing mistakes still fail closed, e.g. “blue hat” should not pass for “red hat.”
- Off the Grid Mode runs local ASR plus the MiniCPM GGUF evaluator through llama.cpp.
- Turbo Mode uses Modal endpoints for lower-latency ASR/TTS/evaluation.
- The UI is custom Gradio with a child-facing reading canvas, clickable words, progress feedback, and celebration on success.
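The first three bullets describe a tiered gate, which might look roughly like the sketch below. It is not the app's code: the `grade` function, the 0.8 similarity threshold, and the `second_look` callable (standing in for the fine-tuned MiniCPM evaluator) are assumptions, and character-level similarity from `difflib` is only a crude proxy for phonetic closeness.

```python
import difflib
import re
from typing import Callable


def normalize(text: str) -> list:
    """Lowercase and keep word tokens only, so punctuation and case are ignored."""
    return re.findall(r"[a-z']+", text.lower())


def grade(target: str, transcript: str,
          second_look: Callable[[str, str], bool],
          threshold: float = 0.8) -> str:
    want, got = normalize(target), normalize(transcript)
    if want == got:
        return "pass"  # exact clean reading passes immediately
    if len(want) != len(got):
        return "retry"  # missing or extra words fail closed
    for w, g in zip(want, got):
        if w != g and difflib.SequenceMatcher(None, w, g).ratio() < threshold:
            return "retry"  # a clearly different word fails closed
    # Only near-misses reach the (hypothetical) evaluator for a second look.
    return "pass" if second_look(target, transcript) else "retry"


lenient = lambda target, transcript: True  # stand-in for the evaluator
print(grade("The dog ran fast.", "the dog ran fast", lenient))
print(grade("red hat", "blue hat", lenient))
```

The ordering matters: the cheap exact check and the fail-closed check run before the model is consulted, so a permissive evaluator cannot wave through "blue hat" for "red hat".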

Targeted tracks and badges:
Backyard AI, Off-Brand, Off the Grid, Llama Champion, Well-Tuned, Tiny Titan, Sharing is Caring, Field Notes.

Space:
build-small-hackathon/read-along-ai

Demo video:
[add YouTube URL after upload]

Repo:
https://github.com/kingkw1/read-along-ai

Built with Codex as the lead development partner.
YMRohit 
posted an update about 15 hours ago
A 1B model that writes GPU kernels you can trust

I fine-tuned OpenBMB's MiniCPM5-1B to write Triton GPU kernels, then let an immutable referee decide if they are real: compile, check correctness against PyTorch on adversarial inputs, time against eager, torch.compile, and torch.compile max-autotune, then block the known ways of gaming the benchmark.

The 1B setup beat torch.compile max-autotune in 12/12 independently seeded runs. The larger Qwen3.6-27B smith pushed the same referee loop further: 76 verified compiler-beating kernels on H200, with 69 surviving a 5-run stability gate and 7 kept as single-shot probes on unseen problems. On a 376-cell shape/dtype grid, the stability-gated kernels keep a 1.49x geomean speedup; about 10% of cells lose, and results are reported per cell.
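To make the referee idea concrete, here is a plain-Python sketch of its two core checks, correctness against a reference and timing, plus the geometric mean used to summarize per-cell speedups. It is not the project's referee: the function names are hypothetical, Python lists stand in for tensors, and the real pipeline compiles Triton kernels and compares against PyTorch on a GPU.

```python
import math
import time
from typing import Callable, Sequence


def is_correct(candidate: Callable, reference: Callable,
               inputs: Sequence, tol: float = 1e-6) -> bool:
    """Reject a candidate that crashes or disagrees with the reference on any input."""
    for x in inputs:
        try:
            got = candidate(x)
        except Exception:
            return False
        want = reference(x)
        if len(got) != len(want):
            return False
        # math.isclose is False for NaN, so NaN outputs are rejected too.
        if not all(math.isclose(g, w, rel_tol=tol, abs_tol=tol) for g, w in zip(got, want)):
            return False
    return True


def median_time(fn: Callable, x, repeats: int = 5) -> float:
    """Median wall-clock time of fn(x) over a few repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]


def geomean(speedups: Sequence) -> float:
    """Geometric mean: a 2x win and a 0.5x loss average to exactly 1x."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

A speedup would only be recorded for candidates that pass `is_correct` first. The geometric mean is the natural summary for ratios, since an arithmetic mean would let one large win hide several losing cells.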

Honest bound: these are scheduling wins on memory-bound ops, not new algorithms or wins over cuBLAS/FlashAttention. The scarce thing is not the big model, it is the verifier it cannot fool.

Full write-up: https://huggingface.co/blog/YMRohit/ouroboros-kernel-mint
Try it: build-small-hackathon/ouroboros-kernel-mint
2-min demo: https://youtu.be/ViicZHktb-A

Built for #BuildSmallHackathon with MiniCPM, Qwen, Triton, Gradio, Codex, and Modal H200s.
etemiz 
posted an update about 21 hours ago